Ransomware datasets | Cyber Attack And Def

UGRansome Dataset

The Kaggle notebook “Ransom Analysis Using ML and DL” by Sripad Karthik loads the UGRansome dataset and preprocesses it: cleaning, handling missing values, label encoding, and normalizing the structured network-traffic features related to ransomware and anomalous behavior. The code then runs exploratory data analysis to understand feature distributions and attack patterns, and applies both traditional machine learning models and deep learning approaches to classify normal versus ransomware-related traffic. Model performance is evaluated on a train/test split using accuracy, loss curves, and classification metrics, so the notebook serves as a hybrid pipeline that compares ML and DL methods for ransomware and anomaly detection in network traffic data.

The UGRansome Dataset on Kaggle was created by Dr. Mike Wa Nkongolo and was originally developed in 2021 as a cybersecurity dataset designed for detecting ransomware, zero-day attacks, and anomalous network behavior. It contains structured network traffic data that includes both normal and malicious activities, with multiple attributes representing behavioral and statistical patterns of network flows, enabling classification of threats such as botnet activity, spam, port scanning, and ransomware-related anomalies. The dataset is specifically built for intrusion detection and anomaly detection research, and it is widely used in machine learning and deep learning studies to identify modern cyber threats that are not detectable through traditional signature-based methods, particularly focusing on advanced persistent threats and zero-day ransomware scenarios.

Code 1

Dataset Summary

Code Using the Dataset

The Kaggle notebook “UGRansome Machine Learning Python Notebook” by Dr. Mike Wa Nkongolo loads the UGRansome dataset, performs basic exploratory analysis to understand the dataset structure and class distribution, and preprocesses the structured network-traffic features through cleaning (e.g., handling missing values), label encoding, and scaling to make the data suitable for modeling. It then trains and evaluates multiple supervised machine learning classifiers to distinguish benign traffic from ransomware/attack-related behavior, using a train/test split and standard evaluation outputs such as accuracy, confusion matrices, and classification reports. Overall, the notebook demonstrates an end-to-end ML workflow for ransomware and anomaly detection using flow-based network features rather than text-based inputs.
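The workflow described above (cleaning, label encoding, scaling, a train/test split, and comparing classifiers) can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the data is synthetic, the column names (`protocol`, `bytes`, `duration`, `label`) are hypothetical, and the two classifiers are assumed stand-ins for the "multiple supervised classifiers" the notebook trains.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for flow records; column names are hypothetical.
label = rng.choice(["benign", "ransomware"], size=n)
df = pd.DataFrame({
    "protocol": rng.choice(["TCP", "UDP", "ICMP"], size=n),
    "bytes": rng.normal(500, 100, size=n) + np.where(label == "ransomware", 300, 0),
    "duration": rng.exponential(10, size=n),
    "label": label,
})
df.loc[rng.choice(n, size=10, replace=False), "bytes"] = np.nan  # simulate gaps

# Cleaning: fill missing numeric values with the column median.
df["bytes"] = df["bytes"].fillna(df["bytes"].median())

# Label-encode the categorical feature and the target.
df["protocol"] = LabelEncoder().fit_transform(df["protocol"])
y = LabelEncoder().fit_transform(df["label"])  # benign -> 0, ransomware -> 1
X = df[["protocol", "bytes", "duration"]].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
    }
    print(name, round(results[name]["accuracy"], 3))
    print(results[name]["confusion_matrix"])
```

Because the classes here are separated artificially through the `bytes` column, the printed scores only show that the pipeline runs; they say nothing about performance on the real UGRansome data.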

Code 2

The Kaggle notebook “UGRansome LLM (BERT)” by Ntando Yenkosi Ndlovu loads the UGRansome dataset, renames/standardizes columns, and then converts each tabular network-flow row into a single natural-language “Text” string (e.g., combining fields like protocol, flags, family, bytes, threats, and port). It encodes the target labels into integers, splits the data into train/test sets, and fine-tunes a BERT sequence-classification model (Hugging Face Trainer) to predict the ransomware/threat class from the generated text representation. The notebook also integrates Weights & Biases (wandb) for experiment tracking and uses Captum Integrated Gradients to provide interpretability by highlighting which parts of the generated text contribute most to the model’s predictions.
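The row-to-text step is the part of this approach that differs most from a conventional tabular pipeline, so a small sketch may help. The column names, the placeholder values, and the helper `row_to_text` below are all hypothetical; the notebook's real template and field order may differ. The sketch covers only text generation and integer label encoding, not the BERT fine-tuning itself.

```python
import pandas as pd

# Toy rows for illustration only; names and values are placeholders.
df = pd.DataFrame({
    "Protocol": ["TCP", "UDP"],
    "Flag": ["A", "AF"],
    "Family": ["family_a", "family_b"],
    "Bytes": [1024, 256],
    "Threats": ["threat_x", "threat_y"],
    "Port": [5061, 5068],
    "Target": ["class_a", "class_b"],
})

feature_cols = ["Protocol", "Flag", "Family", "Bytes", "Threats", "Port"]


def row_to_text(row: pd.Series, cols: list[str]) -> str:
    """Serialize one tabular row as a 'name: value' string for a text model."""
    return ", ".join(f"{col}: {row[col]}" for col in cols)


# One natural-language-like string per flow record.
df["Text"] = df.apply(row_to_text, axis=1, cols=feature_cols)

# Map each class name to a stable integer id (sorted for reproducibility).
label2id = {name: i for i, name in enumerate(sorted(df["Target"].unique()))}
df["label"] = df["Target"].map(label2id)

print(df["Text"].iloc[0])
# Protocol: TCP, Flag: A, Family: family_a, Bytes: 1024, Threats: threat_x, Port: 5061
```

The resulting `Text` and `label` columns are the kind of input a sequence-classification tokenizer and trainer would consume. Keeping the field names inside the string gives the model a consistent cue about what each value means.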

Ransomware Detection Dataset

Dataset

Code 1

Dataset Summary

The Ransomware Detection Data Set on Kaggle was published by Amdjed Bensalah. Based on Kaggle’s dataset listing (“updated 3 years ago”), its public release is best dated to around 2023. The dataset is a labeled, tabular ransomware-detection corpus intended for supervised ML, containing features extracted from Windows executables and/or their observable activity. Kaggle describes it as features extracted from Windows Portable Executable (PE) files for distinguishing benign from malicious (ransomware) samples, while some academic uses of the same Kaggle dataset describe the features as reflecting process actions, file modifications, and network activity with corresponding labels.

Code 2

Code 3

The Kaggle notebook “XGBoost Ransomware Detection and Classification” by Amdjed Bensalah loads the ransomware detection dataset, performs basic dataset inspection (file listing, shapes, and quick viewing of features/labels), and prepares the input matrix by selecting feature columns (skipping the first two columns and using the last column as the target label). It then trains an XGBoost (XGBClassifier) model, evaluates it using repeated stratified k-fold cross-validation (reporting mean accuracy and standard deviation), fits the model on a train split, and generates test-set predictions that are displayed alongside the true labels. Finally, the notebook runs an additional experiment that compares different numbers of trees (n_estimators) in XGBoost using cross-validation and visualizes the performance distribution with a boxplot.
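The column selection and the repeated stratified k-fold comparison described above can be sketched as below. To keep the example dependent only on scikit-learn, it swaps in `GradientBoostingClassifier` for XGBoost's `XGBClassifier`; that substitution, the synthetic data, and the column names (`file_name`, `md5`, `f0`..`f7`, `label`) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic table shaped like "two identifier columns, features, label".
X_raw, y_raw = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)
df = pd.DataFrame(X_raw, columns=[f"f{i}" for i in range(8)])
df.insert(0, "md5", [f"hash_{i}" for i in range(len(df))])
df.insert(0, "file_name", [f"sample_{i}.exe" for i in range(len(df))])
df["label"] = y_raw

# Skip the first two columns; use the last column as the target.
X = df.iloc[:, 2:-1].to_numpy()
y = df.iloc[:, -1].to_numpy()

# 5 folds repeated twice gives 10 accuracy scores per configuration.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

scores_by_trees = {}
for n_trees in (10, 50, 100):
    model = GradientBoostingClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    scores_by_trees[n_trees] = scores
    print(f"n_estimators={n_trees}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Each entry of `scores_by_trees` holds the per-fold scores for one tree count, which is the distribution a boxplot of the experiment would display.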

Code 1

Code Using the Dataset

The Kaggle notebook “Ransomware Detection Using ANN” by Mohammed Sulaiman loads the ransomware detection dataset and preprocesses it by removing identifier columns (e.g., file name and hash), converting selected categorical attributes into numerical codes, handling duplicates, and exporting a cleaned version of the dataset. After this preparation, the code generates a synthetic classification dataset using scikit-learn to demonstrate model training, then builds and trains an Artificial Neural Network (ANN) using the MLPClassifier to classify ransomware versus benign samples. The model is evaluated on a train/test split using accuracy, a confusion matrix visualization, and a classification report, illustrating a neural network–based pipeline for ransomware detection and performance assessment.
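A minimal sketch of the training and evaluation stage follows, using scikit-learn's `make_classification` and `MLPClassifier` as the description indicates. The sample count, layer sizes, and other hyperparameters are assumptions chosen for a quick run, not values taken from the notebook, and the class names passed to the report are only labels attached to synthetic classes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not real ransomware features).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# MLPs are sensitive to feature scale, so standardize using training statistics.
scaler = StandardScaler().fit(X_train)

clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
clf.fit(scaler.transform(X_train), y_train)

y_pred = clf.predict(scaler.transform(X_test))
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy: {acc:.3f}")
print(cm)
print(classification_report(y_test, y_pred, target_names=["benign", "ransomware"]))
```

Scores obtained on generated data like this demonstrate the mechanics of the pipeline only; measuring detection quality would require fitting the same steps to the cleaned dataset.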

Code 2

The Kaggle notebook “RandomForest Ransomware Detection and Classification” by Mohamed Cherif Bousserouel loads the ransomware detection dataset and preprocesses it by inspecting the dataset structure, selecting features, and separating the input features from the target label. The code then splits the data into training and testing sets and trains a Random Forest classifier to distinguish between ransomware and benign samples based on the extracted executable and behavioral features. Model performance is evaluated using accuracy, a confusion matrix, and classification metrics, presenting a traditional machine learning pipeline for ransomware detection that emphasizes ensemble learning on structured malware feature data.

Code 3

Ransomware Attacks Dataset

Dataset

Code 1

Dataset Summary

The Ransomware Attacks dataset on Kaggle was released in 2021 by Joakim Arvidsson and compiles roughly 360 notable ransomware incidents documented from public reporting. It is an event-style table describing each attack with fields such as the target/victim, industry/sector, organization size, the ransom demand/amount, and whether the ransom was paid. This makes it useful for descriptive analytics and risk-trend modeling of real-world ransomware campaigns rather than malware-feature classification.

Code 2

Code 3

Code Using the Dataset

Code 1

Code 2

The Kaggle notebook “Ransom Paid Predict – CatBoost + SHAP” by Dima loads the Ransomware Attacks dataset and cleans and preprocesses it by handling missing values, encoding categorical variables (e.g., industry, country, and organization attributes), and selecting relevant features related to ransomware incidents. The code then trains a CatBoost classifier to predict whether a ransom was paid based on attack-related characteristics, using a train/test split for model validation. Model performance is evaluated with accuracy and classification metrics. The notebook further applies SHAP (SHapley Additive Explanations) to interpret feature importance and explain how different factors influence the model’s ransom payment predictions, creating an interpretable machine learning pipeline for ransomware incident analysis.

The Kaggle notebook “Ransomware Attacks Data Import” by Joakim Arvidsson focuses on loading and preparing the Ransomware Attacks dataset for analysis by importing the CSV files, inspecting the dataset structure, and performing basic cleaning such as handling missing values and formatting categorical fields related to ransomware incidents. The code primarily conducts exploratory steps, including viewing columns like victim organization, industry, ransom demand, and payment status, to understand the dataset’s composition rather than building predictive models. Overall, the notebook serves as a data preparation and exploratory analysis workflow that organizes real-world ransomware incident records for subsequent statistical analysis or machine learning tasks.

Code 3

The Kaggle notebook “Ransomware Attacks EDA” by Noureddine H loads the Ransomware Attacks dataset and performs exploratory data analysis to understand patterns in real-world ransomware incidents. The code imports and cleans the dataset, examines key variables such as victim industry, organization size, country, ransom demand, and payment status, and uses visualizations to analyze trends and distributions across attacks. It generates summary statistics and plots to identify which sectors are most targeted, how ransom amounts vary, and how frequently ransoms are paid, focusing entirely on descriptive analysis rather than building predictive machine learning models.
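The three questions this kind of analysis asks (which sectors are hit most, how ransom amounts vary, and how often ransoms are paid) map onto a few pandas operations. The sketch below uses six invented rows and hypothetical column names (`sector`, `ransom_usd_m`, `ransom_paid`) purely to show the operations; none of the values come from the dataset.

```python
import pandas as pd

# Invented toy records for illustration; not taken from the real dataset.
incidents = pd.DataFrame({
    "sector": ["healthcare", "government", "healthcare", "tech", "government", "healthcare"],
    "ransom_usd_m": [1.5, None, 0.3, 10.0, 2.0, None],
    "ransom_paid": ["yes", "no", "unknown", "yes", "no", "no"],
})

# Which sectors appear most often?
sector_counts = incidents["sector"].value_counts()

# How often is the ransom paid, among incidents with a known outcome?
known = incidents[incidents["ransom_paid"].isin(["yes", "no"])]
paid_rate = (known["ransom_paid"] == "yes").mean()

# How do ransom amounts vary by sector? Missing amounts are skipped by median().
median_by_sector = incidents.groupby("sector")["ransom_usd_m"].median()

print(sector_counts)
print(f"paid rate (known outcomes): {paid_rate:.2f}")
print(median_by_sector)
```

Excluding rows with an unknown payment status before computing the rate avoids counting missing outcomes as unpaid, which would bias the estimate downward.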