Phishing datasets | Cyber Attack And Def

Phishing Email Detection Dataset

The Kaggle notebook “Phishing Email Detection” by Farel Arden loads the Phishing_Email dataset, performs basic data cleaning and class distribution analysis, and balances the classes by undersampling the majority class (safe emails). It then converts the email text into numerical features using TF-IDF vectorization and trains several supervised machine learning pipelines (including Random Forest, Support Vector Machine, and XGBoost) to classify emails as phishing or legitimate. The code evaluates the models using train/test splits, accuracy, confusion matrices, and cross-validation, and concludes with a sample prediction on a custom email to show how a trained model is applied to a new message.
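The undersampling step described above can be sketched in plain Python. This is only an illustration, not the notebook's code: the `undersample` helper, the `label` key, and the toy emails are assumptions made for the example.

```python
import random
from collections import Counter

def undersample(rows, label_key="label", seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: 6 safe emails and 2 phishing emails.
emails = (
    [{"text": f"meeting notes {i}", "label": "safe"} for i in range(6)]
    + [{"text": f"verify your account {i}", "label": "phishing"} for i in range(2)]
)
balanced = undersample(emails)
counts = Counter(row["label"] for row in balanced)
print(counts)
```

Undersampling discards data from the majority class, so it trades some training signal for balanced classes; that trade-off is usually acceptable when the majority class is large.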

The Phishing Emails dataset available on Kaggle was released in 2022 by Subhadeep Chakraborty and is intended for cybersecurity and natural language processing research on phishing detection. It consists of over 18,000 email messages, each given as the textual content of the email together with a binary label marking the message as either phishing or legitimate. This structure supports supervised machine learning by providing labeled email text for training and evaluating detection models, and the dataset has been used in studies of linguistic patterns, social engineering cues, and automated classification methods for identifying malicious email.

Code 1

Dataset Summary

Code Using the Dataset

The Kaggle notebook “Phishing Email Detection Using Deep Learning” by Kirollos Ashraf downloads the Phishing_Emails dataset, cleans the data (removing nulls, duplicates, and unnecessary columns), and visualizes the class distribution. The code preprocesses the email text using tokenization and sequence padding, encodes the labels, and then builds several deep learning models with TensorFlow/Keras, including RNN, LSTM, GRU, and Bidirectional architectures, to classify emails as phishing or legitimate. Model performance is evaluated using train/test splits and accuracy metrics, which illustrates a neural-network alternative to the traditional machine learning approaches used in the other notebooks.
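To make the tokenization and sequence padding step concrete, here is a minimal standard-library sketch. It is not the notebook's code: `build_vocab`, `encode`, the reserved ids, and the sample sentences are all assumptions for illustration. Note that this sketch pads and truncates at the end of the sequence, whereas Keras's `pad_sequences` pads and truncates at the front by default.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary words

def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocab(texts, max_words=10000):
    """Map the most frequent words to integer ids starting at 2."""
    counts = Counter(w for t in texts for w in words(t))
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}

def encode(text, vocab, max_len):
    """Turn text into a fixed-length list of ids."""
    ids = [vocab.get(w, UNK) for w in words(text)]
    ids = ids[:max_len]                         # truncate long sequences
    return ids + [PAD] * (max_len - len(ids))   # pad short sequences

# Hypothetical training texts.
train_texts = ["Verify your account now", "Lunch at noon"]
vocab = build_vocab(train_texts)
encoded = encode("Verify your password", vocab, max_len=5)
print(encoded)
```

Fixed-length integer sequences like `encoded` are what an embedding layer followed by an RNN, LSTM, or GRU layer would consume.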

Code 2

The Kaggle notebook “Phishing Email Detection Using SVM & RFC” by Elnahas loads the Phishing_Email dataset, conducts preliminary data exploration and cleaning, and converts the email text into numerical features using TF-IDF vectorization. The code then trains and compares two supervised classification models, a Support Vector Machine (SVM) and a Random Forest Classifier (RFC), to distinguish between phishing and legitimate emails. Performance is evaluated on a train/test split using accuracy scores, classification reports, and confusion matrices, so the notebook shows how two traditional machine learning algorithms compare on text-based phishing detection.
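As an illustration of what TF-IDF vectorization computes, the sketch below implements one common smoothed variant by hand: raw term counts multiplied by idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization, which is the formula scikit-learn documents as its default. The `tfidf` helper and the sample documents are assumptions for this example; a real project would normally use a library implementation such as scikit-learn's `TfidfVectorizer`.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical documents: "your" occurs in all of them, "account" in one.
docs = [
    "Verify your account now",
    "Your meeting is at noon",
    "Verify your password",
]
vectors = tfidf(docs)
print(vectors[0])
```

In the first document, the rare term `account` receives a higher weight than `your`, which appears in every document; this down-weighting of common words is the point of the idf factor.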

Phishing Site URLs Dataset

Dataset

Code 1

Dataset Summary

The Phishing Site URLs dataset on Kaggle was created and published by Tarun Tiwari and is commonly cited in research as originating around 2020. It is a large cybersecurity dataset of 549,346 URL entries, each labeled to indicate whether the link is phishing (malicious) or legitimate, which makes it suitable for supervised classification tasks in phishing detection. The dataset has two main columns: one containing the raw URL text and one containing the label (e.g., good/benign vs bad/phishing). It is often used in URL-based threat detection studies because it focuses on the lexical and structural characteristics of web links rather than on email content or attachments.
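The lexical and structural characteristics mentioned above can be turned into simple numeric features. The sketch below uses only the standard library; the `url_features` helper, the particular features chosen, and the sample URLs are assumptions for illustration and are not taken from the dataset or from any notebook.

```python
from urllib.parse import urlsplit

def url_features(url):
    """Extract a few simple lexical features from a URL string."""
    # Naive check: add a scheme when none is present so urlsplit finds the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    return {
        "length": len(url),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_sign": "@" in url,
        "host_is_ipv4": host.count(".") == 3 and host.replace(".", "").isdigit(),
        "path_depth": len([p for p in parts.path.split("/") if p]),
    }

# Hypothetical URLs, written without and with a scheme.
f1 = url_features("paypa1-secure.example.com/login/verify.php")
f2 = url_features("http://192.168.0.1/a")
print(f1)
print(f2)
```

Features like these can be fed to a classifier directly or combined with text-style features computed from the URL string.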

Code 2

Code 3

The Kaggle notebook “Phishing Detect NLP N-gram XGBoost EDA” by Vitor Gama Lemos loads the Phishing Site URLs dataset and performs exploratory data analysis (EDA), with visualizations, to examine class distribution and URL characteristics before any model is trained. It then cleans the URL text and transforms it into numerical features using TF-IDF vectorization over n-grams, which is the main feature engineering step in the notebook. An XGBoost classifier is trained on these features to distinguish phishing from legitimate URLs, with a train/test split for evaluation and performance assessed through accuracy and classification metrics. The result is an NLP-based approach to URL-level phishing detection.
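An n-gram representation of a URL can be sketched as follows. Whether the notebook uses token n-grams or character n-grams is not stated here, so this example assumes token n-grams; `url_tokens`, `ngrams`, and the sample URL are hypothetical.

```python
import re

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous token n-grams for n in [n_min, n_max]."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

# Hypothetical URL.
tokens = url_tokens("secure-login.example.com/verify")
features = ngrams(tokens)
print(features)
```

Each distinct n-gram becomes one feature column, and TF-IDF weighting is then applied to the n-gram counts in the same way as to single words.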

Code 1

Code Using the Dataset

The Kaggle notebook “Phishing Site Prediction” by Ashish Kumar Behera loads the Phishing Site URLs dataset, cleans and explores the data (including checking class balance and basic descriptive statistics), and preprocesses the URL strings so that machine learning models can learn from them. It converts URLs into numerical representations (typically via tokenization and TF-IDF style vectorization of the URL text) and then trains supervised classifiers to predict whether a URL is phishing or legitimate. The notebook evaluates the models on a train/test split with standard classification metrics (e.g., accuracy and a classification report or confusion matrix), giving a practical end-to-end pipeline for URL-based phishing detection from raw links to final predictions.
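A train/test split of labeled URLs can be sketched as below. Splitting each class separately (stratification) is an assumption of this sketch, chosen so that both sets keep the original class ratio; the notebook may use a plain random split. The `stratified_split` helper and the toy rows are hypothetical.

```python
import random

def stratified_split(rows, label_key="label", test_fraction=0.2, seed=0):
    """Split rows into train and test sets, keeping class proportions."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical toy data: 10 good URLs and 5 bad URLs.
rows = (
    [{"url": f"example.com/page{i}", "label": "good"} for i in range(10)]
    + [{"url": f"login-verify{i}.example.test", "label": "bad"} for i in range(5)]
)
train, test = stratified_split(rows)
print(len(train), len(test))
```

The model is fitted on `train` only, and all reported metrics come from `test`, so the evaluation reflects URLs the model has not seen.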

Code 2

The Kaggle notebook “Phishing Sites Detector – Complete Info” by Tarun Tiwari loads the Phishing Site URLs dataset, performs detailed exploratory data analysis to understand class distribution and dataset characteristics, and preprocesses the URL data by cleaning it and converting it into numerical features. The code then applies several supervised machine learning models (including Logistic Regression and Random Forest) to separate phishing from legitimate URLs, using feature extraction techniques suited to textual URL patterns. Models are evaluated with a train/test split, accuracy scores, confusion matrices, and classification reports, giving an end-to-end pipeline for URL-based phishing site detection and a comparison of the models.
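The evaluation measures named above (accuracy, confusion matrix, and the precision, recall, and F1 values found in a classification report) can be computed by hand as a check on library output. This is a minimal sketch; the function names, the `bad`/`good` labels, and the sample predictions are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive="bad"):
    """Return (tp, fp, fn, tn) treating `positive` as the phishing class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def report(y_true, y_pred, positive="bad"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and model predictions.
y_true = ["bad", "bad", "good", "good", "good", "bad"]
y_pred = ["bad", "good", "good", "bad", "good", "bad"]
cm = confusion_counts(y_true, y_pred)
metrics = report(y_true, y_pred)
print(cm, metrics)
```

For phishing detection, recall on the phishing class is often watched more closely than accuracy, because a missed phishing URL (a false negative) is usually more costly than a false alarm.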

Code 3

Phishing Email Dataset

Dataset

Code 1

Dataset Summary

The Phishing Email Dataset on Kaggle was created by Naser Abdullah Alam and Amith Khandakar and is referenced in academic sources as a publicly available dataset released on Kaggle around 2023–2024. The dataset is a large, multi-source email corpus compiled from several well-known collections such as Enron, SpamAssassin, Nazario, and Nigerian fraud emails, and contains tens of thousands of email samples labeled as phishing/spam or legitimate. Its contents typically include structured email attributes such as sender, receiver, subject, body text, dates, and URLs, making it suitable for supervised machine learning and deep learning research on phishing email detection, linguistic analysis, and automated cybersecurity classification tasks.
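To show how structured attributes such as sender, receiver, subject, body, date, and URLs relate to a raw message, the sketch below parses a made-up email with Python's standard `email` package. The `email_to_record` helper, the field names, the URL pattern, and the sample message are assumptions for illustration and do not describe how the dataset itself was built.

```python
import re
from email import message_from_string
from email.policy import default

# Simple pattern for http/https links; real URL extraction is more involved.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def header(msg, name):
    """Return a header as a plain string, or '' if it is missing."""
    value = msg[name]
    return str(value) if value is not None else ""

def email_to_record(raw):
    """Parse a raw email into a flat dict of text attributes."""
    msg = message_from_string(raw, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "sender": header(msg, "From"),
        "receiver": header(msg, "To"),
        "date": header(msg, "Date"),
        "subject": header(msg, "Subject"),
        "body": body.strip(),
        "urls": URL_RE.findall(body),
    }

# Hypothetical raw message.
raw = (
    "From: Support <support@example.test>\n"
    "To: user@example.test\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "Subject: Verify your account\n"
    "\n"
    "Please confirm your details at http://example.test/login today.\n"
)
record = email_to_record(raw)
print(record)
```

A table of such records, one per email plus a label column, is the kind of input that the classification notebooks for this dataset start from.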

Code 2

Code 3

Code Using the Dataset

Code 1

Code 2

The Kaggle notebook “Spam Email Classification Model Comparison” by Moawwaz Tahir loads the phishing email dataset, cleans and preprocesses the data (including handling missing values and preparing the email text fields), and explores the class distribution through basic visual analysis. The code then converts the email text into numerical features using vectorization techniques such as TF-IDF and trains several supervised machine learning models, including Naive Bayes, Logistic Regression, and Support Vector Machine, to classify emails as spam/phishing or legitimate. It evaluates and compares the models on a train/test split using accuracy scores, classification reports, and confusion matrices, showing how different traditional machine learning algorithms perform on email-based phishing and spam detection.
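Of the models named above, Naive Bayes is simple enough to sketch in a few lines. The version below is a multinomial Naive Bayes with Laplace (add-one) smoothing over raw word counts rather than TF-IDF weights; the `NaiveBayes` class and the training sentences are assumptions for illustration, and the notebook presumably relies on a library implementation instead.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical training data.
texts = [
    "verify your account now",
    "urgent password reset required",
    "lunch meeting at noon",
    "project report attached",
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]
model = NaiveBayes().fit(texts, labels)
pred_a = model.predict("please verify your password")
pred_b = model.predict("meeting at noon")
print(pred_a, pred_b)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.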

The Kaggle notebook “BERT for Phishing Email Classification” by Ivan Pilashev loads the phishing email dataset, preprocesses the email text (including cleaning, tokenization, and label encoding), and prepares the data for a transformer-based model. The code uses a pre-trained BERT model to produce contextual text embeddings and fine-tunes it on the labeled emails to classify messages as phishing or legitimate. Training uses a train/test split with validation monitoring, and performance is evaluated through accuracy and classification metrics. The notebook thus illustrates a deep learning approach that draws on contextual language understanding, in contrast to the bag-of-words style features used by traditional machine learning models.

Code 3

The Kaggle notebook “Phishing Email Analysis and Classification” by Muhammad Roshaan Riaz loads the phishing email dataset, conducts exploratory data analysis to examine class distribution and textual patterns, and preprocesses the email content through cleaning, tokenization, and vectorization to prepare it for machine learning. The code then trains multiple supervised classification models on the processed email text to distinguish phishing emails from legitimate ones, using a train/test split for evaluation. Model performance is assessed with accuracy scores, confusion matrices, and classification reports, providing an end-to-end pipeline that combines data analysis, feature extraction, and comparative model evaluation for phishing email detection.