XSS datasets & projects | Cyber Attack And Def

Cross-Site Scripting (XSS) Dataset for Deep Learning

The Kaggle notebook “XSS Detection by Machine Learning” by Prince Roy loads the Cross-Site Scripting (XSS) Dataset for Deep Learning, using the Sentence field as input text and Label as the binary target. It converts the payload strings into numeric features with a CountVectorizer that drops English stopwords (min_df=2, max_df=0.8), then performs an 80/20 train–test split and trains several supervised classifiers: Logistic Regression, AdaBoost (100 estimators), Gaussian Naive Bayes, XGBoost (100 estimators), and a Decision Tree (entropy criterion). Each model is evaluated with accuracy and F1 score, plus sensitivity, specificity, precision, and recall derived from the confusion matrix (sensitivity and recall are the same quantity), to compare how well each approach separates XSS payloads from benign inputs.
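The confusion-matrix metrics named above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the function names and the toy label lists are hypothetical, and it assumes labels are encoded as 1 for XSS and 0 for benign.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn), assuming 1 = XSS and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Avoid ZeroDivisionError when a class is absent from the sample."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # recall and sensitivity are the same rate
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = detection_metrics(y_true, y_pred)
print(metrics)
```

Specificity is the one metric here that measures how often benign inputs are left alone, which matters for a filter that should not block legitimate traffic.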

The Cross-Site Scripting (XSS) Dataset for Deep Learning on Kaggle was published by Syed Saqlain Hussain Shah. His Kaggle profile shows it as updated 6 years ago, which suggests it was uploaded around 2020. The dataset is a labeled text corpus for supervised XSS detection containing 13,686 payload-like “sentences” in two classes (XSS vs. not-XSS/benign). Research usage notes report that it was compiled from sources such as PortSwigger materials and OWASP cheat sheets to capture diverse XSS vectors for model training and evaluation.

Code 1

Dataset Summary

Code Using the Dataset

The Kaggle notebook “XSS Detection CNN” by Hing Phan loads the Cross-Site Scripting (XSS) Dataset for Deep Learning, keeps only the text and label columns, and converts each payload string into a fixed-size 100×100 grayscale “ASCII image” representation (by mapping characters to ASCII values, resizing with OpenCV, and normalizing). It then splits the data into train/validation/test sets and trains a TensorFlow/Keras 2D-CNN (three Conv2D + MaxPooling blocks followed by dense layers with a sigmoid output) for binary classification of XSS vs benign samples. The notebook evaluates performance on the test set using a classification report and confusion matrix, and also computes additional rates such as TPR, FPR, and FNR after thresholding predicted probabilities at 0.5.
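The “ASCII image” idea and the 0.5 thresholding step can be sketched without any third-party libraries. This is an illustration under stated assumptions, not the notebook's implementation: the notebook resizes with OpenCV, whereas this sketch zero-pads or truncates to reach 100×100, and the function names, the cap at 255 for non-ASCII characters, and the use of `>=` at the threshold are all hypothetical choices.

```python
SIDE = 100  # image is SIDE x SIDE, matching the 100x100 size described above


def payload_to_ascii_image(payload, side=SIDE):
    """Turn a string into a side x side grid of values in [0, 1].

    Assumption: zero-pad or truncate instead of resizing with OpenCV.
    """
    # Map each character to its code point, capped at 255 (assumed handling
    # for non-ASCII characters), and keep at most side*side values.
    codes = [min(ord(ch), 255) for ch in payload][: side * side]
    codes += [0] * (side * side - len(codes))
    pixels = [code / 255.0 for code in codes]  # normalize to [0, 1]
    return [pixels[row * side:(row + 1) * side] for row in range(side)]


def rates_at_threshold(y_true, probabilities, threshold=0.5):
    """Compute TPR, FPR, and FNR after thresholding predicted probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "FPR": fp / (fp + tn) if fp + tn else 0.0,
        "FNR": fn / (fn + tp) if fn + tp else 0.0,
    }


image = payload_to_ascii_image("<script>alert(1)</script>")
# Hypothetical model outputs, for illustration only.
rates = rates_at_threshold([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(len(image), len(image[0]), rates)
```

Padding keeps each character at a fixed pixel position, while resizing (as in the notebook) stretches short payloads to fill the whole grid; the two produce different inputs for the CNN.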

Code 2

The Kaggle notebook “XSS Detection” by Thoàn Đặng loads the Cross-Site Scripting (XSS) Dataset for Deep Learning, cleans and balances the data by sampling equal numbers of XSS and non-XSS examples, and transforms the payload text into numerical features using TF-IDF vectorization. It then trains and compares multiple supervised machine learning classifiers—including Logistic Regression, Linear SVM, and Naive Bayes-style models—to classify inputs as XSS or benign, evaluating performance with a train/test split using metrics such as accuracy, confusion matrices, and classification reports to measure how well each model detects XSS payload patterns.
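The balancing step described above, sampling equal numbers of XSS and non-XSS examples, might look like the following sketch. It assumes rows are dictionaries with the dataset's Sentence and Label fields; the function name, the fixed seed, and the toy rows are hypothetical, and the notebook's actual sampling code may differ.

```python
import random


def balance_by_downsampling(rows, label_key="Label", seed=42):
    """Sample the same number of rows from every class (size of the smallest)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    per_class = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, per_class))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Toy rows, for illustration only: 5 XSS samples and 2 benign samples.
rows = [
    {"Sentence": "<script>alert(1)</script>", "Label": 1},
    {"Sentence": "<img src=x onerror=alert(1)>", "Label": 1},
    {"Sentence": "<svg onload=alert(1)>", "Label": 1},
    {"Sentence": "<body onload=alert(1)>", "Label": 1},
    {"Sentence": "javascript:alert(1)", "Label": 1},
    {"Sentence": "hello world", "Label": 0},
    {"Sentence": "search?q=shoes", "Label": 0},
]
balanced = balance_by_downsampling(rows)
print(len(balanced), sum(r["Label"] for r in balanced))
```

Downsampling discards majority-class rows, so it trades some training data for a class ratio of exactly 1:1.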

XSS attacks Dataset

Dataset

Code 1

Dataset Summary

The XSS Attacks Dataset on Kaggle was published by Saurabh Shahane and is shown on Kaggle as being updated ~4 years ago (≈2022). It is a small, labeled dataset provided as a single CSV file (460 rows and 7 columns) intended for supervised cross-site scripting (XSS) detection tasks. The dataset’s contents consist of XSS-related samples (payload-style strings used to represent XSS attempts) paired with a label indicating whether each entry is an XSS attack versus benign/non-XSS, making it suitable for training and evaluating machine learning classifiers for XSS detection.

Code 2

Code 3

The Kaggle notebook “Cyber Security – XSS attack – 5 models” by Sohom Majumder uses the XSS attacks Dataset and builds a straightforward supervised learning comparison pipeline. The code loads the spreadsheet into Pandas, label-encodes all categorical/object columns, inspects feature–label correlations, and plots basic feature histograms for exploratory analysis. It then evaluates five classifiers—Logistic Regression, K-Nearest Neighbors, Decision Tree (CART), SVM, and Gaussian Naive Bayes—using Repeated Stratified K-Fold cross-validation (10 folds × 3 repeats) with accuracy as the scoring metric, printing each model’s mean/stdev accuracy and visualizing the results with a boxplot for side-by-side performance comparison.
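To make the 10 folds × 3 repeats scheme concrete, here is a standard-library sketch of repeated stratified k-fold splitting. It is not the notebook's code, which the description suggests relies on a library implementation. The function name, the seed, the toy labels, and the majority-class baseline used in place of the five classifiers are all assumptions for illustration.

```python
import random
import statistics
from collections import Counter, defaultdict


def repeated_stratified_kfold(labels, n_splits=10, n_repeats=3, seed=1):
    """Yield (train_idx, test_idx) pairs that preserve class proportions."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            by_class[label].append(idx)
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            rng.shuffle(indices)  # a fresh shuffle for every repeat
            for pos, idx in enumerate(indices):
                folds[pos % n_splits].append(idx)  # deal round-robin
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx


# Toy labels, for illustration only: 30 benign (0) and 20 attack (1) rows.
labels = [0] * 30 + [1] * 20
splits = list(repeated_stratified_kfold(labels))

# Hypothetical baseline: always predict the majority class of the training fold.
scores = []
for train_idx, test_idx in splits:
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    correct = sum(1 for i in test_idx if labels[i] == majority)
    scores.append(correct / len(test_idx))

print(len(splits), statistics.mean(scores), statistics.stdev(scores))
```

Reporting the mean and standard deviation over all 30 scores, as the notebook does, gives a sense of how much a model's accuracy depends on the particular split.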

Code 1

Code Using the Dataset

The Kaggle notebook “Normal Analysis” by Mohamed ElSayed Qamar loads the XSS Attacks Dataset and performs a descriptive exploratory analysis rather than model training. The code reads the dataset into Pandas, inspects its structure, and computes unique counts for key fields (e.g., API Name, App Names, Website Name, Permissions, and Label). It then analyzes and visualizes the top permissions (by splitting comma-separated permissions and counting them), API call frequency, most frequent websites, and geographic distribution (from the Location field) using Seaborn bar plots, and it also plots the Label distribution with a countplot while checking for missing values in core columns.
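The permission-counting step (splitting comma-separated values and tallying them) can be sketched with the standard library alone. The Permissions field name comes from the description above; the function name and the sample records are hypothetical, and the notebook itself works on a Pandas DataFrame rather than a list of dictionaries.

```python
from collections import Counter


def top_permissions(records, field="Permissions", n=3):
    """Count individual permissions across comma-separated field values."""
    counts = Counter()
    for record in records:
        raw = record.get(field) or ""  # treat missing values as empty
        counts.update(p.strip() for p in raw.split(",") if p.strip())
    return counts.most_common(n)


# Sample records, for illustration only.
records = [
    {"Permissions": "camera, location"},
    {"Permissions": "location"},
    {"Permissions": "location, microphone, camera"},
    {"Permissions": None},
]
top = top_permissions(records, n=2)
print(top)
```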

Code 2

The Kaggle notebook “Analysis Using Algorithm” by Mohamed ElSayed Qamar loads the XSS Attacks Dataset, performs preprocessing by handling missing values and converting categorical fields (such as app name, website, permissions, API name, IP, and location) into numeric form via label encoding, and then builds supervised machine learning models to classify the dataset’s Label (attack vs non-attack). The code trains and evaluates several classifiers (including common baseline models such as Logistic Regression, KNN, Decision Tree, SVM, and Naive Bayes) using a train/test split and standard performance metrics like accuracy, a confusion matrix, and a classification report, providing an end-to-end pipeline for comparing algorithms on the XSS-labeled records.
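The preprocessing described above, filling missing values and label-encoding categorical fields, can be illustrated with a small standard-library sketch. The function name, the "unknown" placeholder for missing entries, and the sample values are assumptions; the notebook's own handling of missing data may differ.

```python
def label_encode(values, missing="unknown"):
    """Map each distinct category to an integer code.

    Assumption: missing entries (None or empty string) share one placeholder
    category. Codes follow the sorted order of the distinct categories.
    """
    filled = [missing if v is None or v == "" else v for v in values]
    classes = sorted(set(filled))
    mapping = {category: code for code, category in enumerate(classes)}
    return [mapping[v] for v in filled], mapping


# Sample location values, for illustration only.
codes, mapping = label_encode(["US", None, "DE", "US"])
print(codes, mapping)
```

Integer codes impose an arbitrary ordering on categories, which tree-based models tolerate well but distance-based models such as KNN can misread as a meaningful scale.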

Code 3

Cross-Site Scripting Projects

Project 1

Project 2

Project 1

The deep-xss project by DAS Lab is a research-focused project that builds an XSS payload detector using deep learning. It treats XSS strings like short pieces of text: first turning payload tokens into numerical representations using a word2vec-style embedding, then feeding the resulting sequences into an LSTM recurrent neural network to classify whether a payload is malicious XSS or not. The repo is set up to support training/testing this model on real payload data (it includes a large CSV payload file) and is tied to the authors’ 2018 paper, where they report high precision/recall on their dataset and present the approach as an alternative to brittle rule-based filtering for catching obfuscated or novel XSS patterns.

Project 3

Project 4

Project 2

Project 3

The ML-XSS-Detection project by Orlando Barrera II is a small, practical machine-learning demo that focuses on classifying text samples as XSS vs. benign. The repo centers around a Jupyter notebook (and a small Python script) that walks through building an ML pipeline where XSS-like strings are treated as “documents,” converted into numeric vectors using a Doc2Vec-style text representation, and then fed into a standard classifier to predict whether an input resembles an XSS payload. It’s designed more as a reproducible learning project than a full production tool: you can follow the notebook to see how the data is processed, how features are created from raw strings, and how the trained model is used to label new samples.

The RandomForest-Thesis project by Ali Raza Lilani is a thesis-style, notebook-driven implementation of an ML detector for web attack strings, focusing on Cross-Site Scripting (XSS) and SQL injection. It provides a complete workflow using a Random Forest classifier: it loads two CSV files of labeled examples (a “bad” set containing attack strings and a “good” set containing legitimate inputs), performs preprocessing, trains the model, and evaluates performance with basic visualizations. The repo also includes a second notebook that trains an LSTM model as a comparison/validation approach on the same dataset, making it useful if you want to contrast a classic ensemble model against a sequence-based deep learning model for the same detection task.

Project 4

The xss-detector project by firedragonironfist is a Python-based XSS detection tool that wraps a machine-learning (deep learning) model in a package you can use from multiple entry points. It’s set up to scan for XSS across raw text, URLs, files, and even structured HTTP request components (params/headers/cookies), and it provides both a command-line interface for quick checks and a REST API server for integrating detection into other tools or pipelines. A notable design choice is that it can automatically download a dataset and train a model on first use, while also exposing a manual training command so you can retrain or refresh the model when needed—making it more “drop-in” than notebook-only demo repos.