SQL-Injection-Extend Dataset
Dataset Summary
The SQL-Injection-Extend dataset on Kaggle was published by alextrinity as an extended corpus of SQL queries for SQL injection (SQLi) detection research. Kaggle does not display a precise publication year on the dataset page, but the dataset is widely referenced in machine learning research from around 2024–2025 as a public benchmark for training SQLi detectors. It consists of a large collection of raw SQL query strings labeled as malicious (SQL injection) or benign, typically stored in a CSV file (e.g., sqli-extended.csv, around 50 MB) in which each row holds the query text and its classification label. It is intended for supervised learning in cybersecurity: building and evaluating models that distinguish SQL injection attacks from normal database queries based on text and pattern features.
Code Using the Dataset
Code 1
The Kaggle notebook “SQL Injection Detection v0” by Alex Trinity builds a character-level deep learning classifier for SQL injection detection on the SQL-Injection-Extend dataset. It cleans the data by dropping rows with missing labels or sentences, converts each query string into a fixed-length (1000) sequence of character indices using a custom alphabet, and splits the data into train/validation/test sets. The notebook defines several alternative neural architectures and ultimately trains a TensorFlow/Keras 1D-CNN model (stacked Conv1D and max-pooling layers followed by dense layers with a sigmoid output) for binary classification. It evaluates performance with accuracy, precision, and recall, then runs extra sanity checks that measure false positives on benign sentences (e.g., text containing apostrophes or the word “select”) and detection robustness on common SQLi patterns.
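The character-level encoding described for the “SQL Injection Detection v0” notebook can be sketched as follows. This is a minimal illustration, not the notebook's code: the alphabet shown here is hypothetical (the notebook's custom alphabet is not given in this text), and reserving index 0 for both padding and unknown characters is an assumption.

```python
import string

# Hypothetical alphabet; the notebook's actual custom alphabet may differ.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
# Index 0 is reserved for padding and for characters outside the alphabet (assumption).
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 1000  # fixed sequence length mentioned in the notebook description


def encode_query(query, max_len=MAX_LEN):
    """Map a query string to a fixed-length list of character indices."""
    indices = [CHAR_TO_INDEX.get(ch, 0) for ch in query.lower()[:max_len]]
    # Right-pad with zeros so every query has exactly max_len entries.
    return indices + [0] * (max_len - len(indices))


encoded = encode_query("' OR 1=1 --")
```

Every query, short or long, becomes a list of exactly 1000 integers, which is the shape a Conv1D-based model expects as input.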
Code 2
The Kaggle notebook “BiLSTM” by non177 loads the SQL-Injection-Extend dataset and prepares the raw SQL query strings for deep learning classification. The code cleans the data, encodes the binary labels (benign vs. SQL injection), tokenizes the query text into numerical sequences, and pads them to a uniform input length. It then builds a Bidirectional LSTM (BiLSTM) network in TensorFlow/Keras, which reads each query in both the forward and backward directions to capture contextual dependencies. The model is trained on a train/test split and evaluated with standard classification metrics such as accuracy and loss, demonstrating a sequence-based deep learning approach to detecting SQL injection from raw query patterns.
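The tokenize-and-pad step described for the “BiLSTM” notebook can be illustrated with a dependency-free sketch. The tokenization rule, the vocabulary cap, the sequence length, and the choice of padding at the end are all assumptions made for illustration; the notebook itself works in TensorFlow/Keras and its exact settings are not stated here.

```python
import re
from collections import Counter

PAD, UNK = 0, 1  # reserved indices (assumption): 0 = padding, 1 = unknown token


def tokenize(query):
    """Split a query into lowercase words and single punctuation symbols."""
    return re.findall(r"\w+|[^\w\s]", query.lower())


def build_vocab(queries, max_words=10000):
    """Assign indices 2, 3, ... to the most frequent tokens."""
    counts = Counter(tok for q in queries for tok in tokenize(q))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_words))}


def to_padded_sequence(query, vocab, max_len=100):
    """Convert a query to token indices, truncated or zero-padded to max_len."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(query)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


train_queries = ["SELECT * FROM users", "SELECT name FROM users WHERE id = 1"]
vocab = build_vocab(train_queries)
sequence = to_padded_sequence("SELECT * FROM admins", vocab, max_len=6)
```

Here the unseen token "admins" maps to the unknown index and the last two positions are padding, so every query reaches the model with the same length.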
Code 3
The Kaggle notebook “Graduation Project 2” by Rowan Gomaa loads the SQL-Injection-Extend dataset, performs basic data cleaning and exploratory analysis (including checking the class distribution and removing duplicates), and applies TF-IDF vectorization to convert the raw queries into numerical features. The code then trains and compares several supervised machine learning models (Decision Tree, Random Forest, Multinomial Naive Bayes, XGBoost, and AdaBoost) to classify queries as benign or SQL injection. Performance is evaluated on a train/test split using accuracy scores, confusion matrices, and classification reports, giving a comparative machine learning pipeline for SQL injection detection.
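To make the TF-IDF step concrete, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which matches scikit-learn's documented defaults for TfidfVectorizer; the simple word tokenizer and the example queries are illustrative and not taken from the notebook.

```python
import math
import re
from collections import Counter


def tokenize(query):
    """Very simple tokenizer (assumption): lowercase alphanumeric runs."""
    return re.findall(r"\w+", query.lower())


def tfidf_vectors(queries):
    """Return one {token: weight} dict per query, L2-normalized."""
    tokenized = [tokenize(q) for q in queries]
    n_docs = len(tokenized)
    # Document frequency: in how many queries each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = []
    for toks in tokenized:
        weights = {tok: count * idf[tok] for tok, count in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in weights.items()})
    return vectors


queries = ["SELECT name FROM users", "SELECT 1 OR 1=1"]
vectors = tfidf_vectors(queries)
```

Tokens that appear in every query (such as "select" here) receive a lower weight than tokens specific to one query, which is what lets the downstream classifiers focus on distinguishing terms.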
SQL Injection Dataset
Dataset Summary
The SQL Injection Dataset on Kaggle was published by Sajid Ali and is commonly referenced as having been released around 2021. It contains upwards of 30,000 SQL query samples labeled for binary classification: each row holds a raw SQL query string and a label indicating whether the query is benign (0) or a SQL injection attack (1). It is intended for supervised machine learning and deep learning research in cybersecurity, supporting text-based detection models that identify malicious SQL injection patterns in database queries.
Code Using the Dataset
Code 1
The Kaggle notebook “5 Machine Learning Model Using SQL Analysis” by Md. Ismiel Hossen Abir loads the SQL Injection Dataset, inspects and preprocesses the SQL query text, and converts the queries into numerical features using TF-IDF vectorization. The code then trains and compares five supervised classifiers (Logistic Regression, Support Vector Machine (SVM), Random Forest, Decision Tree, and Multinomial Naive Bayes) to classify queries as benign or SQL injection. Performance is evaluated on a train/test split using accuracy scores and classification metrics, giving a comparative machine learning pipeline for SQL injection detection based on textual query patterns.
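Of the classifiers compared in “5 Machine Learning Model Using SQL Analysis”, Multinomial Naive Bayes is simple enough to sketch without any library. The version below is a simplified illustration under stated assumptions: it works on raw token counts rather than TF-IDF features, uses Laplace smoothing with alpha = 1, and is trained on four made-up queries rather than the dataset.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(query):
    """Lowercase words and single punctuation symbols (illustrative choice)."""
    return re.findall(r"\w+|[^\w\s]", query.lower())


def train_multinomial_nb(queries, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class token log likelihoods."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for query, label in zip(queries, labels):
        token_counts[label].update(tokenize(query))
    vocab = {tok for counts in token_counts.values() for tok in counts}
    model = {"vocab": vocab, "log_prior": {}, "log_likelihood": {}}
    for label, n_docs in class_docs.items():
        model["log_prior"][label] = math.log(n_docs / len(labels))
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            tok: math.log((token_counts[label][tok] + alpha) / total) for tok in vocab
        }
    return model


def predict(model, query):
    """Return the label with the highest posterior log score."""
    tokens = [t for t in tokenize(query) if t in model["vocab"]]
    scores = {
        label: prior + sum(model["log_likelihood"][label][t] for t in tokens)
        for label, prior in model["log_prior"].items()
    }
    return max(scores, key=scores.get)


# Toy training data (made up for illustration): 0 = benign, 1 = SQL injection.
train_queries = [
    "select name from users where id = 1",
    "select * from products",
    "' or 1=1 --",
    "admin' or '1'='1",
]
train_labels = [0, 0, 1, 1]
model = train_multinomial_nb(train_queries, train_labels)
```

Calling predict(model, "' or 1=1 --") selects the injection class because quote, "or", and comment tokens are far more frequent in the malicious examples than in the benign ones.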
Code 2
The Kaggle notebook “SQL Injection Detection” by Omar Farooq loads the SQL Injection Dataset, uses the Query column as input text and the Label column as the binary target, and preprocesses the queries by tokenizing them with a Keras Tokenizer (top 10,000 words) and padding the sequences to a fixed length of 100. It then trains a 1D CNN (Embedding → Conv1D → MaxPooling → Dropout → Dense → Sigmoid) on an 80/20 train/test split, with an additional validation split during training, and evaluates the results with a classification report. Finally, it demonstrates practical usage by predicting whether several example queries, including classic SQLi patterns such as ' OR 1=1 --, are malicious or safe.
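The core Conv1D and max-pooling operations in the CNN from the “SQL Injection Detection” notebook can be illustrated in plain Python. This is a conceptual sketch of what one filter computes, not the notebook's Keras model: it uses a single filter, a ReLU activation, "valid" padding, and a pool size of 2, all of which are illustrative assumptions.

```python
def conv1d_valid(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of embedding vectors.

    sequence: list of T vectors, each of length C (one vector per token).
    kernel:   list of K vectors, each of length C (the filter weights).
    Returns T - K + 1 activations after ReLU.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(x * w for x, w in zip(sequence[start + offset], kernel[offset]))
        outputs.append(max(0.0, total))  # ReLU
    return outputs


def max_pool1d(values, pool_size=2):
    """Keep the largest activation in each non-overlapping window."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


# Five tokens with one-dimensional "embeddings" and a filter spanning two tokens.
embedded = [[1.0], [2.0], [3.0], [4.0], [5.0]]
kernel = [[1.0], [1.0]]
feature_map = conv1d_valid(embedded, kernel)
pooled = max_pool1d(feature_map)
```

The filter responds to local patterns across adjacent tokens, and pooling keeps only the strongest responses, which is why this architecture can pick up short attack fragments wherever they occur in a query.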
Code 3
The Kaggle notebook “Finetuning BERT with Triplet/Contrastive Loss” by Kaggle user Rohan K loads the SQL Injection Dataset, samples 10,000 labeled queries, and splits them into train/test sets. It fine-tunes a SentenceTransformer DistilBERT model (distilbert-base-nli-mean-tokens) with a triplet-loss objective (BatchAllTripletLoss), so that queries with the same label are pulled toward nearby embeddings while queries with different labels are pushed apart, and then compares the original and fine-tuned embeddings using t-SNE visualizations. Finally, it uses the fine-tuned embeddings as features for a Logistic Regression classifier and reports accuracy, precision, recall, F1, a confusion matrix, and ROC/AUC, demonstrating an embedding-based SQL injection detection pipeline.
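The idea behind a batch-all triplet loss, as used in the “Finetuning BERT with Triplet/Contrastive Loss” notebook, can be sketched on toy vectors. This is a plain-Python illustration rather than the SentenceTransformers implementation: the Euclidean distance, the margin value, and averaging only over triplets with a positive loss are assumptions, and the embeddings are made up.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_all_triplet_loss(embeddings, labels, margin=5.0):
    """Average max(0, d(a, p) - d(a, n) + margin) over all valid triplets.

    A triplet (a, p, n) is valid when a and p are different samples with the
    same label and n has a different label. Only triplets with a positive
    loss contribute to the average (assumed convention).
    """
    losses = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue
                loss = (
                    euclidean(embeddings[a], embeddings[p])
                    - euclidean(embeddings[a], embeddings[neg])
                    + margin
                )
                if loss > 0:
                    losses.append(loss)
    return sum(losses) / len(losses) if losses else 0.0


# Two injection samples (label 1) close together, one benign sample (label 0) far away.
embeddings = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
labels = [1, 1, 0]
loss_large_margin = batch_all_triplet_loss(embeddings, labels, margin=5.0)
loss_small_margin = batch_all_triplet_loss(embeddings, labels, margin=1.0)
```

With a small margin the classes are already separated well enough and the loss is zero; with a larger margin the loss stays positive, which keeps pushing different-label queries further apart during fine-tuning.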
sql injection Dataset
Dataset Summary
The SQL Injection Dataset on Kaggle was created and published by Syed Saqlain Hussain Shah and is one of the original public SQL injection corpora used for cybersecurity research and attack detection studies. It contains tens of thousands of SQL query strings collected from multiple websites, each labeled as malicious (SQL injection) or benign (normal) for binary classification, and it is cleaned and formatted for supervised learning. Kaggle does not show a formal publication year, but the upload timeline and community discussion indicate the dataset was added around 4 years ago (~2020). Researchers and practitioners use it as a benchmark for building and evaluating machine learning and deep learning models that detect SQL injection attacks by analyzing query patterns and textual features.
Code Using the Dataset
Code 1
The Kaggle notebook “SQL Inject Using Linear Models and CNN” by Chinonso Cynthia loads the SQL Injection Dataset, preprocesses the SQL query strings, and encodes the binary labels for classification. The notebook first converts the queries into numerical features using TF-IDF vectorization and trains several traditional linear models (such as Logistic Regression and Support Vector Machine) to classify queries as benign or malicious. It then builds a Convolutional Neural Network (CNN) in Keras that learns patterns directly from tokenized and padded query sequences. Performance is evaluated on train/test splits using accuracy and classification metrics, allowing a comparison between linear models and deep learning approaches for SQL injection detection.
Code 2
The Kaggle notebook “EDA – SQL Injection Dataset” by iniestamoh loads the SQL Injection Dataset, removes unused columns, drops rows with missing values, filters the label field to keep only the valid binary classes (0/1), and converts the labels to integers. Its exploratory analysis reports the dataset structure and class counts, visualizes the label balance with a pie chart, and inspects common SQL injection characteristics in the query text (e.g., counting payloads that contain comment markers such as #, --, or //, and examining queries that contain FROM to infer the referenced table names). The notebook also runs basic NLP-style EDA by tokenizing all query strings with NLTK, computing the most frequent tokens, and plotting the top word frequencies, and it optionally generates an automated pandas-profiling report for a broader overview of the dataset.
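The cleaning and pattern-counting steps described for the “EDA – SQL Injection Dataset” notebook can be sketched with the standard library alone. The column names Sentence and Label, the sample rows, and the regular expression used to pull table names after FROM are assumptions for illustration; the notebook itself works on the full dataset with its own tooling.

```python
import re
from collections import Counter

# Made-up rows standing in for the CSV contents (column names are assumed).
rows = [
    {"Sentence": "SELECT name FROM users WHERE id = 1", "Label": "0"},
    {"Sentence": "' OR 1=1 --", "Label": "1"},
    {"Sentence": "admin'#", "Label": "1"},
    {"Sentence": None, "Label": "1"},         # missing query: dropped
    {"Sentence": "hello", "Label": "maybe"},  # invalid label: dropped
]


def clean_rows(rows):
    """Keep rows with a non-empty query and a label of '0' or '1'."""
    cleaned = []
    for row in rows:
        sentence, label = row.get("Sentence"), row.get("Label")
        if not sentence or label not in {"0", "1"}:
            continue
        cleaned.append((sentence, int(label)))
    return cleaned


cleaned = clean_rows(rows)
class_counts = Counter(label for _, label in cleaned)

# Count how many queries contain each comment marker named in the description.
marker_counts = {
    marker: sum(1 for sentence, _ in cleaned if marker in sentence)
    for marker in ("#", "--", "//")
}

# Infer referenced table names from the identifier that follows FROM.
tables = [
    match.group(1)
    for sentence, _ in cleaned
    for match in re.finditer(r"\bFROM\s+([A-Za-z_]\w*)", sentence, flags=re.IGNORECASE)
]
```

On real data the same counts would feed the class-balance chart and the summary of which comment styles attackers favor.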
Code 3
The Kaggle notebook “SQL Injection Detection Using Neural Network” by Syed Saqlain Hussain loads the SQL Injection Dataset, converts the query strings in the Sentence column into numeric bag-of-words features using CountVectorizer with English stopwords, and concatenates these features back into the dataframe to form the final feature matrix (X), with Label as the binary target (y). It first trains a Logistic Regression baseline and reports its test accuracy, then builds a Keras feedforward neural network (stacked Dense layers with BatchNormalization and Dropout, and a sigmoid output) trained on the same vectorized features. The neural model's predictions are thresholded at 0.5, and the notebook computes accuracy, precision, and recall using both a custom confusion-matrix function and scikit-learn's precision/recall metrics.
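The evaluation step described above, thresholding sigmoid outputs at 0.5 and deriving metrics from a confusion matrix, can be sketched as follows. The function names and sample values are hypothetical, and treating a probability of exactly 0.5 as positive is an assumption; the notebook's own custom function may differ in detail.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, where 1 = SQL injection."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def binary_metrics(y_true, probabilities, threshold=0.5):
    """Threshold predicted probabilities and compute accuracy, precision, recall."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels and sigmoid outputs for six queries.
y_true = [1, 0, 1, 1, 0, 0]
probabilities = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = binary_metrics(y_true, probabilities)
```

For an injection detector, recall measures how many attacks are caught and precision how many alerts are real, so reporting both alongside accuracy gives a fuller picture than accuracy alone.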