DDoS Practice Test File (CSV)
=============================

This is a synthetic, defender-focused dataset for practicing machine-learning
defenses against Distributed Denial of Service (DDoS) attacks.

Each row summarizes *one minute* of traffic/telemetry for a specific target
service.

label:
  0 = normal traffic
  1 = DDoS attack traffic

Time window
-----------
- window_start_utc is the start of the 1-minute aggregation window (UTC).

Columns
-------
window_start_utc
  Start of the 1-minute aggregation window in UTC (ISO-8601).

target_id
  Identifier of the protected service instance (e.g., web-1, api-1, dns-1).

service
  Human-readable service type (HTTP, HTTPS, DNS).

total_requests
  Total inbound requests/queries observed during the 1-minute window.

unique_src_ips
  Count of distinct source IP addresses seen in the window.
  DDoS botnets often increase this sharply (but not always).

src_ip_entropy
  Shannon entropy of the per-source request distribution (higher can mean
  more evenly distributed bot traffic).
  Computed on a capped sample for efficiency; treat as an approximate signal.

tcp_ratio, udp_ratio, icmp_ratio
  Fraction of observed traffic by protocol family in the window.
  Certain DDoS patterns skew these ratios (e.g., UDP flood for DNS,
  TCP SYN flood for web).

total_packets
  Estimated total packets during the window (proxy derived from request
  volume and typical packetization).

total_bytes
  Estimated total bytes during the window.

avg_packet_size_bytes
  Average packet size in bytes (synthetic).
  Some floods have distinct packet-size patterns.

packets_per_second
  Packets per second during the window (total_packets / 60).

bits_per_second
  Bits per second during the window ((total_bytes * 8) / 60).

syn_count, ack_count, fin_count, rst_count
  Synthetic TCP-flag-like counters for the window (useful for distinguishing
  SYN floods vs normal traffic).
  These are proxies (not raw packet captures).

error_4xx_rate
  Approximate fraction of requests producing client-side errors (0.0 to 1.0).

error_5xx_rate
  Approximate fraction of requests producing server-side errors (0.0 to 1.0).
  Under saturation, 5xx rates may increase.

Suggested Practice Tasks
------------------------
1) Train a baseline classifier (RandomForest / XGBoost / Logistic Regression)
   to predict label.
2) Evaluate with Precision/Recall and PR-AUC (false positives matter).
3) Try an unsupervised approach (Isolation Forest) using only rows with
   label=0 for training.
4) Add rolling features (e.g., 5–15 minute rolling mean/max) per target_id.
5) Choose an alert threshold that keeps false positives below a chosen rate
   while maintaining high recall.
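
For task 4, rolling features can be sketched with the Python standard library alone. This is a minimal illustration, not part of the dataset tooling: the function name `add_rolling_features`, the output column names, and the tiny inline sample are all made up for the example. It assumes each target_id has one row per minute with no gaps, and that timestamps share one ISO-8601 format so they sort chronologically as strings.

```python
import csv
import io
from collections import defaultdict, deque


def add_rolling_features(rows, column="total_requests", window=5):
    """Add a trailing rolling mean/max of `column` per target_id.

    The window covers the current row plus up to `window - 1` earlier rows
    of the same target_id, so no future information leaks into a row.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    # Sort by target, then time, so each target's rows are visited in order.
    for row in sorted(rows, key=lambda r: (r["target_id"], r["window_start_utc"])):
        past = history[row["target_id"]]
        past.append(float(row[column]))
        enriched = dict(row)
        enriched[f"{column}_roll_mean_{window}"] = sum(past) / len(past)
        enriched[f"{column}_roll_max_{window}"] = max(past)
        out.append(enriched)
    return out


# Made-up sample rows using a subset of the documented columns.
SAMPLE_CSV = """window_start_utc,target_id,total_requests
2024-01-01T00:00:00Z,web-1,100
2024-01-01T00:01:00Z,web-1,120
2024-01-01T00:02:00Z,web-1,900
2024-01-01T00:00:00Z,dns-1,50
2024-01-01T00:01:00Z,dns-1,70
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features = add_rolling_features(rows, window=2)
for f in features:
    print(f["target_id"], f["window_start_utc"],
          f["total_requests_roll_mean_2"], f["total_requests_roll_max_2"])
```

Keeping the history per target_id matters: mixing web-1 and dns-1 rows in one window would blur very different baselines. With pandas available, a groupby on target_id followed by a rolling aggregation would likely be the more usual route.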
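
For task 5, one simple way to choose an alert threshold is to scan candidate thresholds from high to low and keep the lowest one whose false-positive rate stays under a cap. The sketch below is a hedged illustration: `pick_threshold` is a hypothetical helper, and the scores are invented stand-ins for the output of whatever classifier was trained in task 1. An alert is assumed to fire when score >= threshold.

```python
def pick_threshold(scores, labels, max_fpr=0.01):
    """Return (threshold, recall, fpr) for the lowest threshold with fpr <= max_fpr.

    labels follow the dataset convention: 0 = normal, 1 = DDoS.
    Returns None if even the highest score exceeds the false-positive cap.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    if negatives == 0 or positives == 0:
        raise ValueError("need at least one row of each class")

    best = None
    # Try each distinct score as a candidate threshold, highest first.
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fpr = fp / negatives
        if fpr > max_fpr:
            break  # lowering the threshold further can only add false positives
        best = (t, tp / positives, fpr)
    return best


# Invented scores and labels, for illustration only.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, recall, fpr = pick_threshold(scores, labels, max_fpr=0.25)
print(threshold, recall, fpr)
```

Because recall and false-positive rate both rise (or stay flat) as the threshold drops, the lowest admissible threshold is also the one with the highest recall under the cap. The threshold should be chosen on a validation split and then checked on held-out data, since a cap tuned on the training rows tends to look better than it is.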