DDoS Practice Test File (CSV)
=============================

This is a synthetic, defender-focused dataset for practicing machine-learning defenses against Distributed Denial of Service (DDoS) attacks.
Each row summarizes *one minute* of traffic/telemetry for a specific target service.

label
  Target column: 0 = normal traffic, 1 = DDoS attack traffic.

Time window
-----------
- window_start_utc is the start of the 1-minute aggregation window (UTC).

Columns
-------
window_start_utc
  Start of the 1-minute aggregation window in UTC (ISO-8601).

target_id
  Identifier of the protected service instance (e.g., web-1, api-1, dns-1).

service
  Human-readable service type (HTTP, HTTPS, DNS).

total_requests
  Total inbound requests/queries observed during the 1-minute window.

unique_src_ips
  Count of distinct source IP addresses seen in the window.
  DDoS botnets often increase this sharply (but not always).

src_ip_entropy
  Shannon entropy of the per-source request distribution. Higher values mean requests are spread more evenly across sources, which can indicate distributed bot traffic.
  Computed on a capped sample for efficiency; treat as an approximate signal.
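
As an illustration of what this column measures, the sketch below computes Shannon entropy from a list of per-request source IPs. It is not the generator's actual code: the function name, the base-2 logarithm (entropy in bits), and the sample addresses are assumptions, and the capped sampling mentioned above is not reproduced.

```python
import math
from collections import Counter

def src_ip_entropy(source_ips):
    """Shannon entropy (in bits, an assumed unit) of requests per source IP."""
    counts = Counter(source_ips)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total  # share of requests sent by this source
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical inputs: one source IP per observed request.
even = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
skewed = ["192.0.2.1"] * 7 + ["192.0.2.2"]

print(src_ip_entropy(even))    # 2.0 (four equally active sources)
print(src_ip_entropy(skewed))  # lower, because one source dominates
```

Entropy depends on both the number of sources and how evenly they send, so it complements unique_src_ips rather than duplicating it.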

tcp_ratio, udp_ratio, icmp_ratio
  Fraction of observed traffic by protocol family in the window.
  Certain DDoS patterns skew these ratios (e.g., UDP flood for DNS, TCP SYN flood for web).

total_packets
  Estimated total packets during the window (proxy derived from request volume and typical packetization).

total_bytes
  Estimated total bytes during the window.

avg_packet_size_bytes
  Average packet size in bytes (synthetic). Some floods have distinct packet-size patterns.

packets_per_second
  Packets per second during the window (total_packets / 60).

bits_per_second
  Bits per second during the window ((total_bytes*8)/60).
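
The two rate columns follow directly from the totals, so they can be recomputed as a sanity check when loading the file. The sketch below assumes rows are dictionaries keyed by column name; the helper names, the tolerance, and the sample values are hypothetical.

```python
import math

WINDOW_SECONDS = 60  # each row covers a 1-minute window

def packets_per_second(total_packets):
    return total_packets / WINDOW_SECONDS

def bits_per_second(total_bytes):
    return (total_bytes * 8) / WINDOW_SECONDS

def rates_consistent(row, rel_tol=1e-3):
    """True if the stored rates match the totals.

    rel_tol is an assumed allowance for rounding in the stored values.
    """
    pps_ok = math.isclose(float(row["packets_per_second"]),
                          packets_per_second(float(row["total_packets"])),
                          rel_tol=rel_tol)
    bps_ok = math.isclose(float(row["bits_per_second"]),
                          bits_per_second(float(row["total_bytes"])),
                          rel_tol=rel_tol)
    return pps_ok and bps_ok

# Hypothetical row, not taken from the dataset.
sample_row = {
    "total_packets": "6000",
    "total_bytes": "3000000",
    "packets_per_second": "100.0",
    "bits_per_second": "400000.0",
}
print(rates_consistent(sample_row))  # True
```

Because these columns are deterministic functions of the totals, they add no independent information to a model; they are mainly a convenience for thresholds expressed in per-second units.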

syn_count, ack_count, fin_count, rst_count
  Synthetic TCP-flag-like counters for the window, useful for distinguishing SYN floods from normal traffic.
  These are proxies (not raw packet captures).

error_4xx_rate
  Approximate fraction of requests producing client-side errors (0.0 to 1.0).

error_5xx_rate
  Approximate fraction of requests producing server-side errors (0.0 to 1.0).
  Under saturation, 5xx rates may increase.
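
A minimal loading sketch for the columns above, using only the standard library. Several things here are assumptions rather than facts about the file: the in-memory sample (a hypothetical subset of columns with made-up values), the exact timestamp spelling (a trailing "Z" is handled in case it is used), and the choice to read every numeric column except label as a float.

```python
import csv
import io
from datetime import datetime, timezone

TEXT_COLUMNS = {"target_id", "service"}

def parse_row(raw):
    """Convert one csv.DictReader row (all strings) into typed values."""
    row = {}
    for name, value in raw.items():
        if name == "window_start_utc":
            # Accept both "...Z" and "...+00:00" spellings of UTC.
            ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
            if ts.tzinfo is None:
                ts = ts.replace(tzinfo=timezone.utc)  # column is documented as UTC
            row[name] = ts
        elif name in TEXT_COLUMNS:
            row[name] = value
        elif name == "label":
            row[name] = int(value)
        else:
            row[name] = float(value)  # assumed: all remaining columns are numeric
    return row

# Hypothetical sample; replace io.StringIO(...) with open(<path>, newline="").
SAMPLE_CSV = """window_start_utc,target_id,service,total_requests,unique_src_ips,udp_ratio,label
2024-01-01T00:00:00Z,dns-1,DNS,1200,85,0.93,0
"""

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["window_start_utc"], rows[0]["service"], rows[0]["label"])
```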

Suggested Practice Tasks
------------------------
1) Train a baseline classifier (RandomForest / XGBoost / Logistic Regression) to predict label.
2) Evaluate with Precision/Recall and PR-AUC (false positives matter).
3) Try an unsupervised anomaly detector (Isolation Forest), fitting it only on rows with label=0.
4) Add rolling features (e.g., 5–15 minute rolling mean/max) per target_id.
5) Choose an alert threshold that keeps false positives below a chosen rate while maintaining high recall.
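
Tasks 4 and 5 can be sketched without third-party libraries. The code below is one possible approach, not a reference solution: the function names, feature names, sample rows, and scores are all hypothetical. The rolling helper assumes rows are sorted by window_start_utc with no missing minutes, so a window of N rows equals N minutes. The threshold helper assumes higher scores mean "more likely an attack".

```python
from collections import defaultdict, deque

def add_rolling_features(rows, column, window=5):
    """Add trailing mean/max of `column` over the last `window` rows per target_id."""
    history = defaultdict(lambda: deque(maxlen=window))
    out = []
    for row in rows:
        past = history[row["target_id"]]
        past.append(row[column])  # the window includes the current row
        enriched = dict(row)
        enriched[f"{column}_roll_mean_{window}"] = sum(past) / len(past)
        enriched[f"{column}_roll_max_{window}"] = max(past)
        out.append(enriched)
    return out

def pick_threshold(scores, labels, max_fpr=0.01):
    """Lowest threshold whose false-positive rate stays within max_fpr.

    A row raises an alert when score >= threshold.
    Returns (threshold, fpr, recall), or None if no threshold qualifies.
    """
    negatives = sum(1 for y in labels if y == 0)
    positives = sum(1 for y in labels if y == 1)
    best = None
    # Lowering the threshold can only add alerts, so FPR never decreases.
    for threshold in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fpr = fp / negatives if negatives else 0.0
        if fpr > max_fpr:
            break
        recall = tp / positives if positives else 0.0
        best = (threshold, fpr, recall)
    return best

# Hypothetical rows and classifier scores, for illustration only.
rows = [
    {"target_id": "web-1", "total_requests": 100},
    {"target_id": "web-1", "total_requests": 300},
    {"target_id": "api-1", "total_requests": 50},
    {"target_id": "web-1", "total_requests": 200},
]
feats = add_rolling_features(rows, "total_requests", window=2)
print(feats[3]["total_requests_roll_mean_2"])  # 250.0 (mean of 300 and 200)

scores = [0.1, 0.2, 0.6, 0.5, 0.8, 0.9]
labels = [0, 0, 0, 1, 1, 1]
print(pick_threshold(scores, labels, max_fpr=0.0))  # threshold 0.8, no false positives
```

The threshold search rescans all scores for each candidate, which is fine for a small practice file but quadratic in the worst case. Choosing the threshold on a held-out split, rather than on the training rows, gives a more honest estimate of the false-positive rate.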