- Why Cloud Intrusion Detection Must Keep Pace with Attackers
- Building a Cloud Intrusion Detection Pipeline from Raw Packets
- Stage 1: Ingestion
- Stage 2: Streaming
- Stage 3: Feature Engineering
- Stage 4: Detection
- Stage 5: Response (Designed, Not Deployed)
- What I Learned About Building for Operations
- The Numbers, Honestly
- Reproduce It If You Want
- Final Thoughts
My thesis started with a single question that kept me up at night: if someone compromises a container in my cluster, how long do I have before they disappear?
The answer, according to the research I read, was measured in seconds. Not days. Not hours. Seconds. In one study, an exposed Kubernetes honeypot was discovered and compromised within minutes of going live. In another, red teams achieved their objectives in five to seven days, well below the global median dwell time of ten days. The gap between attacker speed and defender reaction was getting wider, and the tools I saw were not built to close it.
So I spent six months building a proof of concept. It is not a product. It will not replace your enterprise firewall. But it is a reproducible, end-to-end pipeline that takes raw network traffic, turns it into structured flow records, scores each one with a lightweight machine learning model, and proposes an automated response, all within a latency budget designed for containerized environments. I called it cloud-ids-pipeline, because I had run out of dramatic names by that point.
This post is about what I built, why I made the architectural choices I did, what the numbers actually say, and where it falls short. The goal is to show end-to-end ownership of a hard problem, not to claim a finished product.
Building a cloud intrusion detection system means accepting that speed and accuracy are not opposing goals.
Quick Overview
What I Built
- Containerized Pipeline: End-to-end detection from packet capture to scored alert, built for cloud-native environments.
- 85% Detection Quality: ROC-AUC on the CICIDS2017 benchmark using a lightweight, interpretable tree-based model.
- Calibrated for Operations: False-alarm budget chosen for real analyst workload, not leaderboard vanity metrics.
- Automated Response: Policy engine designed with safe defaults, latency budgets, and human-in-the-loop for edge cases.
- Fully Open Source: Reproducible Docker builds, fixed random seeds, and end-to-end documentation on GitHub.
Why Cloud Intrusion Detection Must Keep Pace with Attackers
Cloud-native infrastructure is fast by design. Containers spin up and die in minutes. Microservices talk to each other through encrypted channels that bypass traditional network choke points. Kubernetes schedules workloads across nodes in ways that make static IP-based rules almost meaningless.
That speed is a gift to attackers too. Once a pod is compromised, lateral movement across namespaces can happen before a human analyst finishes their first cup of coffee. The traditional security model assumes you have time. In Kubernetes, you do not.
I wanted to see if I could build a detection gate that operated at cloud speed. Not a deep-learning behemoth that needs a GPU farm. Not a rules engine that misses novel attacks. Something lightweight enough to run as a sidecar, fast enough to score flows in milliseconds, and honest enough to admit when it is uncertain.
And because detecting something is only half the battle, I also wanted a response path that could act automatically on high-confidence alerts while reserving human judgment for the edge cases.
That meant solving four problems in sequence: ingestion, featurization, detection, and response. Each one had to be fast. Each one had to be reproducible. And the whole thing had to run on a laptop. Those constraints are why I scoped the project to cloud intrusion detection specifically, rather than a generic monitoring stack.
Building a Cloud Intrusion Detection Pipeline from Raw Packets
The architecture is intentionally boring. Boring is good when you are asking someone to trust it with security decisions. It has four implemented stages and one designed stage, connected by explicit contracts so that any piece can be swapped without breaking the others.
This section explains how each stage of the cloud intrusion detection pipeline contributes to the overall latency budget.
Stage 1: Ingestion
The first challenge is turning raw network traffic into structured data without writing fragile packet parsers. I used an open-source protocol analyzer running in a Docker container with minimal privileges: just enough permissions to see traffic, not enough to own the host. The version is pinned so the behavior is identical on my laptop, a CI runner, or a cloud VM. Immutability matters when you are debugging why a feature value changed.
A typical connection record contains timestamps, origin and responder IPs and ports, protocol, duration, packet counts, and byte counts. From these primitives, everything else is derived.
Ingestion is the first gate in any cloud intrusion detection system: if you cannot parse the traffic, you cannot score it.
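As a minimal sketch of what "structured flow record" means here, the snippet below parses connection records into the primitives listed above. It assumes the analyzer is Zeek writing its default tab-separated conn.log (the quick-start mentions a Zeek parser); the `FlowRecord` type and `parse_conn_log` function are hypothetical names for illustration, not code from the repository.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical container for the primitives every later stage derives from."""
    ts: float
    orig_h: str
    orig_p: int
    resp_h: str
    resp_p: int
    proto: str
    duration: Optional[float]
    orig_pkts: Optional[int]
    resp_pkts: Optional[int]
    orig_bytes: Optional[int]
    resp_bytes: Optional[int]


def _opt(value: str, cast):
    """Zeek writes '-' for unset fields; map that to None instead of guessing."""
    return None if value in ("-", "") else cast(value)


def parse_conn_log(lines: Iterable[str]) -> Iterator[FlowRecord]:
    """Yield one FlowRecord per data row, using the '#fields' header for column names."""
    fields = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
            continue
        if not line or line.startswith("#"):
            continue  # other header/footer metadata lines
        if not fields:
            raise ValueError("data row seen before the #fields header")
        row = dict(zip(fields, line.split("\t")))
        yield FlowRecord(
            ts=float(row["ts"]),
            orig_h=row["id.orig_h"],
            orig_p=int(row["id.orig_p"]),
            resp_h=row["id.resp_h"],
            resp_p=int(row["id.resp_p"]),
            proto=row["proto"],
            duration=_opt(row.get("duration", "-"), float),
            orig_pkts=_opt(row.get("orig_pkts", "-"), int),
            resp_pkts=_opt(row.get("resp_pkts", "-"), int),
            orig_bytes=_opt(row.get("orig_bytes", "-"), int),
            resp_bytes=_opt(row.get("resp_bytes", "-"), int),
        )
```

Reading column names from the header rather than hard-coding positions keeps the sketch tolerant of a different field order, which is one way to avoid the fragile-parser problem.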
Stage 2: Streaming
Writing to disk is fine for forensics, but a model that reads from disk is a model that loses races. So I added a small stateless consumer that tails the connection log and publishes each record into a fast, ordered message stream. Downstream components can block-read new entries in micro-batches without polling.
If the consumer crashes, it restarts and resumes from its last acknowledged position. That resilience comes almost for free from the message stream, and it is expensive to get wrong if you build it yourself.
This stage adds negligible overhead. It is not doing feature extraction or scoring. It is just moving structured data from one place to another, quickly and reliably.
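A rough sketch of the publish and micro-batch read pattern, assuming the ordered stream is a Redis Stream with a consumer group. The stream name, group name, and function names are hypothetical, and `client` is assumed to be a redis-py style client created with `decode_responses=True`; the post only says the shipper feeds Redis, so treat the choice of Streams as an assumption.

```python
import json
from typing import Any, Callable, Dict

STREAM = "flows"     # hypothetical stream key
GROUP = "scorers"    # hypothetical consumer group name


def publish(client: Any, record: Dict[str, Any]) -> Any:
    """Append one flow record; the server assigns a monotonically ordered entry ID."""
    return client.xadd(STREAM, {"payload": json.dumps(record)})


def consume_batch(
    client: Any,
    consumer: str,
    handle: Callable[[Dict[str, Any]], None],
    count: int = 100,
    block_ms: int = 1000,
) -> int:
    """Block up to block_ms for new entries, process each one, then acknowledge it.

    Entries that were delivered but never acknowledged remain pending for the
    group, so a restarted consumer can pick them up again instead of losing them.
    """
    reply = client.xreadgroup(GROUP, consumer, {STREAM: ">"}, count=count, block=block_ms)
    handled = 0
    for _stream, entries in reply or []:
        for entry_id, fields in entries:
            handle(json.loads(fields["payload"]))
            client.xack(STREAM, GROUP, entry_id)  # acknowledge only after handling
            handled += 1
    return handled
```

Acknowledging after the handler returns gives at-least-once delivery: a crash between handling and acknowledgment can replay a record, so downstream scoring should tolerate the occasional duplicate.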
Stage 3: Feature Engineering
Raw network data is messy. Before any model sees it, the data goes through a hygiene pipeline that would not look out of place in a production ML platform:
- Schema normalization: trimming malformed whitespace, fixing encoding issues in attack labels, and consolidating raw labels into a compact attack taxonomy.
- Physical plausibility checks: dropping rows with negative flow durations, clipping impossible timings, and removing duplicate records.
- Deduplication: removing constant columns and exact duplicates.
- Correlation-aware reduction: grouping highly correlated features and keeping one representative per group. I kept primitive counters over derived rates, and central aggregates over extreme values. This reduced the schema from 69 candidate features to 42.
- Clean splitting: a 70/15/15 train/validation/test split stratified on the binary attack label, with an explicit deduplication pass to prevent identical rows from leaking across folds.
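The clean-splitting step in the list above can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's code: rows are modelled as hashable tuples with the binary label in the last position, and the function name and the seed value 42 are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]  # feature values followed by the binary label


def dedup_then_split(
    rows: Sequence[Row],
    fractions: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 42,  # hypothetical seed; the point is that it is fixed
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Drop exact duplicates first, then split each label group 70/15/15."""
    # Deduplicating before the split means an identical row can never end up
    # in both the training set and an evaluation set.
    unique = list(dict.fromkeys(rows))

    by_label = defaultdict(list)
    for row in unique:
        by_label[row[-1]].append(row)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder, roughly fractions[2]
    return train, val, test
```

Splitting within each label group is what keeps the attack share roughly equal across the three sets; the fixed seed makes the assignment repeatable run to run.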
Anyone can throw a model at a CSV and get a number. The harder part is understanding the data well enough to know whether the number means anything. The 42-feature schema is the contract: any online consumer must apply exactly these transforms before calling the scorer.
Feature engineering is where most cloud intrusion detection projects fail, not at the model stage.
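One way to picture the correlation-aware reduction from the hygiene list is a greedy pass over features in preference order. The sketch below is an assumption about how such a pass could work, not the thesis code: the 0.95 cut-off, the function names, and the idea of passing an explicit priority list (primitive counters first, derived rates later) are all illustrative.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either column has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant columns are assumed to be dropped in an earlier step
    return cov / math.sqrt(var_x * var_y)


def reduce_correlated(
    columns: Dict[str, Sequence[float]],
    priority: List[str],
    threshold: float = 0.95,  # hypothetical cut-off
) -> List[str]:
    """Keep a feature only if it is not highly correlated with one already kept.

    Because features are visited in priority order, the preferred member of
    each correlated group becomes its representative.
    """
    kept: List[str] = []
    for name in priority:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Ordering the priority list so that primitive counters come before derived rates reproduces the "keep the primitive" preference without any special-case logic.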
Stage 4: Detection
I tested five models across three families: an unsupervised anomaly detector, two supervised tree ensembles, and two shallow neural baselines. The unsupervised approach achieved attack recall below 5%, meaning it missed nearly every attack. On this dataset, “learning normal” without labels does not work because benign traffic is too varied.
The supervised tree ensembles were a different story. Even with default settings, the best model reached balanced accuracy around 72% and ROC-AUC near 85%. The score distributions showed real separability.
But raw accuracy is not the goal. The goal is an operating point. I calibrated probabilities, then selected a threshold that caps the false positive rate at roughly 9% on validation. At that point, the model catches about 40% of attacks.
The detection stage is the heart of the cloud intrusion detection pipeline, but it is only as good as the data that feeds it.
Why 40% recall and not 99%? Because the alternative, tuning for maximum recall, would drown the operator in false alarms. In security, a model that catches everything but cries wolf ten thousand times a day is worse than a model that catches less but is trusted. I chose the threshold deliberately, as an operational decision, not a leaderboard decision.
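Choosing a threshold under a false-positive cap can be sketched as a scan over candidate thresholds on the validation set. The function name and the quadratic scan are illustrative assumptions; it takes already-calibrated scores and binary labels (1 for attack) and returns the lowest threshold that still respects the cap, along with the false positive rate and recall at that point.

```python
from typing import Sequence, Tuple


def operating_point(
    scores: Sequence[float],
    labels: Sequence[int],
    max_fpr: float = 0.09,
) -> Tuple[float, float, float]:
    """Return (threshold, fpr, recall); a flow alerts when score >= threshold."""
    n_neg = sum(1 for y in labels if y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    if n_neg == 0 or n_pos == 0:
        raise ValueError("validation data must contain both classes")

    best = None
    # Walk thresholds from strictest to loosest. FPR can only grow as the
    # threshold drops, so stop at the first candidate that breaks the cap.
    # This scan is O(n * unique scores), which is fine for a sketch.
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if fp / n_neg > max_fpr:
            break
        best = (t, fp / n_neg, tp / n_pos)

    if best is None:
        raise ValueError("no threshold satisfies the false positive cap")
    return best
```

The threshold is fixed on validation data only; the test split is then used once to report what that operating point delivers.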
Stage 5: Response (Designed, Not Deployed)
Scores are not actions. The final stage is a small policy engine that translates model outputs into cluster operations with safe defaults:
- High-confidence DDoS or aggressive port scan: apply a temporary quarantine network policy to the source pod.
- Probable brute-force behavior: rate-limit or add a temporary egress block.
- Unknown but high-score anomaly: isolate and require human approval.
- Sensitive context, for example a VIP subnet: hold for a second signal.
- Ambiguous or low-volume alert: tag for enhanced monitoring and defer action.
Each action has a clear confirmation event so that latency can be measured later. The architecture does not claim real numbers yet, but it makes the path to getting them obvious.
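The policy table above could be expressed as a small pure function, which keeps it easy to test before anything touches a cluster. Everything named here is hypothetical: the `Alert` fields, the family labels, the 0.90 confidence cut-off, and the rule ordering are assumptions for illustration, and the family hint would have to come from a signal richer than the current single-flow model.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    QUARANTINE = "apply temporary quarantine network policy"
    RATE_LIMIT = "rate-limit or temporary egress block"
    ISOLATE_PENDING_APPROVAL = "isolate and require human approval"
    HOLD_FOR_SECOND_SIGNAL = "hold for a second signal"
    MONITOR = "tag for enhanced monitoring"


@dataclass(frozen=True)
class Alert:
    score: float                     # calibrated attack probability
    family: str                      # hypothetical hint: "ddos", "portscan", "bruteforce", "unknown"
    sensitive_context: bool = False  # e.g. source or target in a VIP subnet
    low_volume: bool = False         # too few flows to act on confidently


HIGH_CONFIDENCE = 0.90  # hypothetical cut-off, not a value from the thesis


def decide(alert: Alert) -> Action:
    """Map one alert to one action, checking the most conservative rules first."""
    if alert.sensitive_context:
        return Action.HOLD_FOR_SECOND_SIGNAL
    if alert.low_volume or alert.score < HIGH_CONFIDENCE:
        return Action.MONITOR
    if alert.family in {"ddos", "portscan"}:
        return Action.QUARANTINE
    if alert.family == "bruteforce":
        return Action.RATE_LIMIT
    return Action.ISOLATE_PENDING_APPROVAL
```

Putting the sensitive-context and low-confidence checks first means the disruptive actions are only reachable when every safer rule has already declined to fire.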
Response is the least discussed part of cloud intrusion detection, yet it is where the pipeline proves its value.
What I Learned About Building for Operations
I could have built a single Jupyter notebook and called it a thesis. I chose a modular pipeline for three reasons that also matter in production.
The hardest part of building a cloud intrusion detection system is not the model. It is the operational context around it.
Separation of concerns. Each stage does one thing well: protocol parsing, ordered streaming, scoring. When each stage has a narrow contract, you can swap the parser, the message broker, or the model without rewriting the world.
Reproducibility by default. Every component is containerized. The dataset is pinned. The splits use a fixed random seed. The notebook regenerates all figures and tables from scratch. The repo is designed so anyone can clone it, run the quick-start commands, replay the demo traffic, and see the same stream I saw. If they do not, that is a bug, not a feature.
Latency budgeting. The system is designed around explicit time budgets: parsing and validation in milliseconds, featurization in tens of milliseconds, inference in tens of milliseconds, policy lookup in under ten milliseconds. Network policy application is the variable, but even that should land under a few hundred milliseconds in a healthy cluster. Designing to these budgets keeps the pipeline honest about whether it can actually meet the “cloud speed” requirement.
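A latency budget only helps if each stage is actually timed against it. The sketch below shows one way to do that with a small ledger; the class name and the specific millisecond values are placeholders I chose to match the orders of magnitude above (only the ten-millisecond policy budget is stated in the text), so treat them as assumptions.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict

# Hypothetical per-stage budgets in milliseconds.
BUDGET_MS: Dict[str, float] = {
    "parse": 5.0,
    "featurize": 50.0,
    "infer": 50.0,
    "policy": 10.0,
}


class LatencyLedger:
    """Accumulates wall-clock time per stage and reports budget violations."""

    def __init__(self, budgets_ms: Dict[str, float],
                 clock: Callable[[], float] = time.perf_counter):
        self.budgets_ms = budgets_ms
        self.elapsed_ms: Dict[str, float] = {}
        self._clock = clock  # injectable so tests can use a fake clock

    @contextmanager
    def stage(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            spent = (self._clock() - start) * 1000.0
            self.elapsed_ms[name] = self.elapsed_ms.get(name, 0.0) + spent

    def over_budget(self) -> Dict[str, float]:
        """Stages whose measured time exceeds their budget (unbudgeted stages pass)."""
        return {
            name: ms for name, ms in self.elapsed_ms.items()
            if ms > self.budgets_ms.get(name, float("inf"))
        }

    def total_ms(self) -> float:
        return sum(self.elapsed_ms.values())
```

Wrapping each stage in `with ledger.stage("featurize"):` keeps the measurement next to the code it measures, and `over_budget()` turns the budget into something a test can assert on.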
I apply the same reproducibility philosophy to my self-hosted cloud security lab.
The Numbers, Honestly
Let me be direct about the limitations, because a project write-up that hides its flaws is worse than no write-up at all.
The dataset is curated. The benchmark is convenient and labeled, but it is not real cloud traffic. The attacks are staged, the benign traffic is synthetic, and some columns contain artefacts that required aggressive cleaning. A model that works here needs validation on live data before it earns trust.
Attack family classification does not work yet. I built the scaffold: stratified splits, class weighting, seven attack groups. Every model I tested sits near chance on macro-averaged metrics. Denial-of-service, DDoS, and port scans share too many flow-level signatures to distinguish without richer context. Even a simplified “DoS vs. Other” task barely beats a coin flip.
This is a useful negative result: it tells me that single-flow features are insufficient for family tagging, and that the next iteration needs host-level aggregates, burstiness metrics, or short sequence windows.
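To make "host-level aggregates, burstiness metrics, or short sequence windows" concrete, here is a rough sketch of a per-source sliding window. It is a possible direction, not something the current pipeline implements: the class name, the ten-second window, and the choice of coefficient of variation of inter-arrival gaps as a burstiness measure are all assumptions, and records are assumed to arrive in timestamp order.

```python
import statistics
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class HostWindow:
    """Tracks recent flows per source host and derives window-level features."""

    def __init__(self, window_s: float = 10.0):  # hypothetical window length
        self.window_s = window_s
        self._events: Dict[str, Deque[Tuple[float, int]]] = defaultdict(deque)

    def update(self, ts: float, src: str, dst_port: int) -> Dict[str, float]:
        """Record one flow and return the aggregates for its source host."""
        events = self._events[src]
        events.append((ts, dst_port))
        # Evict flows that fell out of the window.
        while events and events[0][0] < ts - self.window_s:
            events.popleft()

        times = [t for t, _ in events]
        gaps = [b - a for a, b in zip(times, times[1:])]
        mean_gap = statistics.fmean(gaps) if gaps else 0.0
        # Coefficient of variation: 0 for perfectly regular arrivals,
        # larger when arrivals come in bursts.
        gap_cv = statistics.pstdev(gaps) / mean_gap if mean_gap > 0 else 0.0

        return {
            "flows_in_window": float(len(events)),
            "distinct_dst_ports": float(len({p for _, p in events})),
            "gap_cv": gap_cv,
        }
```

A port scan and a volumetric flood can look alike flow by flow, but they would be expected to differ on features like distinct destination ports per window, which is the kind of context single-flow statistics cannot carry.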
Improving cloud intrusion detection accuracy requires richer features beyond single-flow statistics.
There is no live deployment. The pipeline is designed for latency measurement, but the only number I can credibly report today is model compute time. Detection latency, response latency, and end-to-end latency are specified as a measurement plan, not a benchmark. I have not stood up the inference service or policy controller in a real cluster yet.
Real-world cloud intrusion detection validation is the next milestone, not an afterthought.
Precision is inherently low at this scale. With a 9% false positive rate on benign traffic that dominates by volume, you will still see hundreds of false alarms per hour in a busy environment.
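The arithmetic behind that claim is worth making explicit. The sketch below computes precision and hourly false alarms from recall, false positive rate, and the share of traffic that is actually malicious; the function names are mine, and the 1% attack share and 5,000 flows per hour in the note that follows are hypothetical inputs, not measurements.

```python
def precision(recall: float, fpr: float, attack_fraction: float) -> float:
    """Fraction of alerts that are real attacks, given the base rate of attacks."""
    true_alerts = recall * attack_fraction
    false_alerts = fpr * (1.0 - attack_fraction)
    if true_alerts + false_alerts == 0:
        return 0.0
    return true_alerts / (true_alerts + false_alerts)


def false_alarms_per_hour(fpr: float, flows_per_hour: float,
                          attack_fraction: float) -> float:
    """Expected number of benign flows flagged per hour."""
    return fpr * flows_per_hour * (1.0 - attack_fraction)
```

With the operating point from this post (40% recall, 9% false positive rate) and an assumed 1% attack share, `precision(0.40, 0.09, 0.01)` comes out at roughly 4%, and `false_alarms_per_hour(0.09, 5000, 0.01)` at about 445. The base rate, not the model, dominates the result.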
The saving grace is that these are flow-level alerts, and a production system would correlate them into incident-level signals before a human ever sees them. In security ML, this is normal. You fix an operating point at the flow level, then correlate and filter.
Correlating flow-level alerts into incident-level signals is standard practice in cloud intrusion detection operations.
Reproduce It If You Want
The pipeline is open source and documented. If you want to run it yourself, you need Docker Desktop, a free afternoon, and optionally the public benchmark dataset if you want to rerun the modeling notebook.
If you want to experiment with cloud intrusion detection, the entire stack is open source.
The README has a quick-start:
- make up: start the streaming backbone and Zeek parser
- make consumer: start the Python shipper that feeds Redis
- make replay: replay demo traffic through the pipeline
- make logs: watch live connection logs in real time
To reproduce the thesis results, download the dataset, place the CSVs in data/TrafficCSV/, install the Python requirements, and run data_prep.ipynb top to bottom. Every figure, every table, every metric is regenerated from the raw data.
Final Thoughts
This project taught me that detection is only the beginning. Anyone can train a model on a public dataset and report an AUC. The harder work is everything around it: cleaning the data honestly, splitting it without leakage, calibrating the scores so thresholds mean something, designing a response path that is fast but not reckless, and documenting the whole thing so someone else can pick it up.
This cloud intrusion detection pipeline is a proof of concept, not a product.
The most important skill I practiced was operational thinking. Security ML is not a competition. An 85% ROC-AUC means nothing if the model takes five seconds to score, or if it generates ten thousand false positives an hour, or if the response path requires three manual approvals before it can act. The question is not “how accurate is this?” but “can a team actually run this in anger?”
The honest answer for this project is: partially. The ingestion pipeline works. The data hygiene is solid. The detection gate is calibrated and reproducible. The response path is designed but not yet deployed. Attack family tagging needs richer features. And nothing here has seen real cloud traffic.
But that is exactly what a good write-up should show: not a finished product, but a clear trail of decisions, trade-offs, and next steps.
If you are building a cloud intrusion detection system, start with reproducibility. Everything else follows.
The most reliable signal of whether someone will improve a system is whether they can tell you exactly where it breaks.
If you are interested in defensive baseline work, I also documented how I automated asset discovery and CVE reporting in a home lab.
Questions about the pipeline? Want to compare notes on cloud-native detection or Kubernetes security? I am always happy to chat. Reach out on LinkedIn or GitHub. And if you build something similar, I would genuinely love to hear what you learned.