Story 21 — First Week on the Floor · Gary's Security Stories

Story 21 · Domain 4 · Alerting, Monitoring, Incident Response

First Week on the Floor.

The Sentinel queue has 1,400 unacknowledged alerts. The senior engineer tells her most of them have been sitting there for three weeks. The SOC lead tells her the on-call rotation has stopped checking overnight. Nobody says it directly: the SIEM has become furniture.

By The Editors · Photography: Cipher Lane Observer · 7 min read

Priya Chandrasekaran spent five years doing application security at a consultancy before deciding she wanted to work closer to the wire. Her first morning at Veritas Payments — a mid-size London fintech processing card transactions for four hundred UK retailers — she is shown to a standing desk, handed a laptop, and introduced to the Microsoft Sentinel deployment that nobody has touched in eleven months.

The SOC lead, Dermot, is forty-two, visibly tired, and apologetic. "We brought in Sentinel as part of the SOC 2 push. It's ingesting. The rules run. The alerts fire. But we've got three people rotating on-call and nobody's had time to tune anything." He gestures at the queue. "Most of what's in there is noise. But some of it probably isn't."

The senior detection engineer, Yemi, pulls a chair over on day one and gives Priya the honest version. "The problem isn't that we lack visibility. The problem is we've connected everything and written no rules of our own. We're running the vendor defaults against a production environment we haven't baselined. Every new log source fires on its defaults for about three days and then gets ignored."

Priya opens a blank Notion page and writes one line: understand what we're ingesting before touching a single rule.

A SIEM — Security Information and Event Management — is not magic. It is a pipeline. Log data from multiple sources flows in, is parsed and normalised into a common schema, and is then compared against correlation rules. When a rule fires, an alert is generated. The SIEM does not decide whether something is malicious. That is still the analyst's job. The SIEM's job is to surface the right patterns quickly enough that a human can investigate them inside a useful window of time.

Veritas's Sentinel ingests from five primary sources: Azure Active Directory sign-in logs, the Palo Alto firewall at the network perimeter, a CrowdStrike EDR deployment across all endpoints, Microsoft 365 mailbox audit logs, and, since last quarter, cloud audit trails from Azure. Each source arrives in a different format. Sentinel's log parsers normalise them into a common schema; the Common Event Format (CEF) is one widely used example, the SIEM equivalent of a lingua franca. Without normalisation, correlating a firewall event with an Active Directory event is like reading two books written in different languages.
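To make the pipeline concrete, here is a minimal sketch of normalisation followed by one very simple correlation. It is not how Sentinel's parsers work internally. The `NormalisedEvent` schema, the raw field names, and the sample records are all hypothetical, chosen only to show two differently shaped logs becoming comparable.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class NormalisedEvent:
    """Hypothetical common schema; real SIEM schemas carry many more fields."""
    timestamp: datetime
    source: str            # which log source produced the event
    src_ip: str
    user: Optional[str]
    action: str


def normalise_firewall(raw: dict) -> NormalisedEvent:
    # Assumed firewall shape: epoch seconds, 'src', 'action'.
    return NormalisedEvent(
        timestamp=datetime.fromtimestamp(raw["epoch"], tz=timezone.utc),
        source="firewall",
        src_ip=raw["src"],
        user=None,
        action=raw["action"],
    )


def normalise_signin(raw: dict) -> NormalisedEvent:
    # Assumed sign-in shape: ISO 8601 time, 'ip', 'user', 'result'.
    return NormalisedEvent(
        timestamp=datetime.fromisoformat(raw["time"]),
        source="signin",
        src_ip=raw["ip"],
        user=raw["user"],
        action=raw["result"],
    )


def correlate_by_ip(events):
    """Group events by source IP; keep IPs seen in more than one log source."""
    by_ip = {}
    for event in events:
        by_ip.setdefault(event.src_ip, []).append(event)
    return {
        ip: evs for ip, evs in by_ip.items()
        if len({e.source for e in evs}) > 1
    }


# Invented sample data using documentation address ranges.
events = [
    normalise_firewall({"epoch": 1714554000, "src": "203.0.113.7", "action": "deny"}),
    normalise_signin({"time": "2024-05-01T09:00:05+00:00", "ip": "203.0.113.7",
                      "user": "user@example.com", "result": "failure"}),
    normalise_firewall({"epoch": 1714554060, "src": "198.51.100.9", "action": "allow"}),
]
matches = correlate_by_ip(events)
print(sorted(matches))
```

Once both sources share field names, the correlation step is a plain group-by. Without the two `normalise_*` functions, the firewall's `src` and the sign-in log's `ip` would never meet.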

The actual forwarding is handled by two log forwarders: Fluent Bit running on Linux hosts, routing syslog-format events into the Sentinel workspace; and the Palo Alto's native HTTPS export to the Azure Monitor ingestion endpoint. Priya reads the forwarder config on day one because log gaps are invisible until you look for them. Three on-premises servers in the Manchester office are not forwarding anything.

She adds a second line to the Notion page: Manchester is blind.
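Looking for log gaps can be as simple as comparing an inventory of hosts that should be forwarding against the last time each one was actually heard from. The sketch below assumes both lists are available as plain Python data. The host names, the 24-hour silence threshold, and the timestamps are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_silent_hosts(expected_hosts, last_seen, now,
                      max_silence=timedelta(hours=24)):
    """Return hosts that never reported, or whose latest event is too old."""
    silent = []
    for host in sorted(expected_hosts):
        seen = last_seen.get(host)
        if seen is None or now - seen > max_silence:
            silent.append(host)
    return silent


# Hypothetical inventory and last-seen timestamps.
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
expected = {"lon-app-01", "lon-app-02", "man-srv-01", "man-srv-02", "man-srv-03"}
last_seen = {
    "lon-app-01": now - timedelta(minutes=3),
    "lon-app-02": now - timedelta(minutes=5),
}

silent_hosts = find_silent_hosts(expected, last_seen, now)
print(silent_hosts)  # ['man-srv-01', 'man-srv-02', 'man-srv-03']
```

The important input is the inventory. A SIEM can only report on hosts it has heard from, so the list of expected hosts has to come from somewhere else.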

A SIEM that fires on everything is the same as a SIEM that fires on nothing. The queue becomes a wall. The wall becomes wallpaper. The wallpaper becomes the compromise you didn't notice. — Story 21 · Alerting & Monitoring

On Tuesday, Priya pulls the alert volume stats for the past thirty days. The raw number is 41,200 alerts. Dermot's on-call rotation has three people. The average MTTD — mean time to detect, the gap between an event occurring and an analyst acknowledging it — is currently eleven hours and forty minutes. The MTTR — mean time to respond, detection to containment — is not formally tracked at all. Priya writes both numbers on the whiteboard in the SOC room and does not erase them. They are the two metrics that will tell you whether a detection programme is functioning. Everything else is commentary.
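Both metrics are plain averages over timestamp pairs. The sketch below uses the definitions given above: MTTD from occurrence to acknowledgement, MTTR from acknowledgement to containment. The two incident records are invented, picked so that the MTTD works out to the figure on the whiteboard; the field names are assumptions, not a Sentinel export format.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_delta(pairs):
    """Mean gap between (start, end) timestamp pairs, as a timedelta."""
    return timedelta(
        seconds=mean((end - start).total_seconds() for start, end in pairs)
    )


# Invented incident records for illustration only.
incidents = [
    {"occurred": datetime(2024, 5, 1, 1, 0),
     "acknowledged": datetime(2024, 5, 1, 13, 0),
     "contained": datetime(2024, 5, 1, 15, 0)},
    {"occurred": datetime(2024, 5, 2, 9, 0),
     "acknowledged": datetime(2024, 5, 2, 20, 20),
     "contained": datetime(2024, 5, 2, 23, 20)},
]

mttd = mean_delta((i["occurred"], i["acknowledged"]) for i in incidents)
mttr = mean_delta((i["acknowledged"], i["contained"]) for i in incidents)

print("MTTD:", mttd)  # MTTD: 11:40:00
print("MTTR:", mttr)  # MTTR: 2:30:00
```

MTTR cannot be computed at all unless someone records when containment happened, which is why an untracked MTTR is a process gap rather than a tooling gap.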

The Tuesday afternoon problem is alert fatigue. When false positives dominate the queue, analysts stop looking. It is not laziness; it is learned helplessness. When every investigation leads to a dead end, the rational response is to triage less aggressively. The dangerous outcome of alert fatigue is not exhaustion — it is the false negative, the real incident that fires an alert that nobody checks because it looks like the other four hundred alerts from that day.

Priya categorises the current alert rule set. Sixty-three percent of alert volume comes from nine rules. Of those nine, seven are set with thresholds calibrated against Microsoft's generic benchmark tenant, not against Veritas's actual environment. A rule that fires when a user signs in from two countries in under an hour is reasonable for most companies. Veritas has a London engineering team and an Edinburgh QA team whose VPN exit nodes are both flagged as high-risk. The rule fires on them constantly.
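One way to picture the problem is a simplified version of the two-countries-in-an-hour rule with a narrow suppression for known corporate VPN egress addresses. This is a sketch under assumptions, not the Sentinel rule itself: the `SignIn` record, the `TRUSTED_EGRESS` ranges, and the sample sign-ins are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class SignIn:
    user: str
    time: datetime
    ip: str
    country: str


# Hypothetical corporate VPN egress ranges (documentation address space).
TRUSTED_EGRESS = [ip_network("192.0.2.0/28"), ip_network("198.51.100.0/28")]


def is_trusted(ip: str) -> bool:
    addr = ip_address(ip)
    return any(addr in net for net in TRUSTED_EGRESS)


def impossible_travel(signins, window=timedelta(hours=1)):
    """Flag consecutive sign-ins per user from different countries inside the
    window, unless BOTH sign-ins came from trusted egress addresses."""
    alerts = []
    last = {}
    for s in sorted(signins, key=lambda s: s.time):
        prev = last.get(s.user)
        if prev and s.country != prev.country and s.time - prev.time < window:
            if not (is_trusted(s.ip) and is_trusted(prev.ip)):
                alerts.append((prev, s))
        last[s.user] = s
    return alerts


# Invented sample sign-ins.
signins = [
    SignIn("a", datetime(2024, 5, 1, 9, 0), "192.0.2.5", "GB"),
    SignIn("a", datetime(2024, 5, 1, 9, 20), "198.51.100.3", "NL"),   # both trusted
    SignIn("b", datetime(2024, 5, 1, 9, 0), "192.0.2.6", "GB"),
    SignIn("b", datetime(2024, 5, 1, 9, 30), "203.0.113.50", "BR"),   # untrusted end
]

alerts = impossible_travel(signins)
print([(prev.user, prev.country, cur.country) for prev, cur in alerts])
```

The suppression only applies when both ends of the pair are trusted. A stolen credential used from an unknown address still fires, even if the legitimate user is on the VPN at the time.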

Yemi watches from across the room. "You're finding what I expected you to find," he says. "The question is what you do about it."

Alert tuning is not switching rules off. Alert tuning is adjusting the precision of detection — reducing the ratio of false positives to total alerts — without reducing recall, the ability to catch real incidents. The tension is real. Raise the threshold to cut noise and you risk missing a genuine lateral-movement event that falls just below it. Lower the threshold and the queue fills with VPN false positives. The right answer is not a number; it is a baseline.
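The trade-off is easy to see with numbers. The sketch below assumes each event carries a detection score and a ground-truth label, which real queues rarely have; the scores and labels are invented. An alert fires when the score meets the threshold.

```python
def evaluate(scored_events, threshold):
    """Return (precision, recall) for alerts that fire at or above threshold."""
    fired = [is_real for score, is_real in scored_events if score >= threshold]
    true_positives = sum(fired)
    total_real = sum(is_real for _, is_real in scored_events)
    precision = true_positives / len(fired) if fired else 0.0
    recall = true_positives / total_real if total_real else 0.0
    return precision, recall


# Invented (score, is_real_incident) pairs.
events = [
    (0.95, True), (0.90, True), (0.62, True),
    (0.70, False), (0.55, False), (0.50, False), (0.45, False), (0.40, False),
]

for threshold in (0.4, 0.6, 0.8):
    precision, recall = evaluate(events, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

At 0.4 everything fires and precision is poor. At 0.6 most of the noise is gone and every real incident still alerts. At 0.8 the queue is perfectly clean and one real incident, the one that scored 0.62, has been missed.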

Wednesday is baseline day. Priya and the L1 analyst, Tobi, run a NetFlow analysis against the previous six weeks of firewall logs. NetFlow — and its successor protocol IPFIX — records traffic metadata: source IP, destination IP, port, protocol, byte count, packet count, duration. It does not capture the payload. What it gives you is a map of normal: which hosts talk to which other hosts, at what volume, on which ports. This is east-west visibility, the internal network traffic that perimeter firewalls do not see. An attacker who is already inside can move laterally for weeks without triggering any perimeter rule. NetFlow shows you the inside.

After six hours, Priya has a baseline for Veritas's internal traffic. The payment processing cluster has a known, stable communication pattern: a fixed set of hosts, fixed ports, steady throughput that spikes predictably on Friday afternoons. The developer workstations are noisier but follow a pattern. The Manchester servers, when they eventually get their forwarder installed by Raj the platform engineer on Wednesday afternoon, show something unexpected: one host has been generating outbound HTTPS connections to a residential broadband range in Leeds at volumes three times higher than any other host on the estate. The connections started nine days ago.
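A crude version of that comparison can be sketched by totalling outbound bytes per source host and flagging any host well above the median. The flow records, the addresses, and the factor of three are assumptions for illustration; real NetFlow or IPFIX records carry more fields, and a real baseline would also account for time of day and host role.

```python
from collections import defaultdict
from statistics import median


def outbound_bytes_by_host(flows):
    """Sum bytes per source address across flow records."""
    totals = defaultdict(int)
    for flow in flows:
        totals[flow["src"]] += flow["bytes"]
    return dict(totals)


def flag_outliers(totals, factor=3.0):
    """Hosts whose total exceeds factor x the median across all hosts."""
    baseline = median(totals.values())
    return sorted(host for host, b in totals.items() if b > factor * baseline)


# Invented flow records: source, destination, destination port, byte count.
flows = [
    {"src": "10.0.1.10", "dst": "198.51.100.20", "dport": 443, "bytes": 1000},
    {"src": "10.0.1.10", "dst": "198.51.100.21", "dport": 443, "bytes": 1200},
    {"src": "10.0.1.11", "dst": "198.51.100.20", "dport": 443, "bytes": 2000},
    {"src": "10.0.1.12", "dst": "198.51.100.22", "dport": 443, "bytes": 1800},
    {"src": "10.0.9.5", "dst": "203.0.113.77", "dport": 443, "bytes": 9000},
]

totals = outbound_bytes_by_host(flows)
outliers = flag_outliers(totals)
print(outliers)  # ['10.0.9.5']
```

The median is used rather than the mean so that the outlier itself does not drag the baseline upward and hide.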

Priya does not immediately open the host. She opens a new incident ticket. She writes the time in the header and takes a screenshot of the raw NetFlow before touching anything.

That discipline has a name: the seven phases of incident response. Preparation: have the playbooks, tools, and trained people in place before an incident occurs. Identification: detect the anomaly and determine whether it constitutes an incident. Containment — short-term: stop the bleeding immediately, before root cause is established; isolate the host, block the egress path. Containment — long-term: apply durable controls that hold while investigation continues. Eradication: remove the threat from the environment entirely — malware, persistence mechanism, rogue account, whatever it is. Recovery: restore service in a verified-clean state, retest, reaudit. Lessons learned: post-incident review, five whys, timeline, update the playbook. The phases are a sequence, not a checklist. You can cycle back. Containment may reveal new scope that restarts the identification phase. The model is iterative.
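The "sequence, not a checklist" point can be modelled as a small state machine: each phase normally moves to the next one, with a few permitted loops back. The specific cycle-backs allowed here (either containment phase back to identification, lessons learned back to preparation) are an assumption drawn from the description above, not an official transition table.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT_SHORT = 3
    CONTAINMENT_LONG = 4
    ERADICATION = 5
    RECOVERY = 6
    LESSONS_LEARNED = 7


# Assumed cycle-backs permitted on top of the normal forward step.
CYCLE_BACK = {
    Phase.CONTAINMENT_SHORT: {Phase.IDENTIFICATION},
    Phase.CONTAINMENT_LONG: {Phase.IDENTIFICATION},
    Phase.LESSONS_LEARNED: {Phase.PREPARATION},
}


def advance(current: Phase, target: Phase) -> Phase:
    """Move to target if it is the next phase or a permitted cycle-back."""
    is_next = target.value == current.value + 1
    if is_next or target in CYCLE_BACK.get(current, set()):
        return target
    raise ValueError(f"cannot move from {current.name} to {target.name}")


# Containment reveals new scope, so the incident loops back once.
path = [
    Phase.IDENTIFICATION,
    Phase.CONTAINMENT_SHORT,
    Phase.IDENTIFICATION,
    Phase.CONTAINMENT_SHORT,
    Phase.CONTAINMENT_LONG,
]
phase = Phase.PREPARATION
for step in path:
    phase = advance(phase, step)
print(phase.name)  # CONTAINMENT_LONG
```

Skipping straight from preparation to recovery raises an error, which is the point: the model tolerates going back, not jumping ahead.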

Yemi sits with Priya through the investigation. The Manchester host turns out to be a file transfer agent that a developer installed twelve months ago to move large test datasets to an external staging environment. It was never documented. It was never decommissioned when the project ended. It has been running continuously, transferring nothing useful, generating outbound connections that look suspicious in isolation and mundane once traced. Not a compromise. A ghost process left by a project nobody closed properly. MTTD on this one: nine days. Not because the detection failed — there was no detection rule for it — but because Manchester was not in the logging estate until today.

Priya closes the incident as resolved and immediately opens a new ticket: audit all log sources for gaps. She has found the pattern. Blind spots in the logging estate are the places incidents hide.

Thursday, Priya drafts the alerting maturity framework on the whiteboard. At the bottom: everything pages, nothing is tuned, analysts are exhausted. One level up: rules are baselined to the environment, false positive rate is below twenty percent, on-call can actually investigate. Above that: MTTD under sixty minutes, MTTR tracked and improving, suppression and deduplication rules maintained on a monthly cadence. At the top: proactive threat hunting — humans searching through logs for threats that automated rules haven't identified, on the assumption that a compromise may already exist that nobody has noticed.

The maturity ladder is not a destination. It is a direction. Veritas is, on Thursday morning, somewhere between the bottom rung and the second. By Friday, with the nine noisy rules retuned against the baseline, the alert volume drops from 1,400 daily to 310. Dermot looks at the number for a long moment. "That's still a lot," he says. "But it's a number you can read."

Priya writes the new MTTD on the whiteboard under the old one. Eleven hours forty minutes. Then: four hours twelve minutes. It is Friday afternoon. She has been in the job five days.

// SOC NAPKIN

The 7 phases of IR.

1 — Preparation

2 — Identification

3 — Containment (short-term)

4 — Containment (long-term)

5 — Eradication

6 — Recovery

7 — Lessons Learned

Note: NIST SP 800-61 Rev. 2 uses four phases (Preparation / Detection & Analysis / Containment, Eradication & Recovery / Post-Incident Activity). SY0-701 expects the seven-phase model. Know both.

// Alerting Maturity Ladder

Level 0: everything fires, nothing is tuned, on-call is broken.

Level 1: environment baseline, FP rate <20%, MTTD tracked.

Level 2: MTTR tracked, suppression on cadence, SLAs enforced.

Level 3: proactive threat hunting alongside automated detection.

// Precision vs Recall

Precision: of the alerts that fired, how many were real? High precision = low noise.

Recall: of the real incidents, how many fired an alert? Low recall = missed attacks. Tune both, not just one.

// NetFlow / IPFIX

Traffic metadata without payload: who → who, port, bytes, duration. East-west visibility. Baseline normal, then detect deviation.

// Terms Introduced

SIEM (collect, normalise, correlate, alert)
Log aggregation / CEF normalisation
Fluent Bit / Filebeat (log forwarders)
Correlation rule
Alert fatigue / alert tuning
True / false positive / false negative
MTTD & MTTR
NetFlow / IPFIX (traffic telemetry)
7 phases of incident response
Threat hunting (proactive, human-led)
Log retention & cost

Priya's detection rule for impossible travel is generating hundreds of false positives per day against Veritas's VPN-heavy workforce. She wants to reduce noise without creating blind spots for genuine account compromise. Which action BEST addresses this?
