Story 12 — The Quarterly Drill · Gary's Security Stories

Story 12 · Domain 3 · Resilience & Backups

The Quarterly Drill.

NHS Hartfield Trust runs a scheduled DR failover every quarter. The BCP says RTO is two hours. The CISO says the BCP is aspirational. Infrastructure engineer Ifeoma Adeyemi has six hours on a Saturday morning to find out which one is correct.

By The Editors · Photography: Cipher Lane Observer · 6 min read

07:00. Ifeoma pulls the failover lever in the change management console and the primary clinical systems stop answering. For the next four hours, NHS Hartfield Trust runs on its warm site in Coventry. Somewhere on a ward, a nurse notices her order form is a fraction slower to load. That fraction is the gap between the BCP as written and the BCP as lived.

The quarterly DR exercise is booked for a Saturday in April. The window opens at 06:00 and must close by 12:00. Hartfield's primary data centre is an on-premises facility at the main hospital site. The secondary site is a colocation rack sixty miles north, connected via a leased line. The exercise is a controlled failover of the trust's clinical systems to the secondary site and back again — four hours of simulated primary-site loss, full production traffic on the standby, then a failback before the afternoon clinics resume.

Ifeoma arrives at 05:45. She opens the BCP document in Confluence (reference BCM-0021) and reads the two numbers that matter. The RPO (Recovery Point Objective) is fifteen minutes. That is the maximum acceptable data loss: the furthest back in time Hartfield is permitted to be when primary systems come back online. RPO looks backwards. It answers the question: how much data can we afford to lose? In a hospital context, fifteen minutes is not a comfortable figure. It means that in the worst case, fifteen minutes of patient observations, medication orders, and test results would have to be re-entered from paper records. The clinical informatics team set it. It is driven by patient safety, not by what is technically convenient.

The second number is the RTO — Recovery Time Objective: two hours. That is the maximum acceptable downtime from the moment of declared failure to the moment clinical systems are fully operational on the standby site. RTO looks forwards. It answers the question: how long can we be down? Two hours at 03:00 on a weekday is survivable. Two hours from 08:30 on a Monday morning, with outpatient clinics starting, is not. The numbers are in the BCP. The drill exists to prove they are real.

Ifeoma writes the two numbers on the whiteboard in the comms room and underlines them. The team is seven people. Nobody is treating this as a box-tick.

The storage architecture underpinning both sites is the first thing she walks the two junior engineers through at 06:10. The primary site runs a NetApp AFF all-flash array. The secondary runs a lower-tier Pure Storage FlashArray. Both use RAID internally to protect against drive failure — and the RAID level choice is not arbitrary.

RAID 0 stripes data across multiple drives for maximum read and write throughput. If one drive fails, all data is gone. There is no redundancy whatsoever. Hartfield does not use RAID 0 in production storage. It appears here only because the SY0-701 exam expects you to know it exists.

RAID 1 mirrors every write to two drives simultaneously. One drive can fail and the other continues without interruption. The cost is capacity: you pay for two drives and get one drive's worth of usable space. Used in the trust's domain controllers, where the data set is small and the integrity requirement is absolute.

RAID 5 distributes data and a single parity block across a minimum of three drives. If any one drive fails, the array rebuilds the missing data mathematically from the parity. One failure tolerated. This is the configuration on the trust's file servers — a sensible balance of capacity efficiency and fault tolerance for large data sets.

RAID 6 extends RAID 5 by writing two independent parity blocks. Any two drives can fail simultaneously and the array keeps running. The secondary site's Pure Storage array uses RAID 6 internally for exactly this reason: a standby array may sit idle for months, then be asked to carry full production load on short notice. The risk of a second drive failure during a rebuild — already elevated when one drive has just failed — is higher on a system that has been dormant. Double parity is the appropriate response.

RAID 10 (also written 1+0) mirrors drives in pairs and then stripes across the pairs. It combines the write safety of RAID 1 with the read performance of RAID 0. Minimum four drives. The clinical database servers at the primary site use RAID 10: the electronic patient record database writes constantly and a slow rebuild during peak hours is clinically unacceptable.
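The trade-offs across these RAID levels can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how either vendor's array works internally: `xor_blocks` and `raid_summary` are hypothetical names, parity is plain byte-wise XOR over a single stripe, all drives are the same size, and the RAID 10 tolerance is the worst case (a second failure landing in the same mirror pair).

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def raid_summary(level, drives, drive_tb):
    """Return (usable_tb, failures_always_survived) for a RAID level."""
    minimums = {"0": 2, "1": 2, "5": 3, "6": 4, "10": 4}
    if level not in minimums:
        raise ValueError(f"unsupported RAID level: {level}")
    if drives < minimums[level]:
        raise ValueError(f"RAID {level} needs at least {minimums[level]} drives")
    if level == "10" and drives % 2:
        raise ValueError("RAID 10 needs an even number of drives")
    if level == "0":
        return drives * drive_tb, 0        # pure striping, no redundancy
    if level == "1":
        return drive_tb, drives - 1        # every drive holds a full copy
    if level == "5":
        return (drives - 1) * drive_tb, 1  # one drive's worth of parity
    if level == "6":
        return (drives - 2) * drive_tb, 2  # two drives' worth of parity
    return (drives // 2) * drive_tb, 1     # RAID 10, worst-case tolerance


# RAID 5 style rebuild: one stripe of three data blocks plus its parity block.
data = [b"obs1", b"med2", b"lab3"]
parity = xor_blocks(data)

# Lose the second block, then recover it from the survivors and the parity.
recovered = xor_blocks([data[0], data[2], parity])
```

Under these assumptions, `raid_summary("5", 4, 2)` returns `(6, 1)` and `raid_summary("6", 6, 2)` returns `(8, 2)`, which is the capacity cost of the extra parity block in concrete terms.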

At 06:30, Ifeoma's colleague Ajit Singh talks through the backup architecture before the exercise begins. The trust uses Veeam Backup & Replication as its primary backup platform, with jobs tiered across three storage targets. This is the 3-2-1 rule made concrete: three copies of data, on two different media types, with one copy offsite. Veeam writes to a local backup repository on the primary site's NetApp, replicates incrementally to the secondary site, and pushes a nightly archive to Azure Blob Storage via Veeam's cloud tier. Three copies. Primary site array, secondary site array, Azure object storage. Two media types. On-premises block storage and cloud object storage. One offsite. Azure, in a different physical region entirely.
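The 3-2-1 rule is simple enough to check mechanically. The sketch below is illustrative only: `BackupCopy` and `satisfies_3_2_1` are hypothetical names, the media labels are free-text assumptions, and the offsite flags follow the story's own accounting, which counts Azure as the offsite copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str     # free-text media label, e.g. "block storage"
    offsite: bool


def satisfies_3_2_1(copies):
    """True with at least 3 copies, 2 media types, and 1 offsite copy."""
    media_types = {copy.media for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite


# The three targets described in the story (flags are assumptions).
hartfield = [
    BackupCopy("primary site repository", "block storage", offsite=False),
    BackupCopy("secondary site replica", "block storage", offsite=False),
    BackupCopy("Azure Blob archive", "cloud object storage", offsite=True),
]

compliant = satisfies_3_2_1(hartfield)
```

Dropping the Azure archive from the list fails all three conditions at once, which is why that nightly job carries more weight than its schedule suggests.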

The backup schedule itself is layered. Every Sunday at 22:00, a full backup runs against all clinical systems: the electronic patient record, the radiology PACS, the pharmacy dispensing system, and the theatre scheduling application. A full backup captures everything: every file, every database page, every configuration. It is slow to create and large, but it is the simplest to restore from. Monday through Saturday, incremental backups run every fifteen minutes, capturing only the blocks that changed since the previous backup job. Fifteen-minute increments are what make the fifteen-minute RPO achievable. A differential backup, by contrast, captures everything that changed since the last full backup. Differentials grow larger each day but keep the restore chain short: you need only the Sunday full and the latest differential. Each incremental is small, but a restore needs the whole chain: the Sunday full plus every incremental taken since, applied in sequence.
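The difference between the two restore chains can be made concrete with a short sketch. It assumes a chronological list of `(label, kind)` tuples that starts with a full backup; `restore_chain` is a hypothetical helper written for illustration and is not part of Veeam or any other backup product.

```python
def restore_chain(backups):
    """Return the labels needed to restore the newest point, in restore order.

    `backups` is a chronological list of (label, kind) tuples, where kind is
    "full", "incremental", or "differential". The first entry must be a full.
    """
    if not backups or backups[0][1] != "full":
        raise ValueError("the backup history must start with a full backup")

    chain = []
    i = len(backups) - 1
    while True:
        label, kind = backups[i]
        chain.append(label)
        if kind == "full":
            break
        if kind == "differential":
            # A differential depends only on the most recent full before it.
            while backups[i][1] != "full":
                i -= 1
            chain.append(backups[i][0])
            break
        i -= 1  # an incremental depends on whatever backup preceded it
    return chain[::-1]


incremental_week = [
    ("Sun full", "full"),
    ("Mon inc", "incremental"),
    ("Tue inc", "incremental"),
]
differential_week = [
    ("Sun full", "full"),
    ("Mon diff", "differential"),
    ("Tue diff", "differential"),
]
```

With these inputs, the incremental history needs all three backups, while the differential history needs only the Sunday full and Tuesday's differential.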

The NetApp array also takes local Snapshot copies every five minutes: point-in-time volume images that restore almost instantly without traversing a backup chain. Snapshots are not backups: they live on the same storage system and offer no protection against site failure. But for fast rollback from a bad deployment or an accidental mass-delete, they are the right tool.

Ifeoma stands in front of the whiteboard at 06:55. "The goal of the next four hours," she says, "is to find out whether BCM-0021 is a real plan or a document that makes auditors comfortable."

At 07:00, she triggers the controlled failover. This is where the DR site classification becomes tangible. Hartfield's secondary site is a warm site: the hardware is racked and powered, the network is provisioned, the operating system images are deployed, and the Veeam replica jobs keep the secondary copies within fifteen minutes of the primary. But the clinical applications are not actively running there. Activating the standby requires Veeam to perform a failover job — promoting the replica VMs, adjusting DNS, and cutting over the load balancer — a process that takes time. A hot site would be running all clinical applications in parallel, with traffic flowing to both sites simultaneously, ready for instant failover with no activation delay. Hot sites are used by the trust's payroll and HR systems, which run on Azure with active/active geo-redundancy. The cost for a hot site across all clinical systems would be prohibitive. The warm site is the clinical compromise: fast enough, affordable enough, contractually defensible.

A cold site is the third tier — a rented data centre space with power and cooling but no provisioned hardware. Cold sites exist for catastrophic scenarios where the primary and secondary both fail. Activation takes weeks. For Hartfield, a cold site arrangement exists with a managed service provider in Birmingham as a last-resort option documented in the major incident plan, separate from the quarterly drill.

At 07:44, the secondary site's Veeam failover job completes. Clinical traffic is being served from the secondary site. The first clock stops. That is forty-four minutes from declared failure to clinical systems online — well inside the two-hour RTO. The Recovery Time Objective has been demonstrated.
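The RTO check is simple arithmetic on two clock readings. A minimal sketch, assuming both times fall on the same day and using hypothetical names (`clock`, `RTO`); the two-hour figure is the one the story quotes from BCM-0021.

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=2)  # objective quoted from BCM-0021 in the story


def clock(hhmm):
    """Parse an HH:MM reading; only same-day differences are meaningful."""
    return datetime.strptime(hhmm, "%H:%M")


declared_failure = clock("07:00")   # controlled failover triggered
service_restored = clock("07:44")   # failover job completes on the standby

elapsed = service_restored - declared_failure
within_rto = elapsed <= RTO
headroom = RTO - elapsed
```

Here `elapsed` is 44 minutes, leaving 76 minutes of headroom against the objective.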

Ifeoma marks the time in the drill log and moves immediately to the harder question: what does the data look like?

The data validation check is where drills reveal real problems. Ajit runs a restore test against the incremental backup chain from the previous night. Not a failover test — a restore test. The distinction is important and consistently misunderstood. Failing over to a replica proves that replication worked. Restoring from a backup chain proves the backup itself is valid. These are different operations. Many organisations have discovered, mid-incident, that their backup jobs completed successfully for months but the backup data was corrupt — a database backup that captured a locked file, a Veeam job that ran against an application that had crashed mid-write, a cloud archive whose retention policy had expired the data they needed. Testing the restore, not just the backup, is the only way to find this out before it matters.

The restore test surfaces something. The pharmacy system's incremental chain has a 22-minute gap from three days earlier — a failed job that retried successfully but captured a point twenty-two minutes later than expected. Had this been a real incident, the effective RPO for pharmacy would have been thirty-seven minutes, not fifteen. It is exactly the kind of gap that quarterly drills exist to find before it matters at 03:00 on a Tuesday.
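The effective RPO is the longest interval between consecutive restore points, so the pharmacy gap can be found by scanning the chain's timestamps. The sketch below uses invented clock times purely for illustration (the story gives only the 22-minute delay, not when it happened); `clock` and `worst_gap` are hypothetical helpers.

```python
from datetime import datetime, timedelta

RPO = timedelta(minutes=15)  # objective quoted from BCM-0021 in the story


def clock(hhmm):
    """Parse an HH:MM reading; only same-day differences are meaningful."""
    return datetime.strptime(hhmm, "%H:%M")


def worst_gap(restore_points):
    """Largest interval between consecutive restore points (needs two or more)."""
    ordered = sorted(restore_points)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


# Assumed timeline: the 02:15 job failed and its retry landed 22 minutes late.
pharmacy_points = [clock(t) for t in ("01:45", "02:00", "02:37", "02:45")]

effective_rpo = worst_gap(pharmacy_points)
breach = effective_rpo > RPO
```

The scheduled fifteen minutes plus the twenty-two minute delay gives the thirty-seven minute effective RPO recorded in the drill.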

The availability dimension of the CIA triad — Confidentiality, Integrity, Availability — is often treated as the least glamorous of the three. Confidentiality breaches make headlines. Availability failures are the ones that stop clinicians treating patients. Every architectural decision in this drill maps to it: RAID protects against drive failure, backup strategy defines survivable data loss, DR site classification sets recovery speed, and HA clustering ensures critical systems absorb single-node failures without any intervention.

The clinical systems run on a VMware vSphere cluster: seven ESXi hosts presenting a shared compute pool. If a host fails, vSphere HA restarts its virtual machines on surviving hosts automatically. The cluster is active/passive in the sense that VMs run on live hosts while spare capacity is held in reserve for failover. The load balancer tier is different: two F5 BIG-IP instances run active/active, both handling traffic simultaneously. If one fails, the other absorbs everything with no switchover delay. Active/active gives the better RTO; active/passive costs less. Both are high availability patterns. A load balancer distributes requests across the cluster; a cluster presents multiple hosts as a single logical system.
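The difference between the two patterns comes down to which healthy nodes receive traffic at any moment. The following is a deliberately simplified sketch, not how F5 BIG-IP or vSphere HA is configured: `pick_targets` and the node names are hypothetical, and health is reduced to set membership.

```python
def pick_targets(nodes, healthy, mode):
    """Return the nodes that should receive traffic right now.

    `nodes` is an ordered list of node names (preferred node first),
    `healthy` is the set of nodes currently passing health checks.
    """
    alive = [node for node in nodes if node in healthy]
    if mode == "active/active":
        return alive        # every healthy node serves traffic
    if mode == "active/passive":
        return alive[:1]    # one node serves; the rest stand by
    raise ValueError(f"unknown mode: {mode}")


pair = ["lb-a", "lb-b"]

both_up = pick_targets(pair, {"lb-a", "lb-b"}, "active/active")
one_down = pick_targets(pair, {"lb-b"}, "active/active")
standby_takes_over = pick_targets(pair, {"lb-b"}, "active/passive")
```

In the active/active case the surviving node was already serving traffic before the failure, which is why the story describes no switchover delay; in the active/passive case the standby has to be promoted first.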

At 10:52, Ifeoma triggers the failback. Primary site systems come online. Traffic cuts back over at 11:09. The drill closes at 11:23 — thirty-seven minutes inside the six-hour window.

In the afternoon, Ifeoma writes two entries in the DR log. The first: RTO confirmed at forty-four minutes. RPO confirmed at fifteen minutes for all systems except pharmacy, where an undetected job gap introduced a thirty-seven-minute effective RPO. The second: pharmacy backup schedule reviewed and a monitoring alert added to Veeam's alert policy to flag any incremental job failure that is not followed by a successful retry within ten minutes. BCM-0021 is updated. The audit trail is complete. The numbers are now evidence, not aspiration.
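The alert rule in the second log entry can be expressed as a small check over a job history. This is plain Python for illustration, not Veeam's alerting API: `unrecovered_failures`, the event format, and the example timestamps are all assumptions.

```python
from datetime import datetime, timedelta

RETRY_GRACE = timedelta(minutes=10)  # threshold from the DR log entry


def clock(hhmm):
    """Parse an HH:MM reading; only same-day differences are meaningful."""
    return datetime.strptime(hhmm, "%H:%M")


def unrecovered_failures(events, now, grace=RETRY_GRACE):
    """Return failure times with no successful run within `grace` afterwards.

    `events` is a chronological list of (timestamp, status) tuples, where
    status is "success" or "failed". Failures whose grace period has not
    yet expired at `now` are not flagged.
    """
    flagged = []
    for i, (failed_at, status) in enumerate(events):
        if status != "failed":
            continue
        deadline = failed_at + grace
        recovered = any(
            later_status == "success" and failed_at < later_at <= deadline
            for later_at, later_status in events[i + 1:]
        )
        if not recovered and now > deadline:
            flagged.append(failed_at)
    return flagged


# Assumed history: one slow retry (22 minutes) and one fast retry (5 minutes).
history = [
    (clock("02:15"), "failed"),
    (clock("02:37"), "success"),
    (clock("03:15"), "failed"),
    (clock("03:20"), "success"),
]

alerts = unrecovered_failures(history, now=clock("04:00"))
```

Only the 02:15 failure is flagged: its retry did succeed, but too late to keep the restore points within the fifteen-minute objective.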

A backup you have never restored is not a backup. It is a belief. The drill exists because beliefs do not survive contact with a major incident. — Story 12 · Resilience & Backups

// DR NAPKIN

The numbers.

// RAID

RAID 0 — stripe · fast · zero safety

RAID 1 — mirror · survives 1 failure

RAID 5 — parity · survives 1 · min 3 drives

RAID 6 — double parity · survives 2 · min 4

RAID 10 — mirror + stripe · fast + safe · min 4

// Backup types

Type	What it captures	Restore chain
Full	Everything	Full only
Incremental	Since last backup	Full + every incremental
Differential	Since last full	Full + latest differential
Snapshot	Point-in-time state	Direct mount / instant

// Objectives

RPO = max data loss · looks backward

RTO = max downtime · looks forward

// DR Sites

Hot — live, instant failover · $$$

Warm — provisioned, hours to activate · $$

Cold — bare, weeks to build · $

// Terms Introduced

RAID 0, 1, 5, 6, 10
RPO, RTO
Full, Incremental, Differential, Snapshot
3-2-1 rule
Hot, Warm, Cold site
HA, Cluster, Active/active, Active/passive
Load balancing, Failover
Availability (CIA triad)

The Quarterly Drill.

Hartfield Trust's BCP states an RPO of fifteen minutes for clinical systems. Which backup approach best supports this objective?

The DR Plan.