Operational resilience: staying up when things break

For years, "don't break" was the goal. Buy redundant hardware, write the disaster-recovery plan, hope. Operational resilience starts from a more honest premise: disruption is not a rare accident to be engineered out, it is a certainty to be survived. The question stops being "how do we stop outages?" and becomes "when an outage hits, can the services our customers truly rely on keep running, or recover fast enough that nobody gets hurt?"

The quick version

Operational resilience is the ability to keep delivering your most important services through disruption, and to recover when prevention fails. It assumes things will break, rather than betting they won't.
It rests on three moves: name your important business services (the ones whose failure harms customers or the market), set an impact tolerance for each (the maximum disruption you can take before real harm), and test against severe-but-plausible scenarios.
It is now a legal expectation, not just good practice. UK financial firms had to operate within their impact tolerances by 31 March 2025; the EU's DORA regulation applied from January 2025.
The trap is treating it as a document. Resilience you have never tested is a hypothesis, not a capability.

The idea in depth: resilience is about services, not systems

The most useful shift in this field came from regulators, of all places. After a run of banking outages left people unable to access their own money, the Bank of England, the Prudential Regulation Authority (PRA) and the Financial Conduct Authority (FCA) published a joint policy framework in March 2021 that reframed the whole problem. Their starting paper, "Building operational resilience: impact tolerances for important business services" (2019), made the case that firms should plan from the customer's point of view, not the org chart's. Don't ask "is the database up?" Ask "can a customer still get paid, get a payment out, get a claim settled?"

That reframing produces the discipline's core vocabulary. An important business service is one whose disruption would cause intolerable harm to customers or threaten the wider market: not every service, just the ones that matter. An impact tolerance is the maximum tolerable level of disruption to that service, usually expressed as a hard limit: this service must not be down for more than X hours. In the regulators' own terms, it is the maximum level of disruption a firm can tolerate before the harm to consumers or markets becomes unacceptable. So the move is: stop maintaining a flat list of systems ranked by technical criticality. Instead, map the handful of services a customer would notice losing and, for each, write down the single number (hours, transactions, or amount of data) past which the harm becomes real. That number forces priorities that a system inventory never will.
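As a rough illustration of "one service, one number", the registry below records each important business service with its impact tolerance and sorts by the tightest limit. It is a minimal sketch, not a regulatory template: the class name, field names, example services and their tolerances are all hypothetical, and treating an outage exactly at the limit as "within tolerance" is an assumption of the sketch.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class ImportantBusinessService:
    """A customer-facing service and the limit past which harm becomes real."""
    name: str
    harm_if_lost: str            # who is hurt, in plain words
    impact_tolerance: timedelta  # maximum tolerable disruption

    def within_tolerance(self, outage: timedelta) -> bool:
        # Assumption of this sketch: an outage exactly at the limit still passes.
        return outage <= self.impact_tolerance


# Hypothetical entries, for illustration only.
services = [
    ImportantBusinessService("settle a claim", "customers left out of pocket", timedelta(hours=24)),
    ImportantBusinessService("send a payment", "customers miss their bills", timedelta(hours=2)),
]

# The tightest tolerance sorts first: a simple priority order for testing and investment.
by_priority = sorted(services, key=lambda s: s.impact_tolerance)

for service in by_priority:
    print(f"{service.name}: tolerance {service.impact_tolerance}")
```

The point of the structure is that the tolerance lives on the service, not on any individual system, so every later test can be judged against the same number.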

UK regulators turned this into a deadline. After a transition period, firms in scope had until 31 March 2025 to show they could remain within their impact tolerances for each important business service in the event of a severe but plausible disruption (a milestone covered in Sidley Austin's guide to the rules and reflected on by the FCA's own post-deadline observations). In parallel, the EU's Digital Operational Resilience Act (DORA) applied from 17 January 2025, hard-coding ICT risk management, incident reporting and, notably, oversight of the third-party providers everyone quietly depends on.

An honest limitation. This framework was written for regulated finance, and it shows: the language of "intolerable harm to the market" is a poor fit for a marketing SaaS or a logistics startup. The principles travel; the thresholds don't. If you borrow the model, borrow the questions (which services matter, how much disruption is too much, what would actually break them) rather than importing a compliance checklist your business never needed. Resilience theatre, a binder of policies nobody has tested, is its own failure mode, and a regulatory template makes it easier to perform.

flowchart TD
  A(["List candidate services"]) --> B{"Would losing it
harm customers
or the market?"}
  B -->|"No"| C(["Important, but not
top-tier, monitor"])
  B -->|"Yes"| D(["Important business service"])
  D --> E(["Set an impact tolerance
max tolerable disruption"])
  E --> F(["Map the chain:
people, process, tech,
data, third parties"])
  F --> G(["Test against
severe-but-plausible
scenarios"])
  G -.->|"Can't stay within
tolerance"| H(["Invest / redesign
then re-test"])
  G -->|"Stays within
tolerance"| I(["Evidenced resilience"])

The operational-resilience loop: from naming what matters to proving it survives. Leaders Loop

The engineering half: assume failure, then go break things on purpose

The regulators supplied the language of services and tolerances. A parallel tradition, born in engineering, supplied the method for finding out whether your resilience is real: stop waiting for failure and cause it deliberately. Chaos engineering emerged at Netflix, which open-sourced its Chaos Monkey tool in 2012. Chaos Monkey randomly kills production servers during business hours, on the logic that if your system can't survive a single instance dying while you're watching, you'd rather learn that now than at 3am during a real incident.

The discipline has a published creed. The Principles of Chaos Engineering define it as "the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production." The method is disciplined, not reckless: define the system's normal steady state using a business metric (orders per second, say), hypothesise it will hold, then introduce a real-world fault (a dead server, a slow dependency, a network partition) and watch whether the steady state survives. Crucially, you minimise the blast radius: run the experiment on a small slice of traffic first, with a way to abort, so an experiment meant to build confidence can't become the outage it was studying.
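The shape of such an experiment can be sketched in a few lines. This is a toy under stated assumptions, not a real chaos tool: `run_experiment`, `ToyShop`, the 10% abort threshold and the server counts are all hypothetical, the "system" is a simple capacity model, and the blast radius is represented only by killing a single server (a real experiment would also limit the slice of traffic exposed).

```python
from typing import Callable


def run_experiment(
    steady_state: Callable[[], float],  # business metric, e.g. orders per second
    inject_fault: Callable[[], None],
    abort: Callable[[], None],          # rollback, always called at the end
    baseline: float,
    max_drop: float = 0.10,             # hypothetical abort threshold: a 10% drop
    checks: int = 5,
) -> bool:
    """Return True if the steady state held on every check, False if we aborted."""
    threshold = baseline * (1 - max_drop)
    inject_fault()
    try:
        for _ in range(checks):
            if steady_state() < threshold:
                return False  # hypothesis refuted: stop the experiment early
        return True
    finally:
        abort()  # remove the fault whether the hypothesis held or not


class ToyShop:
    """Hypothetical stand-in for a real system: throughput is capped by live servers."""

    def __init__(self, servers: int, per_server: float, demand: float) -> None:
        self.total = servers
        self.alive = servers
        self.per_server = per_server
        self.demand = demand

    def orders_per_second(self) -> float:
        return min(self.demand, self.alive * self.per_server)

    def kill_one(self) -> None:
        self.alive -= 1

    def restore(self) -> None:
        self.alive = self.total


roomy = ToyShop(servers=4, per_server=40.0, demand=100.0)  # spare capacity
tight = ToyShop(servers=4, per_server=25.0, demand=100.0)  # no headroom

roomy_held = run_experiment(roomy.orders_per_second, roomy.kill_one, roomy.restore, baseline=100.0)
tight_held = run_experiment(tight.orders_per_second, tight.kill_one, tight.restore, baseline=100.0)
print(roomy_held, tight_held)
```

In the toy, the system with headroom keeps serving full demand after losing a server, while the one sized exactly to demand drops below the threshold and the experiment aborts. Either way the `finally` clause restores the server, which is the sketch's stand-in for "a way to abort".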

Resilience you have never tested is not a capability. It is a hope with good documentation.

So the move is: run a "game day." Pick one important service, gather the people who own it, and inject a plausible failure in a controlled window: "the payments provider is timing out, go." Watch what actually happens, not what the runbook says should. The first game day almost always surfaces something embarrassing: a runbook pointing at a person who left, a failover that was never wired up, an alert nobody owns. That embarrassment, found on a Tuesday afternoon, is the entire point. As the Bank of England's framing puts it, you test against severe but plausible scenarios: severe enough to stress the system, plausible enough that the lessons are real. A meteor strike teaches you nothing; a region-wide cloud outage teaches you plenty.

An honest limitation. Chaos engineering assumes a mature, observable, well-instrumented system with the safety nets to abort an experiment. Run it on a fragile platform with no monitoring and no rollback and you are not testing resilience; you are causing the incident yourself. The order matters: get the basics in place (backups, monitoring, a tested restore) before you start breaking things on purpose.

Where resilience actually lives: the dependency you forgot

The recurring lesson of real incidents is that systems fail at their seams. Your service may be flawless, but it rides on a cloud region, a payments processor, an identity provider and a DNS resolver, and a fault in any of them is your outage too, regardless of whose name is on the status page. This is exactly why DORA put third-party ICT providers under direct oversight, and why the UK followed with rules on critical third parties in 2024: when the whole sector leans on the same handful of cloud and software vendors, one provider's bad day becomes everyone's.

So the move is to map the chain end to end for each important service (people, process, technology, facilities and external suppliers) and ask of every link: what happens if this is gone for an hour? A day? The dependency you can't readily answer for is the one most likely to take you down, because it's the one nobody has been made responsible for. Resilience, in practice, lives less in your own code than in the honesty of that map.
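One way to keep that map honest is to record, for every link, who answers for it and what takes over if it is gone, then list the links where either answer is missing. The sketch below is illustrative only: the `Dependency` fields, the `weak_links` helper and every entry in the chain are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Dependency:
    name: str
    kind: str                # people, process, technology, facilities or supplier
    owner: Optional[str]     # who answers for this link; None if nobody does
    fallback: Optional[str]  # what takes over if it is gone; None if nothing does


def weak_links(chain: List[Dependency]) -> List[str]:
    """Names of links that have no owner or no fallback."""
    return [d.name for d in chain if d.owner is None or d.fallback is None]


# Hypothetical chain for one important service.
chain = [
    Dependency("cloud region", "technology", "platform team", "second region"),
    Dependency("identity provider", "supplier", None, None),
    Dependency("payments processor", "supplier", "payments lead", None),
    Dependency("on-call engineer", "people", "ops manager", "secondary on-call"),
]

print(weak_links(chain))
```

Here the identity provider (nobody owns it) and the payments processor (owned, but with nothing to fall back on) are the links flagged for attention.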

A worked example

Take a mid-sized online lender, call it Brightline. (Illustrative figures throughout; this is a teaching example, not a real firm.) Brightline runs dozens of systems, but when its team applies the lens above, only three count as important business services from a customer's point of view: taking a loan application, disbursing approved funds, and taking repayments. Everything else (the marketing site, the internal BI dashboards) matters, but a customer survives losing it for a day.

For "disbursing approved funds," Brightline sets an impact tolerance: no approved customer should wait more than four hours for their money, because beyond that the harm (a missed completion, a bounced commitment) becomes real. Then they map the chain and find the seam: disbursement depends entirely on a single payments provider, with no fallback. On paper, resilient. Untested.

flowchart LR
  A(["Approved loan"]) --> B(["Payments provider
(single supplier)"])
  B --> C(["Funds in
customer account"])
  B -.->|"Provider down 6 hrs
game-day injection"| D{"Within the
4-hour tolerance?"}
  D -->|"No, breach"| E(["Add a second rail
+ manual fallback
then re-test"])
  D -->|"Yes"| F(["Tolerance evidenced"])

Brightline's game day turns a paper assumption into a tested limit, and finds the breach before a customer does.

So they run a game day in a controlled window: simulate the provider timing out for six hours and watch. The four-hour tolerance is breached comfortably, because there is no second rail and no manual process to push payments by hand. That finding, on a quiet afternoon, is worth more than any policy document. Brightline now has a concrete, prioritised investment: a backup payments rail and a manual fallback procedure, after which they re-test until disbursement genuinely survives the loss of its primary supplier within four hours. Note the order: they didn't buy redundancy everywhere. They found the one service whose failure causes real harm, set a number, broke it on purpose, and fixed what the break revealed.
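The arithmetic of Brightline's game day is simple enough to write down. The sketch below uses the example's figures (a four-hour tolerance, a six-hour outage) and a deliberately crude model: it looks only at a customer approved at the moment the provider fails and ignores any backlog once service resumes. The function names and the 1.5-hour switch-over time for the fallback are hypothetical.

```python
from typing import Optional

TOLERANCE_HOURS = 4.0  # from the example: no approved customer waits more than four hours


def worst_case_wait(outage_hours: float, fallback_ready_after: Optional[float]) -> float:
    """Longest wait for a customer approved just as the provider goes down.

    fallback_ready_after is the time in hours needed to switch to a second rail
    or manual process; None means there is no fallback at all.
    """
    if fallback_ready_after is None:
        return outage_hours
    # The customer is paid by whichever comes first: recovery or the fallback.
    return min(outage_hours, fallback_ready_after)


def breaches_tolerance(outage_hours: float, fallback_ready_after: Optional[float]) -> bool:
    return worst_case_wait(outage_hours, fallback_ready_after) > TOLERANCE_HOURS


# First game day: six-hour outage, single supplier, no fallback.
before_fix = breaches_tolerance(6.0, None)

# Re-test after adding a second rail, assumed here to take 1.5 hours to switch over.
after_fix = breaches_tolerance(6.0, 1.5)

print(before_fix, after_fix)
```

The model makes the re-test criterion explicit: what has to come in under four hours is the time to get the fallback working, not the provider's own recovery time, which Brightline doesn't control.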

Frequently asked questions

How is operational resilience different from disaster recovery or business continuity?

They overlap, but the emphasis differs. Business continuity and disaster recovery are largely about recovering after a defined disaster (a fire, a data-centre loss) using a plan. Operational resilience is broader and more outcome-led: it starts from the customer-facing service, assumes disruption is continuous and varied rather than a single named event, and judges success by whether the important service stayed within its impact tolerance, not by whether a recovery plan was followed. Continuity and recovery are tools inside resilience, not synonyms for it.

We're not a bank. Does any of this apply to us?

The regulations are finance-specific, but the method is general. Any business has a small set of services its customers genuinely depend on, and those services run on systems and suppliers that will sometimes fail. Naming those services, setting a tolerance for each, mapping the dependencies and testing against a plausible failure is valuable whether or not a regulator is asking. Borrow the questions, skip the compliance paperwork you don't need.

Isn't deliberately breaking production reckless?

It is reckless without preparation, and disciplined with it. Chaos engineering done properly minimises the blast radius: a small slice of traffic, a clear hypothesis, monitoring to see what happens, and a fast way to abort. The alternative isn't a system that never fails; it's a system that fails for the first time during a real incident, at the worst possible moment, with no one watching on purpose. Start small, in a test environment, and earn your way to production experiments.

What does "severe but plausible" actually mean?

It is the test for choosing scenarios. "Severe" means the event genuinely stresses the system: a key supplier down for hours, a region-wide cloud outage, a cyber-incident locking you out of core data. "Plausible" means it could really happen to you, so the lessons transfer. The pairing rules out both the trivial (a single laptop dies) and the fantastical (simultaneous failure of everything), and keeps the testing honest.

Who should own operational resilience?

It cannot live only in IT, because the important services it protects are business outcomes, not systems. The pattern that works is a named senior owner accountable for the resilience of each important service, supported by engineering, operations and risk, with the people who run the service in the room for every game day. Leave it to "the technology team" and you get systems that recover while the customer outcome still fails.

Related in the Toolkit

Resilience overlaps with how you reason about threats in the first place. The same threat-modelling habit that asks "how could this be attacked?" feeds directly into the severe-but-plausible scenarios you test against, and the discipline of continuous improvement is what turns each game-day finding into a permanent fix rather than a forgotten ticket.

Security fundamentals & threat modelling: the structured way to imagine what could go wrong, which is where plausible failure scenarios come from.
Identity & access management: a failed identity provider is one of the most common single points of failure in a modern service chain.
Data privacy & PII handling (GDPR and equivalents): many severe-but-plausible incidents are data incidents, where resilience and privacy obligations collide.
Data retention, residency & sovereignty: where your data lives and how long you keep it shapes what a recovery can actually restore.
Product & data risk: impact tolerances are a risk judgement about how much harm is too much, expressed as a number.
Financial statements (P&L, balance sheet, cash flow): quantifying the cost of downtime is what justifies the investment in redundancy.
Lean, Six Sigma, Kaizen & continuous improvement: the loop that turns each incident and game day into a lasting reduction in fragility.
Hosting & cloud architecture: redundancy, failover and multi-region design are how resilience is built into the platform itself.

Where to go next

"Building operational resilience: impact tolerances for important business services", Bank of England (2019): the foundational paper that reframed resilience around customer-facing services and impact tolerances.
Principles of Chaos Engineering, principlesofchaos.org: the short, canonical statement of the method (steady state, hypothesis, real-world events, minimise the blast radius).
"Operational resilience: insights and observations for firms", FCA: the regulator's own review of what firms did well and badly after the 2025 deadline; a useful reality check on common mistakes.
"Getting Started with Chaos Engineering", Nora Jones, Casey Rosenthal & James Wickett, GOTO 2020 (YouTube): practitioners who built this at Netflix and beyond, on how to actually begin without breaking everything.