Business continuity & disaster recovery: planning for the day it breaks

A database gets corrupted, a region goes dark, a contractor's laptop carries ransomware into the file server, a flood takes out the office. The question is never whether something disruptive will happen; it is whether you can keep serving customers while it does, and how much you lose getting back to normal. Business continuity and disaster recovery are the two disciplines that answer that question, and the most expensive mistake leaders make is treating them as a document to file rather than a capability to rehearse.

The quick version

Business continuity (BC) is the wide lens: keeping the whole organisation's critical functions running through a disruption, including people, premises, suppliers and processes, not just the technology.
Disaster recovery (DR) is the narrow, technical subset: restoring IT systems and data after they go down. DR serves BC; it is not the same thing.
Two numbers set every target: RTO (how fast you must be back) and RPO (how much data you can afford to lose). Both come from a business impact analysis, not from the IT team's gut.
A plan you have never tested is a hypothesis, not a plan. The single highest-value move is rehearsing the recovery before you need it.

The idea in depth: two disciplines, not one

The words get used interchangeably, and that confusion costs money. The cleanest separation comes from the standard that governs the field. ISO 22301, the standard for business continuity management systems, frames continuity as the organisation's "capability to continue the delivery of products and services within acceptable time frames at predefined capacity during a disruption" (ISO 22301:2019). Notice what that covers: not servers, but delivery of products and services, which depends just as much on whether your people can work, your payment processor is up, and your key supplier is still standing.

Disaster recovery is the part of that picture that lives in IT. The US National Institute of Standards and Technology, in its Contingency Planning Guide (NIST SP 800-34 Rev. 1), defines a disaster recovery plan narrowly as the procedures for relocating and restoring information-system operations after a major disruption, and places that plan inside a family of plans alongside continuity-of-operations, incident response and crisis communications. The useful shift is to stop asking "do we have a DR plan?" and start asking two questions with two owners: can the business keep functioning (continuity, owned by the executive who runs the function), and can we rebuild the technology (recovery, owned by whoever runs the systems)?

flowchart TD
  A(["Business continuity
keep critical functions running"]) --> B(["People & premises
where do staff work?"])
  A --> C(["Suppliers & partners
can they still deliver?"])
  A --> D(["Disaster recovery (IT)
restore systems & data"])
  D --> E(["Backups & replication
meet the RPO"])
  D --> F(["Failover & rebuild
meet the RTO"])

Disaster recovery is one branch of business continuity, not the whole tree. Leaders Loop

An honest limitation. Standards like ISO 22301 and NIST 800-34 describe a thorough, audit-ready process that few small organisations have the time to run in full. The danger is the opposite extreme: deciding that because you cannot do the whole standard, you will do nothing. The standards are a menu, not a contract. A two-person business that knows its three critical functions, keeps tested off-site backups, and has written down who calls whom is already ahead of most.

RTO and RPO: the two numbers the business owns

If you remember one thing from this piece, make it the difference between these two targets, because every technical and budget decision flows from them. Recovery time objective (RTO) is the maximum time a function can be down before the damage becomes unacceptable. Recovery point objective (RPO) is the maximum amount of data you can afford to lose, measured in time: if your RPO is one hour, you must be able to recover to a state no more than an hour before the failure.
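To make the two definitions concrete, here is a minimal Python sketch that measures one incident against both targets. The `Incident` class, its field names and the timestamps are hypothetical, chosen only for illustration; the point is that RTO is compared with elapsed downtime, while RPO is compared with the age of the newest recoverable data at the moment of failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    failed_at: datetime          # when the function went down
    restored_at: datetime        # when it was serving again
    last_good_copy_at: datetime  # newest data point that could be recovered


def downtime(incident: Incident) -> timedelta:
    """Time the function was unavailable (compare against the RTO)."""
    return incident.restored_at - incident.failed_at


def data_loss_window(incident: Incident) -> timedelta:
    """Span of data lost, measured in time (compare against the RPO)."""
    return incident.failed_at - incident.last_good_copy_at


def met_targets(incident: Incident, rto: timedelta, rpo: timedelta) -> dict:
    return {
        "rto_met": downtime(incident) <= rto,
        "rpo_met": data_loss_window(incident) <= rpo,
    }


# Hypothetical incident: down for 90 minutes, newest backup 90 minutes old.
incident = Incident(
    failed_at=datetime(2024, 3, 1, 14, 0),
    restored_at=datetime(2024, 3, 1, 15, 30),
    last_good_copy_at=datetime(2024, 3, 1, 12, 30),
)
result = met_targets(incident, rto=timedelta(hours=2), rpo=timedelta(hours=1))
print(result)
```

In this made-up incident the two-hour RTO is met, but the one-hour RPO is missed: the service came back in time, yet 90 minutes of data were gone. The two targets fail independently, which is why they need to be set and measured separately.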

Here is the trap: these are not numbers the IT team should invent. Both derive from a business impact analysis (BIA), which asks, function by function, "what does an hour of this being down actually cost us, in money, in obligations, in trust?" In both ISO 22301 and NIST's seven-step process, the BIA comes before anyone chooses a recovery strategy, because the technology is a consequence of the appetite, not the reverse. Make RTO and RPO a business decision, then, signed off by the function's owner, with the cost of each target laid out honestly. Tighter targets cost more, sometimes a great deal more, and only the business owner can weigh that trade-off.

RTO is how long you can be down. RPO is how much data you can lose. Everything else is implementation.

And the stakes are not abstract. The Uptime Institute's Annual Outage Analysis 2024 found that 54% of organisations said their most recent significant outage cost more than US$100,000, and 20% said it cost more than US$1 million, a figure that has been rising as digital services become more critical and recovery takes longer. That is the number that should set your RTO, not a vague sense that downtime is "bad."

Recovery strategies: you get the resilience you pay for

Once the business has set RTO and RPO, the technical question is which recovery pattern buys those targets at an acceptable cost. The cloud providers have converged on a useful ladder. Amazon's Disaster Recovery of Workloads on AWS whitepaper lays out four tiers, and the same logic applies whether or not you run on AWS:

Backup and restore: keep backups and rebuild from them; cheapest but slowest, with RTO and RPO in hours.
Pilot light: data replicated continuously and core infrastructure sitting idle in a second location, ready to switch on.
Warm standby: a scaled-down but always-running copy of production that can take traffic immediately, then scale up.
Multi-site active/active: full production running in two or more locations at once, with recovery time near zero.

The AWS whitepaper is blunt about the catch: multi-site active/active "is the most complex and costly approach." You buy down recovery time with money and engineering effort; there is no free tier of resilience.

flowchart LR
  A(["Backup & restore
RTO/RPO: hours · $"]) --> B(["Pilot light
minutes–hours · $$"])
  B --> C(["Warm standby
minutes · $$$"])
  C --> D(["Multi-site active/active
near zero · $$$$"])

The resilience ladder: each rung cuts recovery time and raises cost. Match the rung to the function's RTO; not every system needs the top.
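Matching a rung to an RTO can be sketched as a simple lookup: walk the ladder from cheapest to most expensive and stop at the first rung that is fast enough. The recovery times below are illustrative assumptions picked to fit the ladder's rough bands (hours, minutes to hours, minutes, near zero); they are not figures from the AWS whitepaper, and real numbers depend on your own architecture and drills.

```python
from datetime import timedelta

# Rungs in ascending cost order. The recovery times are illustrative
# assumptions, not vendor figures; replace them with times you have measured.
LADDER = [
    ("backup and restore", timedelta(hours=8)),
    ("pilot light", timedelta(hours=1)),
    ("warm standby", timedelta(minutes=10)),
    ("multi-site active/active", timedelta(minutes=1)),
]


def cheapest_rung(required_rto: timedelta) -> str:
    """Return the first (cheapest) rung whose assumed recovery time fits the RTO."""
    for name, assumed_recovery_time in LADDER:
        if assumed_recovery_time <= required_rto:
            return name
    raise ValueError("no rung on the ladder meets this RTO")


print(cheapest_rung(timedelta(hours=24)))    # a tolerant internal tool
print(cheapest_rung(timedelta(minutes=15)))  # a customer-facing service
```

Under these assumed times, a 24-hour RTO lands on backup and restore and a 15-minute RTO lands on warm standby. The lookup only ever climbs as far as the target forces it to, which is the whole cost argument in one loop.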

Tier your systems, then, rather than blanket them. A revenue-generating, customer-facing application may justify warm standby; an internal reporting tool almost certainly does not. Spend top-tier money on a system with a tolerant RTO and you have simply overpaid for resilience nobody needed.

Underlying all of it is the discipline of backups themselves, where the durable rule of thumb is the 3-2-1 rule: three copies of your data, on two types of media, with one kept off-site. The phrase was coined by photographer Peter Krogh in his 2005 book The DAM Book: Digital Asset Management for Photographers, and it remains the simplest test of whether your backups can survive the loss of any single place or system. A modern gloss adds a fourth and fifth digit (often written 3-2-1-1-0): one copy off-line or immutable, and zero errors on a tested restore, because ransomware now targets the backups themselves.
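The 3-2-1 rule is simple enough to check mechanically against an inventory of where your data lives. The sketch below is one possible way to do that; the `BackupCopy` class, its fields and the location names are hypothetical. It also checks the "off-line or immutable" digit, but it cannot check the final one: "zero errors on a tested restore" can only be established by actually restoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str                       # hypothetical label, e.g. "office-nas"
    media: str                          # e.g. "disk", "tape", "object-storage"
    offsite: bool
    offline_or_immutable: bool = False


def check_3_2_1(copies: list[BackupCopy]) -> dict[str, bool]:
    """Check an inventory of data copies against the 3-2-1 rule and its extra digit."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media_types": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
        "one_offline_or_immutable": any(c.offline_or_immutable for c in copies),
    }


# Hypothetical inventory: the production copy counts as one of the three.
copies = [
    BackupCopy("production-db", "disk", offsite=False),
    BackupCopy("office-nas", "disk", offsite=False),
    BackupCopy("cloud-bucket", "object-storage", offsite=True),
]
report = check_3_2_1(copies)
print(report)
```

This inventory passes classic 3-2-1 but fails the newer check, because every copy is online and writable. That is exactly the arrangement in which ransomware with the right credentials can reach all three.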

The discipline most plans skip: testing

The most common failure in this whole field is not a missing plan; it is an untested one. A backup you have never restored is a guess about whether you can recover. NIST 800-34 builds "plan testing, training and exercises" in as a formal step for exactly this reason, and the AWS whitepaper states plainly that you must "regularly assess and test your disaster recovery strategy so that you have confidence in invoking it."

Google made this a discipline with teeth. Its DiRT (Disaster Recovery Testing) programme, started by site reliability engineers in 2006, deliberately injects real and simulated failures across the company to surface the risks nobody planned for: the runbook that names a person who left, the failover that depends on the very system that is down, the backup that restores but takes nine hours when the RTO was two. The test is the point; an unexercised recovery capability quietly rots. You do not need Google's budget to borrow the habit, just a regular, scheduled rehearsal: restore a backup to a clean environment and time it; fail a non-critical service over and watch what breaks; run a tabletop where the team talks through a ransomware scenario step by step. Better to find the gap in the calm of a drill than in the panic of a real incident.
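"Restore a backup and time it" can be as small as a stopwatch around whatever restore procedure you already have. The following is a minimal sketch, not a real tool: `restore` is a hypothetical callable standing in for your own procedure, assumed to return True only when the restored data passes your verification, and `fake_restore` exists purely so the example runs.

```python
import time
from datetime import timedelta
from typing import Callable


def run_restore_drill(restore: Callable[[], bool], rto: timedelta) -> dict:
    """Time one restore attempt and compare the result with the RTO.

    `restore` is a hypothetical callable: it performs the restore and
    returns True only if the restored data passed verification.
    """
    started = time.perf_counter()
    verified = restore()
    elapsed = timedelta(seconds=time.perf_counter() - started)
    return {
        "verified": verified,
        "elapsed": elapsed,
        "within_rto": verified and elapsed <= rto,
    }


def fake_restore() -> bool:
    """Stand-in for a real restore; sleeps briefly and reports success."""
    time.sleep(0.05)
    return True


report = run_restore_drill(fake_restore, rto=timedelta(hours=2))
print(report["verified"], report["within_rto"])
```

The drill counts as a pass only if the data verified and the clock beat the RTO. A restore that finishes quickly but produces unusable data is recorded as a failure, which is the distinction an untested backup hides.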

A worked example

Take a mid-sized online retailer; call it Harbour & Co. (Illustrative figures throughout; a teaching example, not a real company.) Its leadership commissions a business impact analysis and finds something uncomfortable: the checkout system, if down, costs an illustrative £8,000 an hour in lost sales and lasting trust, while the internal analytics dashboard costs essentially nothing for a day. Same infrastructure budget, wildly different value.

So they set targets by function, not by server. Checkout gets an RTO of 15 minutes and an RPO of near zero (losing even a few orders is unacceptable), which justifies a warm standby in a second region with continuous database replication. The analytics dashboard gets an RTO of 24 hours and a daily backup-and-restore arrangement costing a fraction as much. Then comes the part most firms skip: every quarter they fail checkout over to the standby during a low-traffic window and time it. The first drill is a near-disaster: failover takes 40 minutes because a DNS change needs a manual approval nobody is awake to give. That discovery, made in a planned test rather than a real outage, is the entire return on the exercise. They automate the approval, and the next drill comes in under target.

flowchart TD
  A(["Business impact analysis
what does downtime cost?"]) --> B{"Critical to
revenue/trust?"}
  B -->|"Yes, checkout"| C(["RTO 15 min · RPO ~0
→ warm standby"])
  B -->|"No, analytics"| D(["RTO 24 hr
→ backup & restore"])
  C --> E(["Quarterly failover drill
found: 40-min DNS gap"])
  E --> F(["Fix & re-test
now under target"])

Targets follow business impact; strategy follows targets; testing proves the lot.
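The arithmetic behind the drill's value is worth spelling out. This short sketch uses the example's illustrative £8,000-an-hour figure and assumes, for simplicity, that cost accrues linearly with downtime; real losses rarely behave that neatly, so treat the result as an order of magnitude.

```python
HOURLY_COST_GBP = 8_000  # illustrative checkout downtime cost from the example


def outage_cost(minutes_down: float, hourly_cost: float = HOURLY_COST_GBP) -> float:
    """Cost of an outage, assuming cost accrues linearly with downtime."""
    return hourly_cost * minutes_down / 60


first_drill = outage_cost(40)  # failover time measured in the first drill
at_target = outage_cost(15)    # failover time if the 15-minute RTO is met

print(f"40-minute failover: £{first_drill:,.2f}")
print(f"15-minute failover: £{at_target:,.2f}")
print(f"Gap found by the drill: £{first_drill - at_target:,.2f}")
```

On these assumptions a single real outage at the first drill's pace would have cost roughly £5,333 rather than £2,000, so one quarterly rehearsal exposed a gap worth about £3,333 per incident before any customer was affected.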

The lesson Harbour & Co. learned is the one this whole discipline turns on: the plan was only worth anything once they had run it. Everything before the first drill was a well-intentioned assumption.

Frequently asked questions

What's the difference between business continuity and disaster recovery?

Business continuity is the broad capability to keep critical functions running through any disruption, covering people, premises, suppliers and processes, not only technology. Disaster recovery is the narrower, IT-specific job of restoring systems and data after they fail. DR is one component of BC. You can have flawless disaster recovery and still fail at continuity if, say, your staff have nowhere to work or your only payment provider is the thing that is down.

What do RTO and RPO actually mean?

Recovery time objective (RTO) is how quickly a function must be back online before the harm becomes unacceptable. Recovery point objective (RPO) is how much data, measured in time, you can afford to lose: an RPO of one hour means your backups or replication must let you recover to within an hour of the failure. Both should come from a business impact analysis and be signed off by the business owner, because both have direct, sometimes steep, cost implications.

How often should we test our plan?

More often than feels comfortable, and in graduated forms. A tabletop walkthrough is cheap and can run quarterly; a full failover or restore-from-backup test is more involved but should happen at least annually for critical systems. The cadence matters less than the principle: a plan that has not been exercised recently is not a plan you can trust.

Isn't the cloud's redundancy enough on its own?

No, and assuming so is a common, costly error. A single region can still fail, and replication faithfully copies a corruption or a malicious deletion to your replica too. The AWS guidance is explicit that continuous replication does not protect against data corruption unless you also keep point-in-time, versioned or off-line backups. Redundancy protects against hardware loss; it does not protect against bad data or human error. You still need backups, and you still need to test them.
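A toy model shows why. In the sketch below, which is a deliberately simplified in-memory illustration and not how any real database replicates, every write and delete on the primary is mirrored to the replica at once, while a snapshot is a separate point-in-time copy. The class and key names are hypothetical.

```python
import copy


class ReplicatedStore:
    """Toy store: the replica mirrors every change; snapshots are frozen copies."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}
        self.replica: dict[str, str] = {}
        self.snapshots: list[dict[str, str]] = []  # oldest first

    def write(self, key: str, value: str) -> None:
        self.primary[key] = value
        self.replica[key] = value  # replication copies the change

    def delete(self, key: str) -> None:
        self.primary.pop(key, None)
        self.replica.pop(key, None)  # replication copies the deletion too

    def snapshot(self) -> None:
        self.snapshots.append(copy.deepcopy(self.primary))


store = ReplicatedStore()
store.write("order-1", "paid")
store.snapshot()          # point-in-time backup taken here
store.delete("order-1")   # malicious or accidental deletion

print("order-1" in store.replica)        # is the data still on the replica?
print("order-1" in store.snapshots[-1])  # is it still in the snapshot?
```

After the deletion the record is gone from the replica as well as the primary, and only the earlier snapshot still holds it. Replication answers "what if the hardware dies?"; point-in-time backups answer "what if the data itself goes bad?".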

We're small. Do we really need all this?

You need the thinking, not the paperwork. The standards describe a heavyweight process built for large, regulated organisations, but the core ideas scale down cleanly: know your two or three critical functions, decide how long you could survive without each, keep tested off-site backups (the 3-2-1 rule is a fine start), and write down who does what when things break. A small business that does those four things is more resilient than a large one with an elaborate plan nobody has ever rehearsed.

Related in the Toolkit

Resilience planning starts from knowing what you are protecting against (security fundamentals & threat modelling) and how badly a failure or breach would hurt (product & data risk), the same impact analysis that drives your RTO and RPO. The cost of meeting those targets, in turn, lands on the same financial statements every other investment does.

Security fundamentals & threat modelling: the threats you model are the disasters you plan to recover from.
Identity & access management: compromised access is a leading cause of the incidents DR exists to recover from.
Data privacy & PII handling (GDPR and equivalents): a breach or data loss is both a recovery event and a regulatory one.
Data retention, residency & sovereignty: where your backups live is a legal question, not just a technical one.
Product & data risk: the impact analysis that sets your recovery targets is risk assessment by another name.
Financial statements (P&L, balance sheet, cash flow): resilience is a cost-benefit trade-off, and the cost of downtime shows up here.
Lean, Six Sigma, Kaizen & continuous improvement: the test-find-fix-retest loop of DR drills is continuous improvement applied to resilience.
Hosting & cloud architecture: the recovery strategies on the ladder are architecture decisions you make up front.

Where to go next

NIST SP 800-34 Rev. 1, Contingency Planning Guide for Federal Information Systems: the clearest free, public, end-to-end walkthrough of the seven-step process, from business impact analysis to testing. Federal in origin, but the method is universal.
Disaster Recovery of Workloads on AWS (the four strategies): the practical reference for the backup/pilot-light/warm-standby/active-active ladder, with the cost and complexity trade-offs spelled out.
ISO 22301:2019, Business continuity management systems: the international standard, useful for understanding the formal vocabulary even if you never certify against it.
"Chaos Engineering + DiRT", AMA with Google and Netflix engineers (YouTube): practitioners who built disaster-testing programmes explain why deliberately breaking your own systems is the only way to trust your recovery.