Quantitative risk & stress testing: putting numbers on what could go wrong

Ask a risk team "how much could we lose?" and you will usually get a tidy answer: a dollar figure, a confidence level, a one-day horizon. The tidiness is the danger. The number is real, but it describes ordinary days, and it is the un-ordinary day that closes a business. Quantitative risk measurement and stress testing are the two halves of an honest answer: one estimates loss under normal conditions, the other deliberately breaks those conditions to see what survives.

The quick version

Quantitative risk means putting a number on potential loss. The best-known measure is Value at Risk (VaR): the most you'd expect to lose over a set period, at a set confidence level, under normal conditions.
VaR's famous flaw is that it says nothing about how bad the bad days get: it draws a line and ignores everything past it. Expected Shortfall (the average loss beyond that line) was designed to fix exactly that.
Scenario and stress testing abandon the "normal conditions" assumption on purpose: you pick a plausible-but-severe situation and ask what it would do to you. Reverse stress testing flips the question: start from "what would break us?" and work backwards.
The leadership trap is mistaking a precise number for a safe one. The output of all this is not certainty; it is a sharper list of questions for the board to argue about.

The idea in depth: one number, and what it leaves out

The habit of compressing risk into a single figure has a specific origin. In the late 1980s, JPMorgan chairman Dennis Weatherstone grew tired of thick, slow risk reports and asked a blunt question: how much could the bank lose by the end of tomorrow? The answer became the firm's daily "4:15 report", named for the time it landed on his desk, and in October 1994 JPMorgan published the methodology to the world for free as RiskMetrics, making Value at Risk the industry's default language of risk (history via OpenGamma and Wikipedia). The appeal is obvious: one sentence a non-specialist can act on, "we are 99% confident we will not lose more than X tomorrow."

So use VaR for what it is good at (a consistent, comparable, daily gauge of routine market risk that a board can track over time) and refuse to read it as a worst case. VaR is a threshold, not a ceiling. "99% one-day VaR of £2m" means that on roughly one trading day in a hundred you expect to lose more than £2m, and it is silent on whether "more" means £2.1m or £200m. Treat the figure as a thermometer, not a seatbelt.
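To see the threshold-not-ceiling point in numbers, here is a minimal sketch of historical-simulation VaR, one common way to estimate the figure. Everything in it is an assumption for illustration: the P&L series is simulated, the volatility of 850 (in £ thousands) and the seed are arbitrary, and the function name historical_var is hypothetical rather than any library's API.

```python
import random

def historical_var(pnl, confidence=0.99):
    """Historical-simulation VaR from daily P&L (negative values are losses).

    Returns the loss level that roughly (1 - confidence) of the observed
    days exceeded, expressed as a positive number.
    """
    losses = sorted((-x for x in pnl), reverse=True)   # biggest loss first
    n_tail = max(1, round(len(losses) * (1 - confidence)))
    n_tail = min(n_tail, len(losses) - 1)
    # Barring ties, exactly n_tail observations are worse than this level.
    return losses[n_tail]

random.seed(42)
# Hypothetical book: 1,000 trading days of P&L in £ thousands.
daily_pnl = [random.gauss(0, 850) for _ in range(1000)]

var_99 = historical_var(daily_pnl, 0.99)
breaches = [-x for x in daily_pnl if -x > var_99]

print(f"99% one-day VaR: £{var_99:,.0f}k")
print(f"Days beyond the line: {len(breaches)} of {len(daily_pnl)}")
print(f"Worst of those days: £{max(breaches):,.0f}k")
```

By construction, about one day in a hundred sits beyond the line. The VaR figure marks where those days start; only the last line of output says anything about how bad they got.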

An honest limitation, and a serious one. VaR has two structural weaknesses no computing power fixes. First, many implementations, particularly the parametric (variance-covariance) kind, assume returns are roughly normally distributed, when real markets have fat tails: extreme moves happen far more often than the bell curve predicts, so VaR understates the rare, ruinous event. Second, VaR is not subadditive: the VaR of a combined portfolio can exceed the sum of its parts' VaRs, perversely penalising diversification. Nassim Taleb went further: in 2009 testimony to the U.S. House Science Committee he argued tail risks are not reliably measurable at all, and that a confident number encourages more risk-taking by anchoring people to a false limit. You needn't share his absolutism to take the warning: the number's precision is not its accuracy.
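The subadditivity failure is easier to believe once you have seen it happen. The sketch below uses a standard textbook-style counterexample with hypothetical figures: two independent loans, each losing 100 with a 4% probability and nothing otherwise. The function names and the quantile convention (VaR as the smallest loss level whose cumulative probability reaches the confidence level) are assumptions made for this illustration.

```python
from itertools import product

def var(dist, confidence):
    """Smallest loss L with P(loss <= L) >= confidence, for a discrete distribution."""
    cumulative = 0.0
    for loss, prob in sorted(dist.items()):
        cumulative += prob
        if cumulative >= confidence - 1e-12:
            return loss
    return max(dist)

def expected_shortfall(dist, confidence):
    """Average loss across the worst (1 - confidence) share of outcomes."""
    tail = 1.0 - confidence
    remaining, total = tail, 0.0
    for loss, prob in sorted(dist.items(), reverse=True):
        take = min(prob, remaining)      # use only as much mass as the tail needs
        total += take * loss
        remaining -= take
        if remaining <= 1e-12:
            break
    return total / tail

# Hypothetical loan: 4% chance of losing 100, otherwise no loss.
loan = {0: 0.96, 100: 0.04}

# Portfolio of two such loans, assumed independent.
combined = {}
for (l1, p1), (l2, p2) in product(loan.items(), repeat=2):
    combined[l1 + l2] = combined.get(l1 + l2, 0.0) + p1 * p2

print("95% VaR, one loan:  ", var(loan, 0.95))
print("95% VaR, two loans: ", var(combined, 0.95))
print("95% ES, one loan:   ", round(expected_shortfall(loan, 0.95), 2))
print("95% ES, two loans:  ", round(expected_shortfall(combined, 0.95), 2))
```

Each loan on its own shows a 95% VaR of zero, because a 4% default chance hides inside the 5% the measure ignores. Hold both and the chance of at least one default rises above 5%, so the combined VaR jumps to 100, more than the sum of the parts. Expected Shortfall on the same portfolio stays below the sum of the two individual figures.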

Expected Shortfall: measuring the tail you ignored

The cleanest response to VaR's "it ignores how bad the bad day gets" problem is a measure that looks past the threshold rather than stopping at it. That is Expected Shortfall (ES), also called Conditional VaR: the average loss given that you have blown through the VaR line. Where VaR says "the bad 1% starts here," ES says "and when you're in that 1%, here is what you typically lose."
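A small sketch makes the difference concrete. Two hypothetical books are built so that they share exactly the same 99% VaR yet have very different tails; the loss figures (in £ millions), the 500-day sample and the function name var_and_es are all invented for illustration, and the estimator is the plain empirical one rather than any particular vendor's method.

```python
from statistics import fmean

def var_and_es(losses, confidence=0.99):
    """Empirical VaR and Expected Shortfall from a sample of losses (positive = loss)."""
    ordered = sorted(losses, reverse=True)               # worst loss first
    n_tail = max(1, round(len(ordered) * (1 - confidence)))
    tail = ordered[:n_tail]                              # observations beyond the VaR line
    return ordered[n_tail], fmean(tail)                  # (threshold, average beyond it)

# 495 shared "ordinary" days, the worst of which is a £2.0m loss.
ordinary = [0.004 * i for i in range(494)] + [2.0]

# The five worst days differ: one book has a shallow tail, the other a deep one.
calm_book = ordinary + [2.1, 2.2, 2.3, 2.4, 2.5]
wild_book = ordinary + [2.1, 5.0, 20.0, 80.0, 200.0]

for name, book in (("calm", calm_book), ("wild", wild_book)):
    var_99, es_99 = var_and_es(book, 0.99)
    print(f"{name}: 99% VaR = £{var_99:.2f}m, 99% ES = £{es_99:.2f}m")
```

A report showing VaR alone would rate these two books as identical. Only the ES column reveals that one of them has a tail deep enough to matter.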

This isn't just intuitively nicer; it has a theoretical backbone. Artzner, Delbaen, Eber and Heath's "Coherent Measures of Risk" (Mathematical Finance, 1999) set out four properties any sensible risk measure should have, including subadditivity, the property VaR lacks. Expected Shortfall satisfies them; VaR does not. That edge is one reason the Basel Committee's market-risk framework shifted the bank standard from a 99% VaR toward a 97.5% Expected Shortfall. If you have the data, the sensible thing is to report ES alongside VaR, not instead of it: VaR for the comparable daily gauge, ES for the question that keeps you up at night, which is how deep the hole goes once you're in it.

The limitation that survives. Expected Shortfall is a better-behaved measure, but it is still a measure built on history and assumptions. If your data never contained a 2008 or a March 2020, ES will average a tail it has never actually seen. A coherent measure of a mis-estimated distribution is still mis-estimated. Which is the whole reason the next tool exists.

Scenario and stress testing: leaving the model behind

VaR and ES both ask "what does our data imply?" Stress testing asks a different, humbler question: "what if our data is the wrong guide?" Instead of inferring risk from the recent past, you specify a severe situation and trace its consequences. The discipline splits into two moves, and it helps to keep them distinct.

Scenario analysis runs your positions through a coherent, named state of the world: a historical replay (what would 2008 do to today's book?) or a hypothetical (a key supplier fails, rates jump 300 basis points, your largest customer leaves). Reverse stress testing inverts the logic: rather than starting from a scenario and computing the damage, you start from the damage you cannot survive (insolvency, a covenant breach, a liquidity wall) and work backwards to find the combinations of events that would get you there. The Basel Committee's Stress Testing Principles (BCBS, October 2018) treats both as core, and frames good stress tests as plausible, severe, and, crucially, suggestive of action rather than academic.

flowchart LR
  A(["Quantify routine risk
VaR / Expected Shortfall"]) --> B{"Trust the
distribution?"}
  B -->|"Normal days"| A
  B -->|"What if it's wrong?"| C(["Scenario analysis
replay 2008 / pick a shock"])
  C --> D(["Reverse stress test
what would actually break us?"])
  D --> E(["Management action
limits, hedges, capital, plan"])

The two halves of an honest risk picture: measure the normal, then deliberately break it.
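The two moves described above, running named scenarios forward and searching backwards from a breaking point, can be sketched in a few lines. This is a deliberately crude, hypothetical model: the Firm fields, every figure in it, and the linear cash_after function are assumptions for illustration (it treats lost revenue as lost cash and ignores any cost savings), not a recommended methodology.

```python
from dataclasses import dataclass

@dataclass
class Firm:
    cash: float            # £m of liquid cash today
    annual_revenue: float  # £m
    floating_debt: float   # £m of debt that reprices when rates move
    minimum_cash: float    # £m liquidity floor the firm cannot go below

def cash_after(firm, revenue_drop, rate_rise_bp):
    """Cash a year out under a shock, using a crude linear model."""
    lost_revenue = firm.annual_revenue * revenue_drop
    extra_interest = firm.floating_debt * rate_rise_bp / 10_000
    return firm.cash - lost_revenue - extra_interest

firm = Firm(cash=12.0, annual_revenue=80.0, floating_debt=50.0, minimum_cash=4.0)

# Scenario analysis: start from named shocks, compute the damage.
scenarios = {
    "rates +300bp": (0.00, 300),
    "largest customer leaves": (0.08, 0),
    "both at once": (0.08, 300),
}
results = {name: cash_after(firm, *shock) for name, shock in scenarios.items()}
for name, cash in results.items():
    status = "survives" if cash >= firm.minimum_cash else "BREACH"
    print(f"{name}: cash £{cash:.1f}m ({status})")

# Reverse stress test: start from the breach, search for shocks that cause it.
breaking_points = [
    (drop_pct, bp)
    for drop_pct in range(0, 21, 2)       # revenue falls of 0% to 20%
    for bp in range(0, 501, 100)          # rate rises of 0 to 500 basis points
    if cash_after(firm, drop_pct / 100, bp) < firm.minimum_cash
]
print("Mildest breaking combinations:", breaking_points[:3])
```

In this toy setup each named scenario is survivable, but the combined one leaves almost no margin, and the reverse search shows how little extra it takes to cross the floor. The value lies in the list of breaking combinations, which is the thing to bring to the room and argue over.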

The most credible public example at scale is regulatory bank stress testing. Each year the U.S. Federal Reserve publishes baseline, adverse and severely adverse scenarios (sharp recessions, market crashes, unemployment spikes) and tests whether big banks would stay above their capital minimums under each. The Fed is explicit that these "are not forecasts" but hypotheticals designed to probe resilience. That is the lesson a leader in any sector can borrow: a stress test isn't there to predict the future, but to find out whether you'd still be standing if a bad-but-plausible one arrived.

A stress test is not a forecast. It is a rehearsal for a future you hope never comes, run while you still have time to change the ending.

Where this breaks down. Stress tests are only as good as the imagination behind them, and people are reliably bad at imagining the thing that actually happens. A scenario library built from recent crises fights the last war. Reverse stress testing partly defends against this (it forces you to confront real breaking points rather than favourite worries), but it can't conjure a genuinely novel cause. So use stress testing to widen the conversation, not narrow it: run a deliberately uncomfortable scenario, invite the people who'll say "that's not the one I'd worry about," and treat the disagreement as the most valuable output.

A worked example

Take a mid-sized importer, call it Harbourgate, that buys in US dollars and sells in pounds. (Illustrative figures throughout; this is a teaching example, not a real company.) Its finance team reports a 95% one-month VaR on currency exposure of, say, £400,000: "95% confident we won't lose more than £400k to FX moves next month." The board sees the number, nods, moves on. That is exactly the failure mode.

Read correctly, that VaR says nothing about the 1-month-in-20 when the loss exceeds £400k. So the risk lead adds two things. First, an Expected Shortfall estimate: in those worst months the average loss is more like £900,000, a number that changes the conversation, because it more than doubles the figure the board had filed under "manageable." Second, a reverse stress test: what FX move would breach the firm's banking covenant? The answer comes back uncomfortably close to a move sterling has made in living memory.

flowchart TD
  A(["95% VaR: lose < £400k
(illustrative)"]) --> B{"Board reads it
as the worst case?"}
  B -->|"Yes, false comfort"| C(["No action.
Covenant exposed"])
  B -->|"No, ask what's past the line"| D(["Expected Shortfall
~£900k in bad months"])
  D --> E(["Reverse test: what FX move
breaches the covenant?"])
  E --> F(["Action: hedge to the
breaking point, not the average"])

The same exposure, two readings: one files it under "fine," the other finds the covenant.
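The reverse stress test in this example comes down to a few lines of arithmetic. The sketch below is one way it might look; the example above gives no exposure, exchange rate or covenant figures, so every input here (the $30m of unhedged purchases, the 1.25 spot rate, the £1.5m of covenant headroom and the 15% stress move) is a hypothetical assumption, as are the variable and function names.

```python
# All inputs are hypothetical placeholders, not figures from the worked example.
usd_purchases = 30_000_000     # unhedged USD buying over the covenant test period
spot = 1.25                    # assumed GBP/USD today (dollars per pound)
covenant_headroom = 1_500_000  # extra GBP cost absorbable before the covenant breaches

def extra_cost_gbp(rate, hedge_ratio=0.0):
    """Extra sterling cost of the unhedged dollars if GBP/USD moves from spot to rate."""
    unhedged = usd_purchases * (1 - hedge_ratio)
    return unhedged / rate - unhedged / spot

# Reverse stress test: solve extra_cost_gbp(rate) == covenant_headroom for rate.
breaking_rate = 1 / (1 / spot + covenant_headroom / usd_purchases)
breaking_move = breaking_rate / spot - 1      # negative means sterling weakening

# Hedge sizing: smallest hedge ratio that keeps the covenant intact under a
# chosen severe move (an assumed 15% fall in sterling).
stress_rate = spot * (1 - 0.15)
required_hedge = max(0.0, 1 - covenant_headroom / extra_cost_gbp(stress_rate))

print(f"Covenant breaks at GBP/USD {breaking_rate:.4f} ({breaking_move:.1%} from spot)")
print(f"Hedge ratio needed to survive a 15% fall: {required_hedge:.0%}")
```

The first output is the number to compare against sterling's past moves; the second turns "hedge to the breaking point, not the average" into a concrete ratio. A real exercise would also need the firm's actual hedging costs and the timing of its covenant tests, which this sketch leaves out.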

Nothing here required a quant overhaul. It required reading the first number honestly, asking what lies beyond it, and naming the specific event that would actually hurt. Harbourgate now hedges to the level that protects the covenant, not to the comfortable average, and the board's risk conversation has shifted from "what's the number?" to "what would break us, and have we done something about it?" That shift is the entire value of the discipline.

Frequently asked questions

What does Value at Risk actually tell me, in one sentence?

The loss you would expect not to exceed over a chosen period, at a chosen confidence level, under normal market conditions. The three caveats ("expect," "chosen confidence," "normal conditions") are doing enormous work. A 99% VaR is breached, by design, about one day in a hundred, and it never tells you how bad that day is.

What's the difference between VaR and Expected Shortfall?

VaR marks the threshold where the bad tail begins; Expected Shortfall measures the average loss inside that tail. ES is the more conservative and mathematically better-behaved measure: it captures the depth of the hole, not just its edge, which is why Basel's bank-capital rules moved toward an Expected-Shortfall standard. Report both if you can: VaR for tracking, ES for the depth of the tail. Neither is a worst case.

Is scenario analysis the same as stress testing?

Scenario analysis is one technique inside stress testing. Stress testing is the broad practice of examining performance under severe conditions; scenario analysis does it by running a specific, coherent event (a historical replay or a hypothetical shock) through your positions. The other main approach is reverse stress testing, which starts from the outcome you can't survive and finds the events that cause it.

We're not a bank. Does any of this apply to us?

The maths is heaviest in finance, but the logic is universal. Any organisation can ask: what single number are we treating as a worst case when it isn't? What would a 2008-scale shock do to our cash, supply chain or biggest customer? What event would actually break us, and have we done anything about it? You can run a serious reverse stress test in a workshop with a whiteboard, no covariance matrix required.

If the models are this flawed, why bother quantifying at all?

Because a flawed number used honestly beats a vague feeling used confidently. The error is not in measuring risk; it is in mistaking the measurement for the risk. Quantification gives you something consistent to track and argue about; stress testing keeps you humble about its limits. Used together, they turn risk from a gut feeling into a structured conversation, which is exactly what a board is for.

Related in the Toolkit

Quantitative measurement only matters if it feeds a decision. That is why it sits inside the wider machinery of enterprise risk management & risk appetite (the numbers are how you tell whether you are inside the appetite the board actually set), and why the output belongs on a risk register with an owner and an action, not in a spreadsheet nobody opens.

Enterprise risk management & risk appetite: quantitative limits only mean something against a stated appetite; this is where that appetite is defined.
Risk identification & assessment (likelihood x impact): the qualitative cousin; stress testing is what you reach for when likelihood-times-impact feels too coarse for the tail.
Risk registers & mitigation strategies: where a stress-test finding becomes an owned action rather than an interesting chart.
Operational, financial, strategic & reputational risk: VaR and ES are sharpest on financial risk; the other categories need scenario thinking more than maths.
Three lines of defence & risk governance: who builds the model, who challenges it, and who audits it; the governance that stops a number going unquestioned.
Board roles, committees & responsibilities: the risk committee is the room where stress-test results should land and be argued over.
Employment law basics: a reminder that not every material risk is financial or modellable; some live in compliance and people.
Insurance & risk transfer: once a stress test reveals an exposure you can't carry, transfer is often the answer.

Where to go next

"Stress testing principles", Basel Committee on Banking Supervision (2018): short, readable, and the authoritative statement of what good stress testing looks like, including reverse stress testing; written for banks but useful far beyond them.
"Coherent Measures of Risk", Artzner, Delbaen, Eber & Heath (1999): the paper that defined the four properties a risk measure should have, and showed VaR fails one of them; the theoretical case for Expected Shortfall.
Federal Reserve, Dodd-Frank Act Stress Tests: see real baseline / adverse / severely adverse scenarios applied to real institutions; the clearest public example of disciplined scenario design.
Nassim Nicholas Taleb, "What is a Black Swan?" (YouTube): a short, sharp statement of why the rare, un-modelled event is the one that matters, and why over-trusting a tidy risk number is dangerous.