Experiment design: RCTs, A/B tests and quasi-experiments

A team ships a new onboarding flow on Tuesday. Sign-ups rise 8% by Friday. The deck writes itself, except for one inconvenient possibility: Friday is payday, the competitor's site was down on Wednesday, and your marketing email went out Thursday. Did the new flow cause the lift, or did it just happen to be standing nearby when the lift arrived?

The quick version

An experiment exists to separate cause from coincidence. Everything below is in service of that.
A randomised controlled trial (RCT) splits people at random into a change group and a no-change group. Random assignment is what makes the two groups comparable, so a difference bigger than chance alone would produce can be put down to the change.
An A/B test is an RCT run online, at scale, on live users: the same logic, a faster loop.
A quasi-experiment is what you use when you can't randomise: you reconstruct a fair comparison from natural timing or eligibility cut-offs. Weaker, but often the only honest option.

The idea in depth: randomisation is the whole trick

The reason the Tuesday-to-Friday story is unconvincing is a problem researchers call confounding: something other than your change moved at the same time, and you can't tell the two apart. The fix is almost embarrassingly simple, and it is roughly a century old.

In the 1920s and 1930s, the statistician Ronald A. Fisher, working on agricultural field trials at Rothamsted in England, made the case that you should assign treatments (which plot gets which fertiliser) at random. He laid this out in The Design of Experiments (1935). The insight: randomisation doesn't just balance the variables you thought of (soil, sunlight); it balances the ones you didn't and couldn't. On average, the lucky and unlucky, the keen and the indifferent, get spread evenly across both groups. So when you see a difference at the end, the change is the only thing left that could plausibly explain it.

So the move is: before you trust any before-and-after number, ask, "compared to what?" If the honest answer is "compared to how things were last week," you don't have an experiment, you have a hope. A real comparison needs a group that didn't get the change but is otherwise just like the group that did.

flowchart TD
    A(["Eligible population"]) --> B(["Randomly assign"])
    B --> C("Control: no change")
    B --> D("Treatment: gets the change")
    C --> E(["Measure outcome"])
    D --> F(["Measure outcome"])
    E --> G(["Difference = the effect of the change"])
    F --> G

The shape of every controlled experiment: random assignment makes the two groups comparable, so the gap at the end is causal.
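To make the balancing claim concrete, here is a minimal simulation sketch. The trait name (keenness), the population size and the seed are all invented for illustration; the point is only that a coin-flip split leaves a trait nobody measured at nearly the same average in both groups.

```python
import random
from statistics import mean

def simulate_balance(n_users=10_000, seed=42):
    """Randomly split users, then compare a hidden trait across the two groups."""
    rng = random.Random(seed)
    # Hypothetical hidden trait the experimenter never measures, between 0 and 1.
    keenness = [rng.random() for _ in range(n_users)]
    # Coin-flip assignment, made without looking at the trait.
    in_treatment = [rng.random() < 0.5 for _ in range(n_users)]
    treated = [k for k, t in zip(keenness, in_treatment) if t]
    control = [k for k, t in zip(keenness, in_treatment) if not t]
    return mean(treated), mean(control)

treated_mean, control_mean = simulate_balance()
print(f"mean keenness, treatment: {treated_mean:.3f}")
print(f"mean keenness, control:   {control_mean:.3f}")
```

With small groups the two averages can still differ noticeably by luck, which is why sample size matters as much as the coin flip.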

A/B testing: the RCT, industrialised

An A/B test is Fisher's idea wearing a hoodie. Live users hitting your product are split at random between version A (the current experience, your control) and version B (the new idea), and you compare a metric that matters: conversion, retention, revenue per visit. Because assignment is random and the sample is large, the same causal logic holds, but the feedback arrives in days, not seasons.

The discipline pays off in a humbling way. In Trustworthy Online Controlled Experiments (2020), Ron Kohavi, Diane Tang and Ya Xu, who ran experimentation at Microsoft, Google and LinkedIn, report that most ideas don't work. Across well-tested products, only a minority of changes move the target metric the way their authors expected; many do nothing, and a meaningful share actively make things worse. Writing earlier in Harvard Business Review (Kohavi & Thomke, 2017), Kohavi and Stefan Thomke describe a Bing experiment where a small change to how ad headlines displayed lifted revenue by roughly 12%, worth over $100 million a year, yet it had sat low on the priority list for months because nobody believed it would matter.

The recurring lesson of large-scale online experimentation, in Kohavi's words: we are poor at assessing the value of ideas, and most ideas fail.

So the move is: treat your roadmap's confidence as a hypothesis, not a fact. The point of testing isn't to rubber-stamp the obvious winners; it's to catch the expensive losers before you ship them to everyone, and to find the unloved 12%-ers nobody backed. This connects directly to reversible vs irreversible decisions: an A/B test is how you make a shippable decision reversible while you learn.

The honest limitation: A/B tests answer narrow, short-horizon questions well and big, slow ones badly. They struggle with effects that take months to show up (loyalty, brand), with changes everyone sees at once (a price change, a rebrand), and with novelty effects, the bump that comes purely from "ooh, something's different" and fades by week three. And running hundreds of tests invites false positives: test twenty ideas that do nothing against a 95% threshold and you should still expect about one to look "significant" by chance alone.
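The false-positive arithmetic is short enough to check by hand. This sketch assumes the tests are independent and that none of the ideas has any real effect; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Average number of no-effect tests that cross the threshold by luck."""
    return n_tests * alpha

def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent no-effect tests looks significant."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20, 100):
    print(
        f"{n:>3} tests: expect {expected_false_positives(n):.2f} false positives, "
        f"chance of at least one = {chance_of_any_false_positive(n):.0%}"
    )
```

At twenty tests the expected count is one, and the chance of seeing at least one spurious "winner" is a little under two in three.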

Quasi-experiments: when you can't randomise

Plenty of the decisions leaders care about can't be randomised. You can't randomly assign half your employees to a reorg, or half your customers to a recession. When random assignment is off the table, you fall back to quasi-experiments: designs that build a credible comparison group out of natural variation rather than a coin flip.

The canonical reference here is Shadish, Cook and Campbell's Experimental and Quasi-Experimental Designs for Generalized Causal Inference (2002). Two of their workhorses are worth a leader's vocabulary:

Difference-in-differences: roll a change out to one region (or team) but not another, and compare the change over time in each. If both were drifting upward at the same rate before, and the treated group then pulls ahead, that gap is your estimate, because it subtracts out the trend they shared.
Regression discontinuity: when a cut-off decides who gets something (a bonus above a score of 80, a discount above a spend threshold), people just above and just below the line are near-identical except for the treatment. Comparing them approximates a randomised test at the boundary.
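Both designs come down to simple arithmetic once the comparison groups are chosen. The sketch below uses invented numbers throughout. The discontinuity function is a deliberately naive version (a plain comparison of averages inside a narrow window, not a fitted regression), and it assumes people scoring exactly at the cut-off count as treated.

```python
from statistics import mean

def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the untreated group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change

def naive_discontinuity_gap(records, cutoff, bandwidth):
    """Mean outcome just at or above the cut-off minus mean outcome just below it."""
    above = [y for score, y in records if cutoff <= score < cutoff + bandwidth]
    below = [y for score, y in records if cutoff - bandwidth <= score < cutoff]
    return mean(above) - mean(below)

# Hypothetical weekly sign-ups: both regions were rising before the rollout.
did = diff_in_diff(
    treated_before=[100, 104, 108], treated_after=[120, 124, 128],
    control_before=[90, 94, 98], control_after=[102, 106, 110],
)
print("difference-in-differences estimate:", did)

# Hypothetical (score, outcome) pairs with a bonus awarded from a score of 80.
records = [(60, 40), (76, 50), (78, 52), (79, 51), (80, 58), (81, 60), (83, 59), (95, 75)]
print("gap at the cut-off:", naive_discontinuity_gap(records, cutoff=80, bandwidth=5))
```

In the first example the treated region rose by 20 and the untreated one by 12, so only the remaining 8 is credited to the change. In the second, the scores of 60 and 95 are ignored because they sit too far from the line to be a fair comparison.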

This family of methods earned a wider audience in 2019, when the Nobel memorial prize in economics went to Esther Duflo, Abhijit Banerjee and Michael Kremer for using field experiments, many of them RCTs, some quasi-experimental, to test what actually reduces poverty (The Royal Swedish Academy of Sciences, 2019). Their lab, J-PAL, has run more than a thousand such trials. The transferable lesson for a manager isn't the economics; it's the posture: replace "we believe X works" with "let's find a fair comparison and check."

flowchart TD
    A(["Can you randomly assign who gets the change?"]) -->|Yes, online at scale| B(["A/B test"])
    A -->|Yes, offline / smaller| C(["Field RCT"])
    A -->|No| D(["Is there a cut-off or staggered rollout?"])
    D -->|Cut-off threshold| E(["Regression discontinuity"])
    D -->|Rolled out to some groups first| F(["Difference-in-differences"])
    D -->|Neither| G(["Treat findings as correlational, not causal"])

A rough decision aid: randomise if you can; reconstruct a fair comparison if you can't; be honest when you can do neither.

So the move is: before you launch a company-wide change, ask whether you can stagger it, region by region, team by team, month by month. A staggered rollout isn't just politically easier; it hands you a difference-in-differences comparison for free. The limitation to say out loud: quasi-experiments rest on an assumption you can't fully prove, that the groups really would have moved together absent the change. They are evidence, not proof, and they are only as good as that assumption.

A worked example

Say you lead support at a 200-person SaaS company. A vendor pitches an AI reply-assistant; you believe it'll cut handle time. The tempting move is to switch it on for everyone, watch the dashboard, and declare victory. The better move costs two weeks.

You randomly assign your 40 agents to two groups of 20. Group B gets the assistant; group A keeps working as before. After two weeks you compare median handle time and a quality score (so you don't "win" by closing tickets badly). Suppose, and these numbers are illustrative, group B lands at 7.2 minutes per ticket versus group A's 8.0, with no drop in quality. Because agents were assigned at random, the busy-season surge, the new pricing page, and the influx of confused trial users all hit both groups equally. So the 0.8-minute gap points to the tool, not the week, provided it is bigger than the agent-to-agent noise you would expect with only 20 people per group.

Now suppose you couldn't split agents because the tool only installs team-wide. You fall back to a quasi-experiment: roll it out to the EMEA team in March and the APAC team in May, then compare how much EMEA's handle time changes after March with how much APAC's changes over the same weeks, while APAC is still working without the tool (difference-in-differences). Weaker, but a great deal more honest than "it felt faster." Either way, you've turned an expensive belief into a cheap test, which is the entire game. (For the question of which metric to trust here, see descriptive statistics: a median resists the outliers, such as a single 90-minute ticket, that would wreck a mean.)

Frequently asked questions

How many people do I need for a valid test?

Enough that a real effect won't be drowned out by noise, which depends on how big an effect you'd care about. Small effects on small samples are unmeasurable; you need either a large sample or a large effect. Before running anything, do a quick power calculation (most analytics tools have a calculator). If the honest answer is "we'd need 50,000 users and we have 800," that's useful to know before you spend the week.
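For a conversion-style metric, a rough power calculation fits in a few lines. This sketch uses the standard normal-approximation formula for comparing two proportions with equal group sizes and a two-sided test. The 5% significance level, 80% power and the example rates are conventional or invented choices, not recommendations.

```python
from math import ceil
from statistics import NormalDist

def users_per_group(baseline_rate, target_rate, alpha=0.05, power=0.80):
    """Approximate users needed in each group to detect baseline -> target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired chance of catching a real effect
    variance = baseline_rate * (1 - baseline_rate) + target_rate * (1 - target_rate)
    effect = target_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical baseline conversion of 10%.
print("to detect 10% -> 11%:", users_per_group(0.10, 0.11), "users per group")
print("to detect 10% -> 15%:", users_per_group(0.10, 0.15), "users per group")
```

With these made-up rates, a one-point lift needs on the order of 15,000 users per group, while a five-point lift needs several hundred. That asymmetry is the practical meaning of "a large sample or a large effect."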

Isn't an A/B test just splitting traffic and looking at the numbers?

That's the easy 80%. The hard 20% is trustworthiness: did assignment stay truly random, did you decide the success metric and sample size before peeking, did you avoid stopping the moment it looked good (which manufactures false positives)? Kohavi's book is largely a catalogue of ways a technically correct test still lies to you.
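The peeking problem can be shown with a simulated A/A test, where both arms are identical and any "win" is false by construction. This is a sketch under invented settings (a 10% conversion rate, 1,000 users per arm, a look every 100 users, a plain two-proportion z-test); it compares calling a winner at any look with checking once at the end.

```python
import random
from math import sqrt

def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two equal-sized arms of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def false_positive_rates(n_sims=1_000, users_per_arm=1_000, look_every=100,
                         rate=0.10, seed=1):
    """Return (false-positive rate when peeking, rate with one final check)."""
    rng = random.Random(seed)
    flagged_when_peeking = 0
    flagged_at_end = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        crossed_at_some_look = False
        for i in range(1, users_per_arm + 1):
            # Both arms convert at the same rate: there is no real effect to find.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if i % look_every == 0 and abs(z_stat(conv_a, conv_b, i)) > 1.96:
                crossed_at_some_look = True
        flagged_when_peeking += crossed_at_some_look
        flagged_at_end += abs(z_stat(conv_a, conv_b, users_per_arm)) > 1.96
    return flagged_when_peeking / n_sims, flagged_at_end / n_sims

peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives when stopping at any 'significant' look: {peeking_rate:.1%}")
print(f"false positives with one check at the planned end:       {fixed_rate:.1%}")
```

The single planned check should land near the nominal 5%, while the peeking rate comes out well above it, even though nothing about the product changed.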

What's the difference between correlation and causation here?

Correlation is "these two things moved together." Causation is "this one moved because of that one." A randomised experiment is the cleanest tool we have for getting from the first to the second, because randomisation rules out the rival explanations. Without it, you're observing, which is fine, as long as you don't narrate it as proof.

When is an experiment the wrong tool?

When the decision is genuinely irreversible and a wrong test is catastrophic; when the effect only appears over a horizon you can't afford to wait out; or when running the test is itself unethical or harmful. For those, lean on quasi-experiments, prior evidence, and judgement, and be honest that you're reasoning under more uncertainty.

Can I trust a quasi-experiment as much as an RCT?

Usually not quite: random assignment is the gold standard precisely because it needs the fewest assumptions. But a well-designed quasi-experiment with a plausible comparison group beats a randomised test you can't actually run, and it crushes a confident before-and-after chart. Match the rigour to the stakes.

Related in the Toolkit

Qualitative vs quantitative vs mixed methods: experiments tell you whether something works; qualitative methods tell you why.
Survey & sampling design: how you recruit and split participants decides whether your experiment is sound.
Interview & ethnographic techniques: for generating the hypotheses worth testing in the first place.
Validity, reliability & bias in research: the confounders, false positives and novelty effects that quietly break experiments.
Jobs-to-be-Done & needs research: make sure you're testing a change that addresses a real need.
First principles vs heuristics vs analogical reasoning: experiments are how you check whether your reasoning survives contact with reality.
Reversible vs irreversible decisions: testing is how you keep a big decision reversible while you learn.
Descriptive statistics (mean, median, mode, variance, SD): the maths you need to read an experiment's result without fooling yourself.

Where to go next

Trustworthy Online Controlled Experiments (Kohavi, Tang & Xu, 2020): the practitioner's bible for A/B testing; the companion site links the book and the authors' papers.
"The Surprising Power of Online Experiments", HBR (Kohavi & Thomke, 2017): the short, leader-friendly case for why most intuitions are wrong and experimentation pays.
Esther Duflo, "Social experiments to fight poverty", TED (2010): a 17-minute talk on using randomised trials to replace guesswork with evidence; the clearest argument for the method you'll find.
J-PAL, Introduction to randomized evaluations: a free, rigorous primer on RCTs and their limits from the lab behind much of the field evidence.
The 2019 Nobel Prize in Economic Sciences, official summary: the citation for Banerjee, Duflo and Kremer, and a concise account of why experimental evidence reshaped a field.