Growth experimentation & test-and-learn: a practical guide

In a meeting room somewhere, the highest-paid person is about to decide which growth idea ships. Their instinct is good, their track record is real, and they are still wrong more often than they think. The whole point of test-and-learn is to stop guessing whose hunch is best and start letting customers vote with their behaviour, cheaply, quickly, and before the idea has cost you a quarter.

The quick version

Test-and-learn means treating a growth idea as a hypothesis and running a controlled experiment (often an A/B test) to see whether it actually moves a metric, instead of debating it or shipping it on faith.
Most ideas fail. At Microsoft, only about a third of well-designed experiments improved the metric they targeted; Stefan Thomke reports that for every online experiment that succeeds, nearly ten don't.
That's the argument for experimenting, not against it: Bing's best revenue idea ever (a tiny ad-headline change, worth roughly $100M a year) had been shelved as low-priority until someone tested it.
The move: pick one clear metric, write the hypothesis down before you start, run a fair comparison with a control group, and decide in advance what result makes you ship, kill or iterate.

The idea in depth

Test-and-learn is a small phrase for a big shift in how growth decisions get made: from argument to evidence. Instead of asking "which idea do we believe in?", you ask "what's the smallest fair test that would tell us if this works?" Then you run it, watch real customer behaviour, and let the result decide. The mechanics borrow from the scientific method: a hypothesis, a controlled comparison, and a pre-declared way to read the outcome.

flowchart LR
    H(["Hypothesis: if we change X,
metric Y will move"]) --> B(["Build the smallest
fair test"])
    B --> M(["Measure vs a control
group (A/B test)"])
    M --> L(["Learn: ship, kill,
or iterate"])
    L -->|next bet| H

The test-and-learn loop: a hypothesis, a controlled comparison, a decision, and the next bet. Leaders Loop

Why opinion is a bad way to pick growth bets

The uncomfortable finding underneath this whole discipline is that most growth ideas don't work, and we're bad at telling which ones will in advance. Ron Kohavi, who ran experimentation at Amazon, Microsoft and later Airbnb, has reported for years that at Microsoft only about one-third of well-designed experiments actually improved the metric they were built to improve; another third did nothing, and a third made things worse. Stefan Thomke of Harvard Business School puts the broader pattern bluntly in his 2020 article "Building a Culture of Experimentation": "for every experiment that succeeds, nearly 10 don't." These aren't bad teams. They're the best-instrumented product organisations on earth, and they still find that the modal idea is a dud.

That should change how you feel about your own roadmap. If two-thirds of carefully-chosen ideas fail at Microsoft, the confident bullet points in your growth deck are not a plan; they are a list of guesses with good PowerPoint.

So stop adjudicating ideas by seniority and start treating them as hypotheses. The move is cultural more than technical. When someone proposes a growth change, and that someone includes you, the next question isn't "do we like it?" but "what's the test, what's the metric, and what result would change our minds?" People sometimes call that reframing "killing the HiPPO," the highest-paid person's opinion. Done well, it's most of the value before you've run a single test.

Small tests can hide enormous value

If most ideas fail, why bother? Because the experiments that do win can be wildly asymmetric, and you usually can't tell the giants from the duds by looking. The canonical example comes from Kohavi and Thomke's 2017 Harvard Business Review article "The Surprising Power of Online Experiments." A Bing employee suggested a minor tweak to how ad headlines displayed, lengthening the title by folding in text from the line below it. It was judged low-priority and sat on the backlog for months. When an engineer finally ran it as a quick A/B test, it lifted revenue by about 12%, roughly $100 million a year, and became, in their words, the best revenue-generating idea in Bing's history.

"For every experiment that succeeds, nearly 10 don't." (Stefan Thomke, Harvard Business Review, 2020)

Read that the wrong way and you conclude "ship more tweaks." The real point is narrower: a controlled experiment is cheap, and being wrong about which ideas matter is expensive. Experimentation buys you a lot of cheap lottery tickets on asymmetric upside while capping the downside at the cost of the test. It's the same engine behind Eric Ries's Lean Startup (2011) and its build-measure-learn loop: the unit of progress isn't shipping features but validated learning, which means turning an assumption into a fact before you've spent the quarter on it.

Lower the cost of being wrong, then. Build the cheapest honest version of the test (a fake-door button, a landing page, a single segment, a one-week holdout) rather than the full feature. The first build exists to teach you something, not to launch. (We unpack that mindset in growth loops and flywheels, where compounding only kicks in once you know which loop actually turns.)

The honest limitation: a bad experiment is worse than none

Here is where test-and-learn breaks down. A controlled experiment is only trustworthy if it's run properly, and the failure modes are easy to fall into. Kohavi, Diane Tang and Ya Xu devote much of their 2020 book Trustworthy Online Controlled Experiments to exactly this: results that look significant but vanish on replication, metrics that move for reasons unrelated to your change, peeking at the data and stopping the moment it looks good, and chasing tiny "wins" that are really statistical noise. Run enough sloppy tests and you'll confidently ship a string of changes that do nothing, or harm.
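To make the peeking trap concrete, here is a minimal simulation sketch. It runs A/A tests, where both arms share the same true conversion rate, so every "significant" result is a false positive. The 10% rate, the ten interim looks, the sample sizes and the function names are all assumptions chosen for illustration, not figures from Kohavi, Tang and Xu.

```python
import random
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of equal size n (pooled SE)."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def simulate_aa_tests(n_tests=1000, n_per_arm=1000, looks=10,
                      rate=0.10, alpha=0.05, seed=7):
    """Return (false-positive rate reading once at the end,
    false-positive rate when stopping at the first significant look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_arm // looks
    fixed_hits = peeking_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        significant_at_any_look = False
        significant_at_end = False
        for look in range(1, looks + 1):
            # Both arms draw from the SAME rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            significant = abs(z_stat(conv_a, conv_b, look * step)) > z_crit
            significant_at_any_look = significant_at_any_look or significant
            significant_at_end = significant  # overwritten until the final look
        fixed_hits += significant_at_end
        peeking_hits += significant_at_any_look
    return fixed_hits / n_tests, peeking_hits / n_tests


fixed_rate, peeking_rate = simulate_aa_tests()
print(f"read once at the end: {fixed_rate:.1%} false positives")
print(f"stop at first 'win':  {peeking_rate:.1%} false positives")
```

Reading once at the planned end should land near the nominal 5%, while stopping at the first significant look typically produces a much higher false-positive rate on exactly the same data.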

Two more honest caveats. First, experimentation needs traffic: a low-volume B2B product with a handful of deals a quarter can't A/B test its way to truth the way a consumer site with millions of sessions can; there, smaller-sample methods and judgement carry more weight. Second, experiments optimise within the current design; they're brilliant at finding the better button and useless at telling you to build a different product. Test-and-learn sharpens the idea you have. It doesn't generate the bold one.

So the discipline is mostly about protecting the result's trustworthiness. Decide the metric and the minimum effect that counts before you start. Pre-register the hypothesis. Don't stop the test early because the numbers look nice. And treat a "flat" result as real information, not a failed test. A practice that only believes its wins isn't a discipline; it's confirmation bias with a dashboard.
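One way to hold yourself to that discipline is to write the plan down as data before the test starts. The sketch below is an assumption about how a team might encode it, not a standard tool: `ExperimentPlan`, `decide`, the 14-day run length and the ship/kill/iterate thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the plan cannot be edited once it exists
class ExperimentPlan:
    hypothesis: str
    metric: str
    min_relative_lift: float  # smallest lift worth shipping, e.g. 0.05 for 5%
    run_days: int             # planned run length, fixed up front
    alpha: float = 0.05       # significance threshold (a conventional choice)


def decide(plan, days_run, observed_lift, p_value):
    """Apply the pre-declared rule; one possible rule, assumed for illustration."""
    if days_run < plan.run_days:
        return "keep running"  # no early stopping, however nice the numbers look
    if p_value >= plan.alpha or observed_lift <= 0:
        return "kill"          # flat or negative: real information, not a failed test
    if observed_lift >= plan.min_relative_lift:
        return "ship"
    return "iterate"           # a real effect, but smaller than the bar


# Hypothetical plan, written before any data comes in.
plan = ExperimentPlan(
    hypothesis="If we change X, metric Y will move",
    metric="Y",
    min_relative_lift=0.05,
    run_days=14,
)

print(decide(plan, days_run=7, observed_lift=0.20, p_value=0.001))
print(decide(plan, days_run=14, observed_lift=0.09, p_value=0.001))
```

The first call returns "keep running" even though the interim numbers look excellent, which is the point: the rule was fixed before anyone saw the data.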

A worked example

The figures below are illustrative, chosen to show the mechanics rather than to report a real company.

Imagine a subscription fitness app where the growth team is split. The head of marketing is sure that adding social proof, "Join 200,000 members", to the signup page will lift conversions. A product manager thinks the real problem is that the page asks for a credit card too early. Both are plausible; both have advocates; the old way would be to argue it out and ship whichever voice was loudest.

Instead they write two hypotheses down. H1: adding the member count to signup lifts free-trial starts by at least 5%. H2: moving the credit-card request to after the first workout lifts trial starts by at least 5%. They agree the metric, the minimum effect worth shipping (5%), and the run length up front: no peeking, no stopping early.
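Agreeing the run length up front usually means estimating how many visitors each arm needs. The sketch below uses the standard normal-approximation formula for comparing two proportions. The example does not state a baseline conversion rate, so the 20% figure is an assumption, as are the conventional 5% significance level and 80% power.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed per arm to detect `relative_lift` over `baseline`
    with a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


ASSUMED_BASELINE = 0.20  # hypothetical trial-start rate, not from the example
MIN_LIFT = 0.05          # the 5% bar the team agreed on

needed = sample_size_per_arm(ASSUMED_BASELINE, MIN_LIFT)
print(f"visitors needed per arm: {needed:,}")
```

Under these assumptions the answer is roughly 25,600 visitors per arm. The same bar at a 10% baseline needs about 57,800 per arm, which is why the baseline has to be part of the plan rather than an afterthought.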

flowchart TD
    A(["100,000 visitors
split 3 ways"]) --> C(["Control
current page"])
    A --> V1(["Variant 1
social proof"])
    A --> V2(["Variant 2
delayed card ask"])
    C --> R(["Read results vs control,
against the pre-set 5% bar"])
    V1 --> R
    V2 --> R
    R --> D(["Ship the delayed card ask;
kill social proof; iterate next"])

One question, a fair three-way split, a pre-declared decision rule. Illustrative figures.

The result surprises both camps. Social proof (the marketer's favourite) moves trial starts by less than 1%: inside the noise and not worth shipping. The delayed credit-card request lifts them by 9%, comfortably clearing the bar. The team ships the card change, parks the social-proof idea, and, crucially, nobody loses face, because the customers decided, not the most senior voice in the room.

The win here isn't the 9%. It's that a confident, expensive idea got killed for a tenth of what shipping it blind would have cost, and the team now disagrees about the next bet instead of the last one.
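The read-out in this example can be sketched as a two-proportion z-test of each variant against control. The counts below are hypothetical: they assume an even split of about 33,333 visitors per arm and a 20% baseline trial-start rate, and were chosen only so the lifts match the illustrative "less than 1%" and "9%" above.

```python
from statistics import NormalDist


def compare_to_control(control_conv, control_n, variant_conv, variant_n):
    """Return (relative lift, two-sided p-value) from a two-proportion z-test."""
    p_c = control_conv / control_n
    p_v = variant_conv / variant_n
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = (pooled * (1 - pooled) * (1 / control_n + 1 / variant_n)) ** 0.5
    z = (p_v - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (p_v - p_c) / p_c, p_value


# HYPOTHETICAL counts, invented to mirror the illustrative lifts in the text.
VISITORS_PER_ARM = 33_333
CONTROL_STARTS = 6_667
variant_starts = {"social proof": 6_720, "delayed card ask": 7_267}

MIN_LIFT = 0.05  # the pre-set 5% bar
ALPHA = 0.05     # conventional significance threshold (an assumption)

results = {}
for name, starts in variant_starts.items():
    lift, p_value = compare_to_control(
        CONTROL_STARTS, VISITORS_PER_ARM, starts, VISITORS_PER_ARM
    )
    clears_bar = lift >= MIN_LIFT and p_value < ALPHA
    results[name] = (lift, p_value, clears_bar)
    print(f"{name}: lift {lift:+.1%}, p = {p_value:.3f}, clears the bar: {clears_bar}")
```

With these counts the social-proof variant is statistically indistinguishable from control, while the delayed card ask clears both the significance threshold and the 5% bar. A stricter rule might require the lower end of a confidence interval to clear the bar rather than the point estimate.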

Frequently asked questions

Is test-and-learn just A/B testing?

A/B testing is the most common tool, but test-and-learn is the broader habit: framing growth ideas as hypotheses and using evidence to decide. The evidence might be a randomised A/B test, a holdout group, a painted-door test (also called a fake-door test: a button that measures intent before the feature exists), a geo-based rollout, or a quick qualitative study. The discipline is "what would change our minds, and how cheaply can we find out?"; the specific test is whatever fits the question and your traffic.

We don't have millions of users. Can we still do this?

Yes, but adapt the toolkit. Classic A/B testing needs enough traffic to detect a real effect; below that, you'll mostly measure noise. Low-volume teams lean harder on qualitative experiments (concierge tests, fake-door tests, customer interviews), before-and-after comparisons with sensible caution, and rolling a change out to one segment or region as a quasi-experiment. The hypothesis-and-decision-rule discipline matters even more when you can't lean on large numbers.

If most experiments fail, isn't this a waste of effort?

That reverses the logic. A high failure rate is evidence that ideas are genuinely uncertain, which is exactly when testing pays off, because the alternative (shipping them all on faith) means shipping the two-thirds that don't work and never knowing. The cost of an experiment is small; the cost of confidently building the wrong thing for a quarter is not. You're buying information, and information about which ideas are duds is valuable.

How do I avoid fooling myself with the results?

Decide the success metric and the minimum meaningful effect before you run; don't stop the moment the numbers look good (peeking inflates false positives); be suspicious of surprisingly large wins (Twyman's law: any figure that looks interesting or different is usually wrong); and re-test the ones that matter. Kohavi, Tang and Xu's Trustworthy Online Controlled Experiments is the practical reference for these traps.

Won't an experimentation culture slow us down?

It feels slower for one idea and is faster across many. The drag isn't the test (a well-run experiment ships alongside the work); it's the cultural shift away from "the senior person decides." Organisations that test at scale (Thomke cites Booking.com running some 25,000 tests a year) move faster precisely because they're not relitigating opinions; the data settles the argument and the team moves to the next bet.

Related in the Toolkit

Growth-lever framework (acquisition, activation, retention, monetisation, referral): the map of where to point your experiments; pick the weakest lever, then test into it.
Growth loops, flywheels & compounding: experiments are how you find which loop actually turns before you invest in spinning it.
Recurring-revenue metrics (ARR/MRR waterfall, Rule of 40, magic number, CAC payback): the financial scoreboard that tells you which metric your experiments should be moving.
Net & gross revenue retention (NRR/GRR) & expansion economics: retention experiments often beat acquisition ones; this is how you know.
Upsell, cross-sell & land-and-expand: a rich field for expansion experiments once you can read the value signals.
Customer needs identification & latent needs: where the hypotheses come from; testing sharpens an idea, it doesn't invent it.
Design sprints: a structured five-day way to generate and pre-test a hypothesis before you commit to building it.
Engagement, retention & loyalty programs: the onboarding and engagement nudges that experimentation is especially good at tuning.

Where to go next

"The Surprising Power of Online Experiments", Kohavi & Thomke, HBR (2017): the short, persuasive case for experimenting, including the Bing ad-headline story; the best single piece to send a skeptical executive.
"Building a Culture of Experimentation", Stefan Thomke, HBR (2020): why the blocker is culture, not tooling, and what an experimentation-friendly organisation actually does differently.
Trustworthy Online Controlled Experiments, Kohavi, Tang & Xu (Cambridge, 2020): the definitive practical handbook on running A/B tests you can actually trust, traps and all.
The Lean Startup, Eric Ries (2011): where build-measure-learn and "validated learning" entered the mainstream; the philosophy behind the mechanics.
"Online Controlled Experiments: Lessons from Running A/B/n Tests for 12 Years", Ron Kohavi (video): a practitioner keynote on what really happens when you run experiments at scale, including the counter-intuitive results and common pitfalls.