Usability & guerrilla testing: watch five people, fix what trips them

Here is the quiet truth behind most bad checkout flows, baffling dashboards and forms nobody finishes: nobody watched a real person try to use them before they shipped. The team reviewed the design, signed it off and moved on. Yet the team is the one group on earth that can't see the problem, because they already know where everything is. Usability testing is the small, almost embarrassingly simple act of fixing that. You give the thing to someone who isn't you and watch what happens.

The quick version

Usability testing means watching real people attempt real tasks with your product and noting where they struggle: not asking their opinion, but observing their behaviour.
Guerrilla testing is the budget version: a laptop, a few strangers in a café, a coffee as thanks. It trades statistical polish for speed and frequency.
You don't need a crowd. Jakob Nielsen's research found that around five users surface roughly 85% of the usability problems in an interface, so it's better to run three small tests than one big one.
The discipline is in your mouth: ask people to do a task, then stay quiet and let them struggle. The moment you start helping, you stop learning.

The idea in depth

Usability testing grew out of a deliberately cheap idea. Around 1990 Jakob Nielsen made the case for discount usability engineering: the argument that you don't need a lab, a one-way mirror or a research budget to find most of what's wrong with a design; you need a few users, a rough prototype, and someone watching. (Nielsen and Rolf Molich's companion work on heuristic evaluation came out of the same period.) The radical part was the economics. Testing had been treated as expensive and therefore rare. Nielsen's case was that cheap-and-frequent beats elaborate-and-occasional, because the point isn't to measure the design; it's to improve it.

That case rests on a specific, much-cited finding. In his 2000 article “Why You Only Need to Test with 5 Users,” Nielsen put a formula to it: the number of problems found by n users is $N(1 - (1 - L)^n)$, where N is the total number of usability problems in the design and L, the proportion a single user uncovers, averages about 31%. So the first user finds nearly a third of the issues, the next few find most of the rest, and by the fifth you're at roughly 85%. A sixth or seventh mostly re-discovers problems you've already seen. You learn an enormous amount from the first person and steadily less from each one after.
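
To make the arithmetic concrete, here is a minimal Python sketch of the formula above. It assumes Nielsen's average of L = 0.31 and reports the share of problems found rather than the raw count (that is, the formula divided through by N). The function name `share_found` is illustrative and does not come from Nielsen's article.

```python
def share_found(n_users: int, l_single: float = 0.31) -> float:
    """Share of usability problems found by n_users, per 1 - (1 - L)^n.

    l_single is the proportion a single user uncovers. 0.31 is the
    average Nielsen reports; real projects will vary around it.
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0.0 <= l_single <= 1.0:
        raise ValueError("l_single must be between 0 and 1")
    return 1.0 - (1.0 - l_single) ** n_users


if __name__ == "__main__":
    # Print the curve for a few study sizes.
    for n in (1, 3, 5, 10):
        print(f"{n:>2} users: {share_found(n):.1%} of problems")
```

With L = 0.31 the five-user figure comes out at a little over 84%, which is where the familiar "roughly 85%" comes from. Lowering `l_single` shows how quickly the rule weakens when each user uncovers fewer problems.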

flowchart LR
    A(["1 user<br/>~31% of problems"]) --> B(["3 users<br/>~67%"])
    B --> C(["5 users<br/>~85%"])
    C --> D(["10 users<br/>~98%"])
    D --> E(["diminishing<br/>returns"])

Nielsen's curve: each extra tester finds fewer new problems. Five is the practical sweet spot. Leaders Loop

The conclusion Nielsen drew matters more than the number itself. If five users find 85% of problems, the worst thing you can do with a budget for fifteen is spend it all on one fifteen-person study. Far better to run three rounds of five, fixing what you find between each, because the second round tests a better design, and the third tests a better one still. So the move is to stop treating usability testing as a single milestone before launch and start treating it as a rhythm. Steve Krug, in his DIY testing guide Rocket Surgery Made Easy (2010), reduces this to a maxim teams can actually keep: a morning a month, three users, debrief over lunch, agree the fixes before everyone leaves the room. The cadence is the product.

An honest limitation, because the five-user number gets badly over-stretched. It holds for qualitative testing (watching people to find problems), and only when your users are reasonably similar. The instant you have genuinely distinct audiences (say, first-time buyers and power admins, or two very different markets), you need five of each, because they trip on different things. And it does not apply to quantitative questions. If you want to know whether version A converts better than version B, or what percentage of users complete a flow, five people tell you almost nothing. That needs a far larger sample, and NN/g's own guidance is to test at least twenty users before the numbers mean much. Five users find problems. They do not measure them.
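
One way to see why five people cannot measure a completion rate is to put a confidence interval around it. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the article or NN/g prescribes; the study sizes and the 80% completion rate are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Hypothetical studies, each observing an 80% completion rate.
    for successes, n in ((4, 5), (16, 20), (32, 40), (160, 200)):
        low, high = wilson_interval(successes, n)
        print(f"{successes}/{n} completed: plausible true rate {low:.0%} to {high:.0%}")
```

Under these assumptions, four completions out of five is compatible with a true rate anywhere from roughly 38% to 96%, which is too wide to support any decision. The interval narrows only as the sample grows.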

Guerrilla testing: the version with no budget left

“We don't have time or money for research” is the most common reason usability testing doesn't happen, and guerrilla testing exists to remove the excuse. The method is what it sounds like: instead of recruiting and scheduling participants, you take a laptop or phone to where people already are (a café, a co-working space, a foyer) and ask a few strangers to try a task in exchange for a coffee or a voucher. You're not after a representative sample. You're after the obvious, brutal problems any reasonable human hits in the first thirty seconds.

Guerrilla testing won't tell you how many people fail. It will tell you, today, that they do, and exactly where.

This works because most usability failures are not subtle. The biggest ones (a button nobody recognises as a button, a label that means one thing to customers and another to the team) show up almost immediately and don't need a controlled environment to surface. Krug's central observation in Rocket Surgery is that the value of watching even one real person is so high, and the cost so low, that the only mistake is not doing it. The trade is explicit: you sacrifice rigour and representativeness for speed and frequency, on the bet that finding ten real problems roughly beats finding three precisely. For an early prototype, that bet is almost always right.

The limitation to name plainly: a café crowd is not your audience. Guerrilla testing is superb for catching the gross problems in a general-consumer flow, and close to useless for a specialist tool whose users are radiologists or commodity traders, because random strangers can't attempt tasks they'd never face. It also can't tell you anything quantitative. So treat guerrilla testing as your early, frequent, cheap pass to clear out the obvious failures, and reserve recruited, audience-matched testing for the questions where who the user is genuinely changes the answer. Match the rigour of the method to the stakes of the decision.

flowchart TD
    Q(["What do you<br/>need to know?"]) --> A{"Find problems,<br/>or measure them?"}
    A -->|find problems| B{"Is the user a<br/>specialist?"}
    A -->|measure / compare| M(["Quantitative test<br/>~40+ matched users,<br/>or A/B test"])
    B -->|no, general public| G(["Guerrilla test<br/>5 strangers, a café,<br/>do it this week"])
    B -->|yes| R(["Recruited test<br/>5 matched users<br/>per distinct group"])

Match the method to the question: five strangers find problems; a big sample measures them.
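
The same decision can be written down as a small function, which is handy if you want to put the rule in a team checklist or planning script. This is a sketch only: `Goal`, `choose_method` and the wording of the returned strings are illustrative names of my own, and the "five per distinct group" rule is applied only on the recruited branch, as in the diagram.

```python
from enum import Enum


class Goal(Enum):
    FIND_PROBLEMS = "find problems"
    MEASURE = "measure or compare"


def choose_method(goal: Goal, specialist_users: bool = False,
                  distinct_groups: int = 1) -> str:
    """Pick a test type by walking the two questions in the flowchart."""
    if distinct_groups < 1:
        raise ValueError("distinct_groups must be at least 1")
    # First question: are we finding problems, or measuring them?
    if goal is Goal.MEASURE:
        return "Quantitative test: a much larger matched sample, or an A/B test"
    # Second question: can a member of the general public attempt the task?
    if specialist_users:
        total = 5 * distinct_groups
        return f"Recruited test: {total} matched users (5 per distinct group)"
    return "Guerrilla test: about 5 strangers, somewhere people already are"


if __name__ == "__main__":
    print(choose_method(Goal.FIND_PROBLEMS))
    print(choose_method(Goal.FIND_PROBLEMS, specialist_users=True, distinct_groups=2))
    print(choose_method(Goal.MEASURE))
```

The order of the checks matters: the quantitative branch comes first because no amount of audience matching rescues a five-person sample when the question is "how many?".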

Whichever version you run, the hardest part is behavioural, and it's the same one professionals fight too: keeping quiet. The natural instinct, the second someone hesitates, is to help: “oh, it's the button on the right.” Every time you do that, you erase the exact data you came for. The skill is to set a neutral task (“you want to reorder last week's delivery; show me how you'd do that”), then say almost nothing, resisting the urge to rescue them. When they get stuck, the right response isn't a hint; it's a gentle “what are you thinking right now?” The struggle is the finding.

A worked example

Picture a mid-sized insurer (illustrative, not a real case) relaunching the online claim form for a home-contents policy. Internally everyone is happy: the form is logical, on-brand, and matches the new design system. Sign-off is a formality. One product manager, on a hunch, spends a single afternoon doing guerrilla tests in the building's ground-floor café, laptop open, asking: “can I borrow five minutes for a $10 coffee voucher?”

Five people try to start a pretend claim for a stolen laptop. The results (illustrative figures): four of the five get to the second screen and stall on a field labelled “Peril type.” It's perfectly clear to the underwriting team (“peril” is industry-standard for the cause of a loss) and perfectly opaque to everyone else, who read it as jargon and freeze. Three of the five also miss the “Save and continue later” link entirely, assume the half-finished form is lost, and say they'd give up and phone instead: the exact call-centre cost the digital channel was meant to remove.

Neither problem would have surfaced in internal review, because the team knows what “peril” means and knows the save link is there. The fixes are nearly free: rename the field “What happened?” with examples, and make the save option an obvious button. The afternoon cost five coffees and turned up two issues that, left alone, would have shown up months later as abandoned claims and inbound calls, far more expensive to diagnose, and impossible to pin on any single cause. The test didn't measure how many customers would have failed. It proved, cheaply and on day one, that they would.
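
For the debrief, notes like these can be tallied in a few lines so the team sees at a glance which problems recurred. The sketch below is illustrative: the participant labels, the issue wording, and which participant hit which problem are assumptions. Only the totals (four of five, three of five) come from the example above.

```python
from collections import Counter

# Illustrative session notes; the per-participant split is assumed.
SESSIONS = {
    "P1": ["stalled on 'Peril type'", "missed 'Save and continue later'"],
    "P2": ["stalled on 'Peril type'", "missed 'Save and continue later'"],
    "P3": ["stalled on 'Peril type'", "missed 'Save and continue later'"],
    "P4": ["stalled on 'Peril type'"],
    "P5": [],
}


def tally_issues(sessions: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Count how many participants hit each issue, most common first."""
    counts: Counter[str] = Counter()
    for issues in sessions.values():
        # Use a set so one participant never counts twice for the same issue.
        counts.update(set(issues))
    return counts.most_common()


if __name__ == "__main__":
    total = len(SESSIONS)
    for issue, hits in tally_issues(SESSIONS):
        print(f"{hits} of {total}: {issue}")
```

The counts are for prioritising fixes within this round, not for estimating how many real customers would fail; five sessions cannot support that kind of claim.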

Frequently asked questions

Is usability testing the same as A/B testing or market research?

No, and confusing them wastes both. A/B testing is quantitative: it measures which version performs better across many users, but won't tell you why. Market research and surveys capture what people say. Usability testing captures what people do, on a small sample, to find out why something is confusing. You use the small qualitative test to find and fix problems, then a larger quantitative test if you need to prove the difference. They answer different questions; run them in that order.

Why watch behaviour instead of just asking people what they think?

Because people are unreliable narrators of their own experience. They'll tell you a flow was “fine” while you've just watched them struggle through it, partly out of politeness and partly because they blame themselves, not your design. Opinions are nearly worthless here; observed behaviour is the gold. Ask people to do something and watch; don't ask them to rate it.

Won't five people miss the rarer problems?

Yes, by design. Five users catch the common, serious problems (roughly 85% of the total) and miss the rare edge cases. That's the right trade, because the common problems are the ones costing you customers. The way to reach the rarer issues isn't a bigger single test; it's another round of five on the improved design. Three rounds of five beat one round of fifteen, because you're testing a better design each time.

Isn't grabbing strangers in a café hopelessly unscientific?

For an early prototype, “unscientific” is the wrong worry. You're not publishing a paper; you're hunting for the obvious problems any reasonable person hits, and a café crowd finds those just fine. The scientific concern becomes real only when the right answer depends on who the user is: a specialist tool, a specific market. Then you recruit to match. Match the method to the stakes.

How do we get the team to act on the results?

Have them watch. A bug report is easy to argue with; watching a customer give up on a screen you designed is not. Krug's advice is to get as many of the team into the observation as possible and to agree the fixes in the debrief, while the discomfort is fresh, before the meeting ends and the momentum drains away. The testing finds the problems; the shared watching is what gets them fixed.

Related in the Toolkit

Design thinking & the double diamond: where testing lives in the wider explore-then-narrow process.
Human-centred design & empathy: why the user's behaviour, not the team's opinion, is the arbiter of good design.
Ideation & co-creation techniques (design studios, affinity mapping, card sorting, crazy-8s): generating the options you'll later put in front of users.
Design sprints: a structured week that ends in exactly this five-user test on Friday.
Information architecture, card sorting and tree testing: the usability methods for structure rather than screens.
Customer needs identification & latent needs: knowing what to test for in the first place.
Design systems & style guides: turning what you learn into reusable, consistent patterns.
Sales process & pipeline management: because a usability failure in a sign-up flow shows up downstream as a leak in the pipeline.

Where to go next

Why You Only Need to Test with 5 Users, Jakob Nielsen, NN/g (2000): the source of the five-user rule, with the formula and the case for many small tests over one big one.
Rocket Surgery Made Easy, Steve Krug (2010): the friendliest DIY playbook on how to run a test yourself, a morning a month, with three users and a script.
Usability Test Demo, Steve Krug: Krug's own recorded test of a real site; watch one before you run your own, to see what “stay quiet and observe” looks like in practice.
How Many Test Users in a Usability Study?, NN/g: the honest nuance on when five isn't enough, distinct user groups, and why quantitative studies need many more.