Your A/B test comes back: the new checkout flow lifted conversion, "p = 0.03, statistically significant." Someone wants to roll it out by Friday. Before you nod, it helps to know exactly what that number promised, and the three or four things it quietly didn't.

The quick version

  • A p-value answers one narrow question: if there were really no effect, how surprising would this data be? A small p means the data would be surprising under "no effect," which is evidence of a real signal, not proof of one.
  • The test statistic (a t-score for comparing averages, a chi-square for comparing counts/proportions) is the engine; the p-value is just the readout.
  • "Significant" means unlikely to be noise. It does not mean large, important, or worth doing. Always ask for the effect size next to the p-value.
  • The 0.05 threshold is a convention, not a law of nature. Treat it as a yellow light, not a verdict.

The idea in depth

Significance testing exists to solve a stubborn problem: random noise can look exactly like a real effect. Flip a fair coin ten times and you will sometimes get seven heads. Did the coin change, or did chance wobble? Significance testing is a disciplined way of answering "how often would pure luck fool me like this?"
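The coin question has an exact answer, and working it out once makes the rest of this page concrete. The sketch below is only an illustration of the idea; the function name is invented for this example, and it assumes a fair coin and independent flips.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more heads in n independent flips with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Seven or more heads in ten flips of a fair coin.
one_sided = prob_at_least(7, 10)

# "At least this lopsided in either direction": 7+ heads or 7+ tails.
# For a fair coin the two tails are mirror images, so the one-sided figure doubles.
two_sided = 2 * one_sided

print(f"P(7+ heads in 10 fair flips) = {one_sided:.3f}")
print(f"P(7+ heads or 7+ tails)      = {two_sided:.3f}")
```

A fair coin gives seven or more heads about 17% of the time, so seven heads on its own is weak grounds for saying the coin changed. That tail probability is a p-value in miniature.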

The machinery was built by three people across three decades. Karl Pearson introduced the chi-square test in 1900, the first general method for asking whether observed counts depart from what chance alone would produce (Pearson, "On the criterion…", Philosophical Magazine, 1900). Eight years later William Sealy Gosset, a brewer at Guinness publishing under the pen name "Student" because his employer forbade staff from publishing under their own names, derived the t-distribution to handle the small samples a brewery actually works with (Gosset, "The probable error of a mean," Biometrika, 1908). Then Ronald Fisher, in Statistical Methods for Research Workers (1925), popularised the p-value and, almost in passing, the 0.05 line: "it is convenient to take this point as a limit in judging whether a deviation is to be considered significant or not."

That "convenient" is doing a lot of work, and it is worth dwelling on. Fisher chose one-in-twenty because it was a tidy round figure for a man working with printed statistical tables, not because nature has a threshold there.

What a p-value actually is (and isn't)

Here is the careful definition: a p-value is the probability of seeing data at least as extreme as yours, assuming there is no real effect. That assumption, "no real effect", is the null hypothesis. A small p-value says the data would be surprising under that assumption, so the assumption looks shaky.

The American Statistical Association felt strongly enough about the misreadings to issue a formal statement in 2016 (Wasserstein & Lazar, The American Statistician). Two of its six principles are the ones leaders most need: a p-value "does not measure the size of an effect or the importance of a result," and "scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

So the move is: never let a p-value travel alone. Whenever someone hands you "significant," ask the next two questions: how big is the effect, and how confident are we about that size? A 0.1% conversion lift can be "significant" with enough traffic and still not pay for the engineering to ship it. This is where descriptive statistics do the work the p-value can't: the average, the spread, the confidence interval.

flowchart TD
    A(["Result lands: p < 0.05"]) --> B(["Ask: how big is the effect?"])
    B --> C{"Big enough to matter?"}
    C -->|No| D(["Statistically real, practically trivial, likely skip"])
    C -->|Yes| E(["Ask: how confident are we in that size?"])
    E --> F{"Narrow confidence interval?"}
    F -->|No| G(["Real but uncertain, gather more data"])
    F -->|Yes| H(["Worth acting on"])
					
A p-value is the first gate, not the decision. Effect size and confidence come next. Leaders Loop

Which test, and what the number under the hood means

You rarely compute these by hand; a spreadsheet or a data tool does it. But knowing which test fits which question keeps you from being bluffed by a confident-sounding analyst. It comes down to the kind of data you have, which is itself worth understanding first (see data types).

A t-test compares the averages of two groups when the outcome is a number: revenue per user, handle time, satisfaction score. It produces a t-score: roughly, the gap between the two averages divided by how noisy that gap is. A big t-score means the difference is large relative to the random jitter, which yields a small p-value.
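To make "gap divided by noise" tangible, here is a minimal sketch that computes a t-score from summary figures. The handle-time numbers are hypothetical, the function names are invented for this example, and the p-value uses a large-sample shortcut (treating the t-score as a normal z-score) rather than the exact t-distribution a statistics tool would use.

```python
from math import sqrt
from statistics import NormalDist

def t_score(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Gap between two averages divided by the standard error of that gap (Welch form)."""
    standard_error = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    return (mean_a - mean_b) / standard_error

def approx_two_sided_p(t):
    """Large-sample approximation only: reads the t-score off the normal curve."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Hypothetical handle times in minutes: (average, standard deviation, group size).
t = t_score(12.4, 4.0, 500, 11.8, 4.2, 500)
p = approx_two_sided_p(t)

print(f"t = {t:.2f}, approximate p = {p:.3f}")
```

Notice what moves the t-score: a wider gap raises it, more spread lowers it, and more customers per group raises it. With a few hundred per group the normal shortcut is close; with a few dozen it understates the p-value, so lean on the tool's exact figure there.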

A chi-square test compares counts or proportions across categories: did/didn't convert, support tier A/B/C, churned/stayed. It compares the counts you observed against the counts you'd expect if the categories were unrelated, and a large chi-square (which, again, means a small p-value) says the pattern is hard to explain by chance.
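The "observed versus expected" comparison is short enough to write out. This is a sketch for the simplest case, a two-by-two table, using made-up conversion counts and an invented function name. It applies no continuity correction, which some tools apply by default, so their p-value may come out slightly higher.

```python
from math import erfc, sqrt

def chi_square_2x2(yes_a, no_a, yes_b, no_b):
    """Pearson chi-square and p-value for a 2x2 table of counts (no continuity correction)."""
    observed = [[yes_a, no_a], [yes_b, no_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [yes_a + yes_b, no_a + no_b]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count we would expect in this cell if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With one degree of freedom, chi-square is a squared z-score,
    # so the p-value is the two-sided normal tail beyond sqrt(chi2).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical: 120 of 1,000 converted in group A, 160 of 1,000 in group B.
chi2, p = chi_square_2x2(120, 880, 160, 840)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

Each cell contributes its squared miss, scaled by how many customers were expected there. The bigger the misses, the bigger the chi-square and the smaller the p-value.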

flowchart TD
    A(["What's your outcome?"]) --> B{"A number, or a category?"}
    B -->|"Number (avg revenue, time, score)"| C(["Comparing two group averages?"])
    C -->|Yes| D(["t-test → t-score"])
    B -->|"Category (yes/no, tier, churned)"| E(["Comparing counts across groups?"])
    E -->|Yes| F(["Chi-square test"])
					
Pick the test from the shape of your data, not the other way round.

An honest limitation, and the one that trips up the most teams: p-values reward large samples. With enough data, almost any difference, however tiny, crosses 0.05. The 2019 ASA follow-up (Wasserstein, Schirm & Lazar, The American Statistician) went further than the 2016 statement, arguing the profession should stop treating "statistically significant" as a bright line at all and move toward judging the full picture of evidence and uncertainty. You don't have to abandon the threshold to take the warning seriously: a significant result on a huge sample is a prompt to look at effect size, not a licence to skip it.

The mirror-image trap: with a small sample, a real and important effect can fail to reach significance simply because there wasn't enough data to rule out luck. "Not significant" means "we couldn't tell," not "no effect." Absence of evidence isn't evidence of absence. And a result that does reach significance has its own limit: a significant correlation still isn't proof of cause (see correlation vs causation).
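Both traps show up if you hold the effect fixed and change only the sample size. The rates below are hypothetical, and the function is a simplified stand-in for what a testing tool does (a pooled two-proportion z-test with equal group sizes), not a replacement for it.

```python
from math import erfc, sqrt

def two_proportion_p(rate_a, rate_b, n_per_group):
    """Two-sided p-value for the gap between two rates, equal group sizes (pooled z-test)."""
    pooled = (rate_a + rate_b) / 2
    standard_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = abs(rate_a - rate_b) / standard_error
    return erfc(z / sqrt(2))

# Trap 1: a tiny lift (5.0% to 5.1%) turns "significant" once traffic is large enough.
for n in (10_000, 100_000, 1_000_000):
    print(f"tiny lift, {n:>9,} per group: p = {two_proportion_p(0.050, 0.051, n):.3f}")

# Trap 2: a sizeable gap (10% to 8%) misses 0.05 when the sample is small.
for n in (300, 3_000):
    print(f"big gap,   {n:>9,} per group: p = {two_proportion_p(0.10, 0.08, n):.3f}")
```

The effect never changes inside either loop; only the amount of data does. That is why the p-value cannot stand in for the effect size, in either direction.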

A worked example

Your support team trials a new onboarding email. You want to know if it reduces first-month churn. (Figures below are illustrative.)

You have two outcomes to look at, and they need different tests:

  • Churned vs stayed is a count, split across two groups (got the email / didn't). That's a chi-square. Say churn was 9.0% in the control group and 7.5% in the email group, across 4,000 customers each. The chi-square test returns p = 0.02: the gap is bigger than chance would comfortably produce.
  • Support tickets per customer in month one is a number. To compare the two group averages you'd run a t-test. Suppose control averaged 1.8 tickets and the email group 1.7, with a t-score that gives p = 0.31, well above 0.05, so you can't rule out noise.
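For the churn figures, the three numbers worth having side by side (effect size, p-value, confidence interval) can be sketched in a few lines. This uses only the illustrative rates and group sizes above and a normal approximation throughout; the variable names are invented here, and a tool that applies a continuity correction will report a slightly different p-value.

```python
from math import erfc, sqrt

n = 4000                                   # customers per group, from the example
churn_control, churn_email = 0.090, 0.075  # illustrative churn rates from the example

# Effect size: the gap itself.
gap = churn_control - churn_email

# Significance: pooled two-proportion z-test; z squared equals the uncorrected chi-square.
pooled = (churn_control + churn_email) / 2
z = gap / sqrt(2 * pooled * (1 - pooled) / n)
p_value = erfc(abs(z) / sqrt(2))

# Confidence: approximate 95% interval for the gap.
se_gap = sqrt(churn_control * (1 - churn_control) / n
              + churn_email * (1 - churn_email) / n)
low, high = gap - 1.96 * se_gap, gap + 1.96 * se_gap

print(f"gap = {gap * 100:.1f} points, chi-square = {z * z:.2f}, p = {p_value:.3f}")
print(f"95% CI for the gap: {low * 100:.1f} to {high * 100:.1f} points")
```

The p-value comes out near 0.015, which rounds to the 0.02 quoted above. The interval runs from roughly 0.3 to 2.7 points, so the honest summary is "a churn reduction of about 1.5 points, plausibly anywhere from a sliver to nearly 3." That range, not just the midpoint, is what belongs in the business case.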

Read together, these tell a sharper story than either alone. The churn effect is significant and meaningful: 1.5 percentage points on month-one churn is real money. The ticket effect is small and not significant, so don't claim it. The move is to ship the email for its churn benefit, label the ticket finding as inconclusive rather than dressing it up, and resist the temptation to re-run the analysis ten different ways until something else crosses 0.05. That last habit, "p-hacking", manufactures significance out of pure chance, and it's the fastest way to get fooled by your own data.
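The arithmetic behind p-hacking is simple enough to check yourself. The sketch below assumes the extra analyses are independent and that there is no real effect to find. Re-slicing the same data is not truly independent, so treat the figures as a rough guide rather than an exact rate.

```python
ALPHA = 0.05  # the conventional threshold discussed above

def chance_of_false_alarm(num_tests, alpha=ALPHA):
    """Probability that at least one of num_tests independent no-effect tests falls below alpha."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {chance_of_false_alarm(k):.0%} chance of at least one 'significant' fluke")
```

Under those assumptions, ten tries give roughly a 40% chance that something crosses 0.05 by luck alone. Deciding which comparisons matter before looking at the data is the cheap defence.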

Frequently asked questions

Does p < 0.05 mean there's a 95% chance the result is real?

No. This is the most common misreading. The p-value is the probability of the data assuming no effect, not the probability of the effect given the data. A p of 0.03 doesn't mean "97% likely to be true." It means "if there were no effect, data this extreme would show up about 3% of the time."
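To see how far apart those two probabilities can sit, here is a rough sketch using Bayes' rule. Every input is an assumption made up for illustration: that 10% of the ideas you test have a real effect, that your test catches a real effect 80% of the time, and that you call anything under 0.05 significant. The function name is also invented for this example.

```python
def chance_effect_is_real(prior_real, power, alpha=0.05):
    """Share of 'significant' results that reflect a real effect, by Bayes' rule."""
    true_hits = prior_real * power            # real effects the test detects
    false_alarms = (1 - prior_real) * alpha   # no-effect tests that cross the threshold anyway
    return true_hits / (true_hits + false_alarms)

# Assumed inputs, not measurements: 10% of ideas work, 80% detection rate, 0.05 threshold.
share = chance_effect_is_real(prior_real=0.10, power=0.80)
print(f"Share of significant results that are real: {share:.0%}")
```

Under these assumptions only about 64% of significant results are real, well short of 95%. Change the assumed share of good ideas and the answer moves a lot, which is exactly why the p-value alone cannot tell you the chance a result is true.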

What's the difference between statistical and practical significance?

Statistical significance asks "is this probably more than noise?" Practical significance asks "is it big enough to act on?" They come apart constantly: huge samples make trivial differences significant, and small samples can hide important ones. Always read the effect size next to the p-value.

Is 0.05 a magic number?

No. Fisher picked it for convenience in 1925, and the ASA has twice cautioned against treating it as a verdict. For a cheap, reversible change you might happily act on weaker evidence; for an expensive, hard-to-undo one you'd want a much smaller p-value plus a large effect (see reversible vs irreversible decisions).

When do I use a t-test versus chi-square?

Use a t-test when your outcome is a number and you're comparing two group averages (revenue, time, score). Use chi-square when your outcome is a category and you're comparing counts or proportions (converted/didn't, churned/stayed). The data type decides, not the question's wording.

If a result isn't significant, does that prove there's no effect?

No. "Not significant" usually means the sample was too small to tell, or the effect too subtle to separate from noise. It's "we couldn't detect it," not "it isn't there." Report it as inconclusive, and if it matters, collect more data.
