Confidence intervals, effect sizes & R-squared, explained

Someone pushes a slide across the table: "The new onboarding flow lifted activation by 12%, p < 0.05." Heads nod. The question that should be asked next, but usually isn't, is: 12% give or take how much, and is that worth the engineering it would take to roll out? That single follow-up is the whole of this guide.

The quick version

Confidence interval: the range of values consistent with your data, not a single lucky point. A wide interval means "we don't really know yet."
Effect size: how big the difference is, in plain units. Statistical significance tells you a difference is unlikely to be pure chance; effect size tells you whether it's worth caring about.
R (the correlation): how tightly two things move together, from −1 to +1. R-squared: the share of the ups and downs in one thing your model actually accounts for.
None of these is a verdict. They're the start of a conversation, and a small sample or a hidden cause can fool any of them.

The idea in depth

For most of the last century, the headline number in any analysis was the p-value, and the ritual was binary: below 0.05, celebrate; above it, bin the idea. That ritual is exactly what statisticians have spent thirty years trying to dislodge, not because p-values are meaningless, but because they answer a narrower question than people think. A p-value tells you how surprising your data would be if there were no real effect at all. It does not tell you how big the effect is, how sure you should be, or whether to act. The numbers below do.
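To make that definition concrete, here is a minimal Python sketch, an illustration under made-up numbers rather than a real analysis. It simulates many A/A tests, where both arms share the same true activation rate, and counts how often a two-proportion z-test still reports p < 0.05. The 30% base rate, the 200 users per arm and the function names are all assumptions chosen for the demo.

```python
import math
import random

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test (normal approximation)."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (successes_a / n_a - successes_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for a standard normal

def false_alarm_rate(true_rate=0.30, n_per_arm=200, n_experiments=2000, seed=1):
    """Share of A/A experiments (no real effect) that still come out 'significant'."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(n_experiments):
        a = sum(rng.random() < true_rate for _ in range(n_per_arm))
        b = sum(rng.random() < true_rate for _ in range(n_per_arm))
        if two_proportion_p_value(a, n_per_arm, b, n_per_arm) < 0.05:
            alarms += 1
    return alarms / n_experiments

rate = false_alarm_rate()
print(f"Share of no-effect experiments with p < 0.05: {rate:.3f}")
```

The share should land near 5%, which is all the 0.05 threshold promises: a cap on false alarms when nothing is going on. It is silent on size, certainty and action.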

A confidence interval is an honesty range, not a guarantee

Any number from a sample (a conversion rate, an average handle time, an engagement score) is a single draw from a noisier reality. Run the test again next week and you'd get a slightly different number. A 95% confidence interval is the band around your estimate that captures that wobble. The technical definition is slippery and even researchers fumble it: it is not "a 95% chance the true value is in this range." It means that if you repeated the study many times, about 95% of the intervals you'd build would contain the true value. In practice, treat it as your honesty range: the spread of answers consistent with what you saw.
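The "repeat the study many times" definition is easy to check by simulation. The sketch below is a hedged illustration: the true mean of 50, the spread of 10, the sample size of 100 and the helper names are all invented, and the interval uses the common "estimate plus or minus 1.96 standard errors" approximation.

```python
import math
import random
import statistics

def mean_ci_95(sample):
    """Approximate 95% CI for a mean: estimate +/- 1.96 standard errors."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - 1.96 * se, mean + 1.96 * se

def coverage(true_mean=50.0, true_sd=10.0, n=100, n_studies=2000, seed=7):
    """Repeat the same study many times; count the intervals that contain the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        low, high = mean_ci_95(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / n_studies

share = coverage()
print(f"Intervals containing the true mean: {share:.1%}")
```

Roughly 95 in 100 of the simulated intervals should catch the true value. Any single interval either contains it or doesn't, which is why the "95% chance" phrasing is technically wrong.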

This matters because the width carries the message. "Activation rose 12% (95% CI: 10% to 14%)" and "activation rose 12% (95% CI: −3% to 27%)" report the same headline and mean completely different things. The first is a finding; the second is a shrug wearing a finding's clothes. The statistician Geoff Cumming popularised the "dance of the p-values" to show how wildly a p-value jumps between identical repeated experiments while the confidence interval stays comparatively steady and informative. The practical rule: never let a point estimate travel without its interval. If the interval straddles zero, or straddles the threshold where you'd actually decide differently, you don't have an answer yet; you have a reason to gather more data.
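Sample size is usually what separates those two intervals. The sketch below reads the 12% as 12 percentage points (40% to 52% activation), which is an assumption since the example doesn't say, and picks per-arm sample sizes purely for illustration. It uses the simple Wald interval for a difference between two proportions.

```python
import math

def diff_ci_95(rate_control, rate_treatment, n_per_arm):
    """95% Wald interval for the difference between two proportions."""
    diff = rate_treatment - rate_control
    se = math.sqrt(
        rate_control * (1 - rate_control) / n_per_arm
        + rate_treatment * (1 - rate_treatment) / n_per_arm
    )
    return diff - 1.96 * se, diff + 1.96 * se

def straddles(interval, threshold=0.0):
    """True when the interval contains the threshold, so the data can't settle it."""
    low, high = interval
    return low <= threshold <= high

for n in (4700, 80):
    low, high = diff_ci_95(0.40, 0.52, n)
    verdict = "no answer yet" if straddles((low, high)) else "a finding"
    print(f"n per arm = {n:>4}: lift = 12 pts, 95% CI {low:+.1%} to {high:+.1%} -> {verdict}")
```

With these assumed inputs, about 4,700 users per arm gives an interval close to 10% to 14%, while 80 per arm gives something close to −3% to 27%. Passing a different `threshold` to `straddles` checks the interval against your own decision line instead of zero.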

"Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect."
American Statistical Association statement on p-values, Wasserstein & Lazar, 2016

Effect size answers "big enough to bother?"

Significance and importance are different questions, and conflating them is the most expensive statistical mistake leaders make. With a large enough sample, a trivially small difference becomes "statistically significant": true, real, and not worth a sprint. Effect size is the antidote: it measures the magnitude of a difference in interpretable units, independent of how many people you tested.
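A quick sketch shows the mechanism. It holds a tiny difference fixed (30.0% versus 30.2%, invented numbers) and changes only the sample size, using a pooled two-proportion z-test. The function name is an assumption made for this illustration.

```python
import math

def z_test_two_proportions(rate_a, rate_b, n_per_arm):
    """Return (difference, two-sided p-value) for two equal-sized arms."""
    pooled = (rate_a + rate_b) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    z = (rate_b - rate_a) / se
    return rate_b - rate_a, math.erfc(abs(z) / math.sqrt(2))

# Same tiny difference (0.2 percentage points) at three sample sizes.
for n in (1_000, 100_000, 10_000_000):
    diff, p = z_test_two_proportions(0.300, 0.302, n)
    flag = "significant" if p < 0.05 else "not significant"
    print(f"n per arm = {n:>10,}: diff = {diff:.1%}, p = {p:.4f} ({flag})")
```

The difference never changes; only the p-value does. At ten million users per arm the 0.2-point lift is overwhelmingly "significant", and it is still a 0.2-point lift.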

The psychologist Jacob Cohen built much of the modern vocabulary here. In Statistical Power Analysis for the Behavioral Sciences (1988) and his widely cited "A Power Primer" (1992), he offered rough conventions for one standardised effect size, Cohen's d (roughly 0.2 for a small effect, 0.5 for medium, 0.8 for large), as a way to talk about magnitude rather than mere existence. Cohen was emphatic that these were conventions of last resort, to be used "with much diffidence," not laws of nature. The honest version: prefer the effect in your own units (percentage points, dollars, minutes saved) whenever you can, and reach for a standardised number only to compare across very different measures. What to ask in the room: "how big, in units I run the business on?" before "is it significant?" A 0.3-point lift on a five-point satisfaction scale needs translating into something a budget owner can weigh.
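For reference, Cohen's d is the difference between two group means divided by their pooled standard deviation. The sketch below computes it for a 0.3-point lift on a five-point scale; the individual scores are hypothetical, invented only so the arithmetic has something to run on.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.fmean(group_b) - statistics.fmean(group_a)) / pooled_sd

# Hypothetical five-point satisfaction scores before and after a change.
before = [3, 4, 3, 2, 4, 3, 5, 3, 4, 3]
after = [4, 4, 3, 3, 4, 4, 5, 3, 4, 3]

raw_lift = statistics.fmean(after) - statistics.fmean(before)
d = cohens_d(before, after)
print(f"Raw lift: {raw_lift:.1f} points; Cohen's d: {d:.2f}")
```

On these made-up scores d comes out near 0.4, between Cohen's "small" and "medium". That label helps when comparing across unlike measures, but the 0.3 raw points is still the number that has to be translated into retention or revenue.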

This is also where correlation versus causation quietly bites: a large, tight effect in an observational dataset can still be driven by something you didn't measure. Effect size tells you the difference is big; it never tells you what caused it.
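A small simulation makes the hidden-cause trap visible. In this sketch (all values synthetic, names hypothetical) an unmeasured driver pushes two metrics that have no effect on each other. They still correlate strongly until the driver is removed.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(3)
# Hidden driver (say, seasonality) pushes both metrics; neither affects the other.
hidden = [rng.gauss(0, 1) for _ in range(5000)]
metric_a = [h + rng.gauss(0, 0.5) for h in hidden]
metric_b = [h + rng.gauss(0, 0.5) for h in hidden]

raw_r = pearson_r(metric_a, metric_b)
# Subtract the hidden driver from both metrics and the association disappears.
adjusted_r = pearson_r(
    [a - h for a, h in zip(metric_a, hidden)],
    [b - h for b, h in zip(metric_b, hidden)],
)
print(f"r between the two metrics: {raw_r:.2f}")
print(f"r after removing the hidden driver: {adjusted_r:.2f}")
```

In real data you rarely get to subtract the hidden driver, because you didn't measure it. That is the whole problem, and it is why a big observational effect still needs a causal sanity check.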

flowchart TD
    A(["Your team reports a result"]) --> B{"Is the difference statistically significant?"}
    B -->|"No"| C(["Treat as no clear signal yet, gather more data"])
    B -->|"Yes"| D{"How wide is the confidence interval?"}
    D -->|"Wide / crosses your decision line"| C
    D -->|"Tight"| E{"Is the effect size big enough to matter in business units?"}
    E -->|"No"| F(["Real but trivial, don't spend on it"])
    E -->|"Yes"| G(["Worth acting on, sanity-check the cause"])

Three gates a number should pass before it changes a decision. Leaders Loop
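The three gates can be written down as a few lines of Python. This is one possible reading of the flowchart, not a standard procedure: it assumes bigger effects are better, treats an interval as "wide" when it crosses your decision line, and the function name and parameters are made up for the sketch.

```python
def three_gates(p_value, ci_low, ci_high, decision_line, alpha=0.05):
    """Classify a result using the three gates from the flowchart.

    Assumes bigger effects are better and that all values share one unit.
    decision_line is the smallest effect that would justify acting.
    """
    if p_value >= alpha:
        return "No clear signal yet: gather more data"
    if ci_low < decision_line < ci_high:
        return "Interval crosses the decision line: gather more data"
    if ci_high <= decision_line:
        return "Real but trivial: don't spend on it"
    return "Worth acting on: sanity-check the cause"

# Hypothetical results, with a decision line of 5 percentage points.
print(three_gates(p_value=0.001, ci_low=10, ci_high=14, decision_line=5))
print(three_gates(p_value=0.120, ci_low=-3, ci_high=27, decision_line=5))
print(three_gates(p_value=0.001, ci_low=0.5, ci_high=1.5, decision_line=5))
print(three_gates(p_value=0.020, ci_low=2, ci_high=9, decision_line=5))
```

The useful part is being forced to name `decision_line` before looking at the result. Without it, gates two and three have nothing to be measured against.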

R and R-squared: how related, and how much explained

When the question shifts from "did A beat B?" to "do these two things move together?", you're in the territory of correlation, written r. It runs from −1 (as one rises, the other falls in perfect lockstep) through 0 (no linear relationship) to +1 (rise together perfectly). An r of 0.6 between ad spend and sign-ups says they tend to move together, fairly but not perfectly.
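Here is r computed from scratch, as a sketch with invented data (the weekly ad-spend and sign-up figures are hypothetical). The last case is a reminder of what "no linear relationship" means: a perfect U-shaped dependence still scores an r of zero.

```python
import math

def pearson_r(xs, ys):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / spread

ad_spend = [10, 12, 9, 15, 11, 14, 13, 8]            # hypothetical, in thousands
sign_ups = [120, 150, 135, 160, 110, 170, 125, 115]  # hypothetical weekly counts

r_ads = pearson_r(ad_spend, sign_ups)
r_lockstep = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
xs = [-3, -2, -1, 0, 1, 2, 3]
r_u_shape = pearson_r(xs, [x * x for x in xs])

print(f"ad spend vs sign-ups: {r_ads:.2f}")
print(f"perfect lockstep: {r_lockstep:.2f}, perfect inverse: {r_inverse:.2f}")
print(f"U-shape (y = x squared): {r_u_shape:.2f}")
```

The U-shape is why a low r should prompt a plot rather than the conclusion "unrelated".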

R-squared (the coefficient of determination) answers a sharper question: of all the variation in the thing you care about, what share does your model account for? In the simplest case, a straight-line fit with one predictor, it is literally r squared, so an r of 0.6 gives an R-squared of 0.36: the model explains about 36% of the ups and downs and, just as importantly, leaves 64% to other forces. For a standard least-squares fit, R-squared on the data the model was fitted to runs from 0 to 1, and bigger is not automatically better: a high R-squared on past data can mean a model that has memorised noise rather than learned a pattern, and "explained" here is a statistical phrase, not a claim about cause.
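The "share of variation" reading comes from the usual definition: one minus the ratio of leftover (residual) variation to total variation. The sketch below fits a one-predictor least-squares line to hypothetical ad-spend data and checks that this definition matches r squared in that simple case. The data and function names are invented for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def r_squared(ys, predictions):
    """1 - (variation left over after the model) / (total variation)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )

ad_spend = [10, 12, 9, 15, 11, 14, 13, 8]            # hypothetical
sign_ups = [120, 150, 135, 160, 110, 170, 125, 115]  # hypothetical

intercept, slope = fit_line(ad_spend, sign_ups)
r2 = r_squared(sign_ups, [intercept + slope * x for x in ad_spend])
r = pearson_r(ad_spend, sign_ups)
print(f"r = {r:.2f}, r squared = {r * r:.2f}, R-squared from the fit = {r2:.2f}")
print(f"Share left unexplained: {1 - r2:.0%}")
```

With more than one predictor the shortcut of squaring a single r no longer applies, but the one-minus-leftover definition still does.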

So the move is: read R-squared as "how much of the story this model is telling," then immediately ask what the other slice is. A model that explains 36% of revenue swings is not wrong; it's a reminder that most of the swing lives elsewhere, which is often the more useful finding.

flowchart LR
    A(["Total variation in the outcome (e.g. weekly sign-ups)"]) --> B(["Explained by the model, R-squared"])
    A --> C(["Unexplained, everything else (1 minus R-squared)"])
    B --> D(["The lever you can describe and maybe pull"])
    C --> E(["Seasonality, competitors, luck, things you didn't measure"])

R-squared splits the variation you can account for from the variation you can't.

The limitation worth naming honestly

Every number here can be identical across datasets that tell opposite stories. In 1973 the statistician Francis Anscombe built four small datasets, now known as Anscombe's quartet, with nearly identical means, variances, correlations and regression lines, yet wildly different shapes when plotted: one a clean trend, one a curve, one a straight line wrecked by a single outlier, and one where a single extreme point creates the entire relationship. The lesson holds for everything above: a confidence interval, an effect size and an R-squared are summaries, and summaries hide shape. Before you trust the numbers, look at the picture.
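You can verify the quartet's matching summaries yourself. The values below are the commonly reproduced figures from Anscombe's 1973 paper; the `summarise` helper is just a convenience written for this sketch.

```python
import math

X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarise(xs, ys):
    """Mean and variance of y, correlation, and the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }

summaries = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, s in summaries.items():
    print(f"{name:>3}: mean_y={s['mean_y']:.2f} var_y={s['var_y']:.2f} "
          f"r={s['r']:.3f} line: y = {s['intercept']:.2f} + {s['slope']:.2f}x")
```

All four rows print nearly the same summary. Only a scatter plot of each dataset reveals how different they are.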

A worked example

The figures below are illustrative, to show the reasoning, not real results.

A support team trials a new help-centre design. After two weeks: tickets per 100 visits drop from 8.0 to 6.8, a 15% reduction, and "significant" at p < 0.05. The room wants to ship it everywhere.

You ask the three questions. Confidence interval: the reduction is 1.2 fewer tickets per 100 visits, 95% CI from 0.2 to 2.2. Real, but the low end (0.2) is close to nothing; the honest range is "somewhere between barely worth it and genuinely good." Effect size: at your volume, the midpoint saves roughly 1,400 tickets a month, about 1.5 full-time agents' worth. That clears the bar. Relationship check: someone notes the trial overlapped a quiet holiday fortnight. You pull a regression: time-of-year alone has an R-squared of 0.4 against ticket volume. Seasonality explains a big slice of the swing, so part of your "win" may be the calendar, not the design.
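The arithmetic behind those three questions fits in a few lines. This sketch uses only the illustrative figures above. Scaling the monthly saving linearly across the interval, and backing a standard error out of the interval with the usual 1.96 multiplier, are assumptions made here, not steps the example spells out.

```python
import math

before, after = 8.0, 6.8          # tickets per 100 visits
ci_low, ci_high = 0.2, 2.2        # 95% CI for the reduction
midpoint_tickets_saved = 1400     # per month, as stated in the example
midpoint_agents = 1.5             # full-time agents, as stated in the example

reduction = before - after
relative_reduction = reduction / before

# Standard error implied by the interval, then the matching two-sided p-value.
implied_se = (ci_high - ci_low) / (2 * 1.96)
implied_p = math.erfc(abs(reduction / implied_se) / math.sqrt(2))

def monthly_impact(reduction_per_100):
    """Scale the stated midpoint saving linearly to another point in the interval."""
    scale = reduction_per_100 / reduction
    return midpoint_tickets_saved * scale, midpoint_agents * scale

print(f"Reduction: {reduction:.1f} per 100 visits ({relative_reduction:.0%}), "
      f"implied p = {implied_p:.3f}")
for label, value in (("low end", ci_low), ("midpoint", reduction), ("high end", ci_high)):
    tickets, agents = monthly_impact(value)
    print(f"{label}: about {tickets:,.0f} tickets a month, {agents:.2f} agents")
```

Read this way, the low end of the interval is worth about a quarter of an agent and the high end nearly three. That spread, not the 1.5 at the midpoint, is what the decision has to survive.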

The decision isn't "ship" or "kill." It's "run it four more weeks past the holiday to shrink the interval and rule out seasonality, then ship if the low end of the range still pays for itself." Same data, far better call, because the numbers were read as a conversation, not a verdict.
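How much would four more weeks shrink the interval? As a rough guide, the width of a confidence interval scales with one over the square root of the sample size. The sketch below applies that rule to the illustrative numbers, using weeks as a stand-in for sample size. That assumes steady weekly traffic and unchanged variability, neither of which the example guarantees.

```python
import math

def projected_half_width(current_half_width, current_weeks, future_weeks):
    """Scale a CI half-width by sqrt(current_n / future_n), with weeks as a proxy for n."""
    return current_half_width * math.sqrt(current_weeks / future_weeks)

estimate = 1.2                      # fewer tickets per 100 visits
half_width_now = (2.2 - 0.2) / 2    # from the two-week interval in the example

for label, weeks in (("4 post-holiday weeks only", 4), ("all 6 weeks pooled", 6)):
    hw = projected_half_width(half_width_now, 2, weeks)
    print(f"{label}: roughly {estimate - hw:.2f} to {estimate + hw:.2f} "
          "if the estimate holds at 1.2")
```

Under those assumptions the low end moves from 0.2 to around 0.5 or 0.6, provided the estimate itself doesn't fall once the holiday effect is gone. That proviso is exactly what the extra weeks are meant to test.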

Frequently asked questions

Is a 95% confidence interval a 95% chance the truth is inside it?

No, that's the most common misreading, and even scientists trip on it. It means that if you repeated the study many times, about 95% of the intervals you'd construct would contain the true value. Day to day, treat it as your honesty range: the set of answers consistent with what you observed. The practical takeaway is the same either way: a wide interval means you don't know much yet.

If a result is statistically significant, isn't it important?

Not necessarily. Significance says a difference is unlikely to be pure chance; effect size says whether it's big enough to act on. With a large sample, even a trivial difference clears the significance bar. Always pair the two. The American Statistical Association's 2016 statement makes exactly this point: a small p-value doesn't imply a large or important effect.

What counts as a "good" R-squared?

It depends entirely on the field. Predicting a machine's output from its settings might demand 0.95; when explaining human behaviour, an R-squared of 0.3 can be genuinely useful. Chasing a high number invites overfitting: a model that explains the past beautifully and the future not at all. Ask "explains enough for the decision I'm making?" rather than hunting for a magic threshold.
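Overfitting is easy to demonstrate with a deliberately extreme model. In this sketch (synthetic data, hypothetical names) a "memoriser" simply replays the nearest training point, noise and all, and is compared with a plain straight-line fit on both the data it saw and fresh data from the same process.

```python
import random

def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_memoriser(xs, ys):
    """1-nearest-neighbour lookup: replays the closest training point exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda pair: abs(pair[0] - x))[1]

def make_data(rng, n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 6) for x in xs]  # modest signal, lots of noise
    return xs, ys

rng = random.Random(11)
train_x, train_y = make_data(rng, 300)
test_x, test_y = make_data(rng, 300)

scores = {}
for name, fit in (("straight line", fit_line), ("memoriser", fit_memoriser)):
    model = fit(train_x, train_y)
    scores[name] = (
        r_squared(train_y, [model(x) for x in train_x]),
        r_squared(test_y, [model(x) for x in test_x]),
    )
    print(f"{name:>13}: R-squared on past data = {scores[name][0]:.2f}, "
          f"on new data = {scores[name][1]:.2f}")
```

The memoriser scores a perfect 1.0 on the past and does worse than the humble line on new data. Out-of-sample R-squared can even dip below zero, meaning the model predicts worse than simply guessing the average.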

What's the difference between R and R-squared in one line?

R is the direction and tightness of a relationship (−1 to +1); R-squared is the share of variation your model accounts for (0 to 1). R tells you they move together; R-squared tells you how much of the movement you've actually pinned down.

Can these numbers prove one thing causes another?

No. A strong correlation, a big effect size and a high R-squared can all sit on top of a hidden common cause or a coincidence. These tools quantify how much and how sure; establishing why needs a controlled experiment or careful design. See correlation vs causation.

Related in the Toolkit

Data types (discrete/continuous, categorical/ordinal): which statistic is even valid depends on what kind of number you're holding.
Descriptive statistics (mean, median, mode, variance, SD): confidence intervals and effect sizes are built directly from spread and standard deviation.
Distributions, percentiles & quartiles: the shape behind the summary; why two datasets with the same average aren't the same.
Correlation vs causation: R and R-squared measure association, never cause; this is the guardrail.
Regression (linear, non-linear, logistic): where R-squared comes from, and how the underlying model is fitted.
First principles vs heuristics vs analogical reasoning: Cohen's small/medium/large are heuristics; know when to drop them for first-principles units.
Reversible vs irreversible decisions: a wide confidence interval matters far more when the call can't be undone.
Jobs-to-be-Done & needs research: the qualitative why behind the quantitative how-much.

Where to go next

Geoff Cumming, Understanding the New Statistics (Routledge, 2012): the readable case for leading with confidence intervals and effect sizes instead of p-values.
The ASA statement on p-values (Wasserstein & Lazar, The American Statistician, 2016): six short principles, free to read, that reset how to interpret significance.
Jacob Cohen, "The earth is round (p < .05)" (1994, PDF): the classic, witty takedown of significance-as-ritual and the argument for estimation.
Geoff Cumming, "Dance of the p-values" (YouTube): a five-minute animation showing how unstable p-values are across repeats, and why intervals behave better.
StatQuest, "R-squared, Clearly Explained" (YouTube): Josh Starmer's plain-language build-up of what R-squared actually measures.