A dashboard is glowing green. The survey came back at 82% satisfied. The pilot "worked." Every one of those statements can be true and useless at the same time: measured precisely, repeatedly, and about the wrong thing. The job of this guide is to give you the three checks that separate evidence you can bet on from evidence that merely looks like it.
The quick version
- Validity = are you measuring the thing you actually care about? (Right target.)
- Reliability = would you get the same answer if you measured again? (Steady aim.)
- Bias = a systematic tilt that pushes every answer the same wrong way. It does not cancel out, no matter how big your sample gets.
- A measure can be reliable without being valid. It cannot be valid without being reliable. And bias can quietly poison both.
The idea in depth
Picture an archer. Reliability is how tightly the arrows cluster: hit the same spot every time and your aim is consistent. Validity is whether that cluster sits on the bullseye. You can be reliably wrong: a tight group in the bottom-left corner is consistent and useless. That single image carries almost everything a leader needs, and it maps cleanly onto the work. A team can report the same metric every Monday to two decimal places (perfectly reliable) while that metric has nothing to do with whether customers are getting value.
The formal version of this distinction is old and well-settled. Lee Cronbach and Paul Meehl set out the modern theory of construct validity in their 1955 paper in Psychological Bulletin, arguing that when you measure something abstract ("engagement," "leadership potential," "customer effort") you are never measuring it directly. You are measuring a proxy and inferring the real thing, so the validity question is whether that inference holds up against everything else you know. Reliability has an equally concrete tool: Cronbach's alpha, an internal-consistency coefficient that asks whether the items in a multi-question measure hang together, as they should if they point at one underlying idea. Alpha checks consistency across items; consistency across time is the separate test-retest question.
So the move is: before you trust a score, ask the two questions in order. First, "if we ran this again next week, would we get roughly the same answer?" (reliability). Only if yes, "is this score actually about the outcome we care about, or a convenient stand-in?" (validity). A pretty number that fails the first question is noise; one that fails the second is a well-aimed mistake.
flowchart LR
A(["Take a measurement"]) --> B{"Same answer if we measure again?"}
B -- No --> C(["Unreliable: it's noise. Don't act on it"])
B -- Yes --> D{"Is it actually about the outcome we care about?"}
D -- No --> E(["Reliable but invalid: a well-aimed mistake"])
D -- Yes --> F(["Reliable AND valid: evidence you can bet on"])
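To make the reliability tool concrete, here is a minimal Python sketch of Cronbach's alpha, using the standard formula: alpha = k/(k-1) × (1 − sum of item variances / variance of respondents' total scores), where k is the number of items. The function name `cronbach_alpha` and both sets of responses are illustrative assumptions, not data from any real survey.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != k for row in responses):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item (column) across respondents.
    item_variance_sum = sum(variance(column) for column in zip(*responses))
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up data: every respondent gives the same score on all three items.
CONSISTENT = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
# Made-up data: the three items are mutually uncorrelated.
UNRELATED = [[1, 2, 3], [2, 4, 2], [3, 1, 2], [4, 3, 3]]

if __name__ == "__main__":
    print(f"consistent items: alpha = {cronbach_alpha(CONSISTENT):.2f}")
    print(f"unrelated items:  alpha = {cronbach_alpha(UNRELATED):.2f}")
```

In the first data set the items move in lockstep, so alpha comes out at 1; in the second they are uncorrelated, so alpha is about 0. Alpha only tells you whether the items agree with each other. Whether they are about the right thing is still the validity question.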
Validity has more than one flavour, and they fail differently
"Valid" is not a single switch. The framework most researchers reach for comes from Donald Campbell and his successors; its fullest statement is Shadish, Cook and Campbell's 2002 book Experimental and Quasi-Experimental Designs for Generalized Causal Inference, which splits validity into four kinds (statistical conclusion, internal, construct and external). Two of them matter most to a working leader. Internal validity asks whether the cause you credit is really the cause: did the new onboarding flow lift retention, or did you also change pricing that month? External validity asks whether a finding travels: your pilot worked with thirty hand-picked enthusiasts in one region, but will it survive contact with the indifferent majority? These two are often in tension: the tightly controlled test that nails internal validity (clean cause) is frequently the one that generalises worst (poor external validity). A randomised experiment in an artificial setting and a messy roll-out in the real world are each strong where the other is weak.
So decide which validity you actually need before you design anything. If the question is "does X cause Y," protect internal validity: hold other things constant, keep a control group, change one thing at a time. If the question is "will this work at scale," protect external validity instead: test in conditions that resemble the real world, with people who resemble real customers. As Shadish and colleagues stress, validity is a property of an inference, not of a method; the same clean experiment can support a valid claim about cause and an invalid claim about rollout. The honest limitation: you rarely get both at full strength in one study, which is why a single "it worked" pilot should make you ask "worked at what, and would it travel?" rather than reach for the rollout button.
flowchart TD
Q(["What do you need to know?"]) --> I{"Cause, or scale?"}
I -- "Does X cause Y?" --> IV(["Internal validity: control group, hold other things constant"])
I -- "Will it work at scale?" --> EV(["External validity: realistic conditions, representative people"])
IV --> T(["Trade-off: the cleaner the test, the less it tends to generalise"])
EV --> T
Bias is the tilt that more data won't fix
Random error is annoying but democratic: it scatters in all directions and averages out as your sample grows. Bias is different: it is a systematic push in one direction, so collecting more data just gives you a more confident wrong answer. Two families of bias do most of the damage in organisations. The first lives in how you gather evidence. Confirmation bias, a thread running through Daniel Kahneman's Thinking, Fast and Slow (2011) and his earlier work with Amos Tversky, is the tendency to seek, notice and believe what fits what you already think. Kahneman's shorthand, WYSIATI ("what you see is all there is"), names the deeper trap: the mind builds a confident story from the evidence in front of it and never accounts for the evidence it never went looking for. The second family lives in how people answer. Social-desirability bias means respondents tell you what looks good, so your engagement survey measures willingness to admit disengagement at least as much as disengagement itself.
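The difference between random error and bias can be checked with a small simulation. This is a sketch under stated assumptions: the function name `mean_error`, the true value of 50, the noise level of 10 and the 3-point bias are all hypothetical numbers chosen for illustration.

```python
import random
from statistics import fmean


def mean_error(bias, n, true_value=50.0, noise_sd=10.0, seed=0):
    """Error of the average of n noisy readings that all share the same bias.

    Each reading is true_value + bias + random noise. All numbers are
    hypothetical and exist only to illustrate the point.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    readings = (true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n))
    return fmean(readings) - true_value


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        unbiased = mean_error(bias=0.0, n=n)
        biased = mean_error(bias=3.0, n=n)
        print(f"n={n:>7}: noise only {unbiased:+.2f}   noise + bias {biased:+.2f}")
```

With noise alone, the error should shrink toward zero as n grows. With the shared 3-point tilt it should settle near +3 instead, however large n gets: the estimate becomes more precise without becoming more correct.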
The countermove is to build disconfirmation into the process rather than relying on willpower. Before you read the results, write down what would change your mind, then go looking for that specifically. Assign someone to argue the opposite. For self-report data, make answers genuinely anonymous and ask about specific behaviours ("in the last two weeks, did you…") rather than attitudes people feel judged on. There is a structural cousin worth naming too: Goodhart's law, sharpened by Marilyn Strathern in 1997 into "when a measure becomes a target, it ceases to be a good measure." Tie bonuses to a survey score and you have quietly trained people to game the survey: a beautifully reliable measure that stopped being valid the moment it started to matter. The limitation to sit with honestly: you cannot eliminate bias, only surface and bound it. The replication crisis (the 2015 Reproducibility Project reproduced a clear effect in only about a third of 100 published psychology studies) is what happens when a whole field's incentives quietly favour the flattering result.
"When a measure becomes a target, it ceases to be a good measure." (Marilyn Strathern, 1997)
A worked example
A support team rolls out an AI reply-assistant and wants to know if it helped. The first dashboard says average handle time fell 18% and the post-chat CSAT score is 4.6 out of 5 (all figures here are illustrative). Leadership is ready to expand it everywhere. Run the three checks first.
Reliability: is CSAT steady, or does it swing wildly week to week on a handful of responses? If only 6% of chats leave a rating and the number jumps half a point on volume alone, you have an unreliable measure; stop reading meaning into it.
Validity: handle time is reliable and easy, but is it the right target? Faster replies that quietly push customers to give up and self-serve will improve handle time while making service worse: a textbook reliable-but-invalid metric, and a Goodhart trap waiting to happen the moment agents are rewarded on it. The outcome you actually care about (did the customer's problem get solved?) is harder to measure, which is exactly why it gets dropped.
Bias: the rollout went to the team's keenest agents first (a confirmation-friendly selection problem: they'd have improved with any new toy, and they are not representative of the wider team), and the CSAT survey fires immediately after chat, when relief is highest and the unhappy have already left (social-desirability and survivorship tilt).
The fix isn't a bigger sample. It's a fairer test: a control group of comparable agents, a resolution measure rather than just a speed measure, and a follow-up survey a day later from a neutral sender. The decision changes from "expand it" to "expand it for these cases, watch resolution, and re-check in a month."
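Why a control group of comparable agents matters can be sketched with a short simulation. Everything in it is a hypothetical assumption made for illustration: the function name `estimated_effect`, the "keenness" trait, the true tool effect of 1.0, and the premise that keen agents improve whether or not they have the tool.

```python
import random
from statistics import fmean


def estimated_effect(assignment, n_agents=2000, true_effect=1.0, seed=0):
    """Treated-minus-control difference in a made-up improvement score.

    assignment: "keenest_first" gives the tool to the keener half of agents;
                "randomised" gives it to each agent on a coin flip.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    keenness = [rng.gauss(0.0, 1.0) for _ in range(n_agents)]

    if assignment == "keenest_first":
        cutoff = sorted(keenness)[n_agents // 2]
        treated = [k >= cutoff for k in keenness]
    elif assignment == "randomised":
        treated = [rng.random() < 0.5 for _ in keenness]
    else:
        raise ValueError(f"unknown assignment: {assignment!r}")

    # Assumption: keen agents improve regardless of the tool (2.0 * keenness);
    # the tool itself adds true_effect; the rest is random noise.
    outcomes = [
        2.0 * k + (true_effect if t else 0.0) + rng.gauss(0.0, 1.0)
        for k, t in zip(keenness, treated)
    ]
    treated_mean = fmean(o for o, t in zip(outcomes, treated) if t)
    control_mean = fmean(o for o, t in zip(outcomes, treated) if not t)
    return treated_mean - control_mean


if __name__ == "__main__":
    print(f"keenest first: {estimated_effect('keenest_first'):.2f}")
    print(f"randomised:    {estimated_effect('randomised'):.2f}")
```

Under these assumptions the keenest-first comparison should report an effect several times larger than the true 1.0, because it credits the tool with what keenness would have produced anyway. The randomised comparison should land close to 1.0. Adding more agents to the keenest-first design would not close that gap.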
Frequently asked questions
What is the difference between validity and reliability?
Reliability is consistency: the same answer if you measure again. Validity is correctness: you're measuring the thing you actually care about. Reliability is the steady aim; validity is hitting the bullseye. You need both, and you need reliability first, because an inconsistent measure can't be valid about anything.
Can a measure be reliable but not valid?
Yes, and it's the trap that catches smart teams. A scale that always reads three kilos heavy is perfectly reliable and completely wrong. Most "vanity metrics" are exactly this: easy to measure precisely every week, and not actually about the outcome. Precision is not the same as relevance.
How big does my sample need to be for the research to be valid?
Bigger is not the lever you think it is. A biased sample of ten thousand is worse than a representative sample of two hundred, because more data only sharpens a systematic tilt. Spend your effort on who you ask and how you ask before you worry about how many. (See survey & sampling design for the specifics.)
Does any of this apply to qualitative research?
Yes, in different words. Qualitative work talks about credibility, dependability and confirmability rather than validity and reliability, but the questions are the same: is this trustworthy, would another careful reader reach a similar interpretation, and whose perspective got left out? See qualitative vs quantitative vs mixed methods.
What's the single most useful habit to adopt?
Before you look at any result, write down what would change your mind, then go find that evidence first. It costs two minutes and it's the cheapest available defence against confirmation bias and a flattering-but-wrong conclusion.
Related in the Toolkit
- Qualitative vs quantitative vs mixed methods: the same three checks, translated into the language each tradition uses.
- Survey & sampling design: where sampling bias and leading questions sneak in before you've collected a single answer.
- Interview & ethnographic techniques: how to ask without putting the answer in the respondent's mouth.
- Experiment design (RCTs, A/B testing, quasi-experiments): the control groups and randomisation that protect internal validity.
- Jobs-to-be-Done & needs research: making sure you're measuring the outcome customers care about, not a convenient proxy.
- First principles vs heuristics vs analogical reasoning: the thinking shortcuts that quietly create bias.
- Reversible vs irreversible decisions: how much evidence rigour a call actually warrants.
- Descriptive statistics (mean, median, mode, variance, SD): reading the variance that tells you whether a measure is reliable.
Where to go next
- Cronbach & Meehl, "Construct Validity in Psychological Tests" (1955): the seminal, free full text on why measuring an abstract thing is always an inference. Dense but foundational.
- Daniel Kahneman, Thinking, Fast and Slow (2011): the definitive popular account of confirmation bias, WYSIATI and how confident judgement misleads us.
- Veritasium, "Is Most Published Research Wrong?" (YouTube): a clear 12-minute talk on p-hacking, publication bias and the replication crisis. The best single primer on how bias scales.
- Open Science Collaboration, "Estimating the Reproducibility of Psychological Science," Science (2015): the landmark replication study itself, if you want the primary evidence.