Almost every manager believes they are a good judge of people. The uncomfortable finding from a century of selection research is that the open-ended interview we lean on to make that judgement tells us far less than we think, and that a handful of duller-looking methods quietly outperform it. Recruiting and assessing talent well is less about reading the room and more about building a process that resists how easily a room can fool you.

The quick version

  • Recruiting is attracting and sourcing candidates; assessing is predicting which of them will actually do the job well. They are different problems, and most hiring pain comes from being good at the first and casual about the second.
  • Decades of meta-analysis put structured interviews, cognitive-ability tests and work-sample tests near the top for predicting performance, and the unstructured "let's just chat" interview well below them.
  • "Structured" means every candidate gets the same questions, scored against the same rubric. That one change is the highest-leverage move available to most hiring teams, and it costs nothing but discipline.
  • No single method is decisive. Combine a few that measure different things, decide your scoring before you meet anyone, and treat the interview as evidence to weigh, not a verdict to defend.

The idea in depth: most of what we trust in hiring is noise

The foundational evidence here is a meta-analysis by Frank Schmidt and John Hunter, "The Validity and Utility of Selection Methods in Personnel Psychology" (Psychological Bulletin, 1998), which pooled 85 years of research to ask a blunt question: of all the things we use to choose people, which ones actually predict how well they later perform? It reported general mental ability (cognitive ability) and work-sample tests among the strongest single predictors and, crucially, found that structured interviews substantially outpredicted unstructured ones. The combinations that worked best paired a cognitive measure with a work sample, an integrity test or a structured interview.

That study became gospel, and it is worth knowing the gospel has since been revised. A 2022 reappraisal by Paul Sackett and colleagues, "Revisiting Meta-Analytic Estimates of Validity in Personnel Selection" (Journal of Applied Psychology), argued that older estimates were inflated by an over-aggressive statistical correction for range restriction, and recalculated many of them downward, often by around 0.1 to 0.2 on the correlation scale. The headline reshuffle: structured interviews came out top, and cognitive-ability tests slipped behind job-knowledge and work-sample tests. The two studies disagree on the exact figures, which is the honest state of the field, but they agree on the practical message that structured, job-related methods beat the unstructured interview, and that is what you act on.

The interview you remember most vividly is rarely the one that predicted best. Vividness and validity are not the same thing.

The practical shift is to stop ranking methods by how confident they make you feel and start ranking them by what the evidence says they predict. The unstructured interview survives not because it works but because it feels like it works: it is sociable, it flatters our belief that we read people well, and a single charismatic exchange overwrites a stack of duller data. Replace "did I like them?" with "did they clear the same bar everyone else was held to?" and you have already removed most of the damage a bad hiring process does.

What "structured" actually means, and why it bites

Structure is not a synonym for "rigid" or "robotic." It means three concrete things: every candidate for a role is asked the same core questions; those questions are tied to the competencies the job actually needs; and answers are scored against a defined rubric, a scale where you write down, in advance, what a weak, adequate and strong answer looks like. Google's people-analytics team landed on exactly this after reviewing its own hiring at scale. Its re:Work guide to structured interviewing documents the shift to consistent questions and behaviourally anchored rating scales, and former HR chief Laszlo Bock makes the same case in Work Rules! (2015): the brainteasers and free-form chats Google once prized predicted nothing, while standardised questions predicted success.

flowchart TD
  A(["Define the job:<br>3–5 real competencies"]) --> B(["Write the same questions<br>for every candidate"])
  B --> C(["Write the rubric first:<br>weak / ok / strong answers"])
  C --> D(["Each interviewer scores<br>independently, then meets"])
  D --> E(["Decide on the evidence,<br>not the loudest opinion"])
A structured interview is a pipeline you build before you meet anyone; the order matters as much as the questions. Leaders Loop

Two question types do the heavy lifting. Behavioural questions ask what someone actually did ("Tell me about a time you shipped something late. What happened?"), on the logic that past behaviour forecasts future behaviour. Situational questions pose a realistic dilemma the role will throw up ("A key stakeholder rejects your plan the day before launch. Walk me through your next hour."). Both beat "so, tell me about yourself," because both force a candidate onto the ground the job is actually played on.

So the move, if you do nothing else this quarter: pick your next open role, write four or five questions mapped to what the job truly demands, write down what a good answer to each contains, and ask every candidate the same set. It is unglamorous and it works, which is roughly the whole theme of this topic.
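If it helps to see that discipline spelled out, the steps above reduce to a small completeness check: every competency has a question, every question names its competency and type, and every question has weak, adequate and strong anchors written before anyone is interviewed. The Python below is a minimal sketch under our own assumptions, not a feature of any hiring tool; the `Question` record, the three anchor levels, the `check_rubric` helper and the example competencies and anchor wording are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass, field

# The three anchors every question needs before any interview happens.
LEVELS = ("weak", "adequate", "strong")
QUESTION_TYPES = ("behavioural", "situational")


@dataclass
class Question:
    text: str
    competency: str  # which job competency this question probes
    kind: str        # "behavioural" or "situational"
    anchors: dict = field(default_factory=dict)  # level -> what that answer contains


def check_rubric(competencies, questions):
    """List what is still missing; an empty list means the rubric is complete."""
    problems = []
    covered = {q.competency for q in questions}
    for c in competencies:
        if c not in covered:
            problems.append(f"no question for competency: {c}")
    for q in questions:
        if q.competency not in competencies:
            problems.append(f"not tied to a listed competency: {q.text}")
        if q.kind not in QUESTION_TYPES:
            problems.append(f"unknown question type {q.kind!r}: {q.text}")
        missing = [level for level in LEVELS if not q.anchors.get(level)]
        if missing:
            problems.append(f"missing anchors {missing}: {q.text}")
    return problems


if __name__ == "__main__":
    # Hypothetical competencies, questions and anchors, for illustration only.
    competencies = ["delivering under pressure", "handling stakeholder pushback"]
    questions = [
        Question(
            "Tell me about a time you shipped something late. What happened?",
            "delivering under pressure",
            "behavioural",
            {
                "weak": "blames others, no lesson drawn",
                "adequate": "owns the delay and explains it",
                "strong": "owns it, explains it, and changed how they plan",
            },
        ),
        Question(
            "A key stakeholder rejects your plan the day before launch. "
            "Walk me through your next hour.",
            "handling stakeholder pushback",
            "situational",
            {"weak": "escalates immediately", "adequate": "seeks the real objection"},
        ),
    ]
    for problem in check_rubric(competencies, questions):
        print(problem)
```

Run as written, it reports the one missing "strong" anchor on the situational question. The point of the design is timing: the gap surfaces before the first interview, not in the debrief.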

An honest limitation. Predictive validity is a statement about averages across many hires, not a guarantee about the one person in front of you. A method that "predicts performance" still misses individuals in both directions, and the published coefficients are contested enough that two respected meta-analyses a generation apart give materially different numbers. Structure also has a failure mode of its own: over-engineer it into a box-ticking script and you lose the human judgement that spots the candidate who answers an unasked, better question. Treat structure as a way to make judgement fairer and more comparable, not as a way to abolish it.

Assessing for fairness, not just for fit

"Hire for culture fit" is one of the most quietly dangerous instructions in management, because in practice it often means "hire people who remind me of me." The same unstructured interview that predicts performance poorly is also where bias does its best work: with no fixed questions and no rubric, an interviewer's impression drifts toward familiarity. The behavioural economist Iris Bohnet, in What Works: Gender Equality by Design (2016), makes the case that the fix is rarely to lecture individuals about their biases; it is to redesign the process so bias has fewer places to enter. Her widely cited example is the orchestra screen: when musicians auditioned behind a curtain, the share of women advancing rose sharply. The lesson generalises: change the system, not just the mindset.

Her practical translation for hiring, set out in "How to Take the Bias Out of Interviews" (Harvard Business Review, 2016), lines up exactly with the structured approach above: ask the same questions, score each answer before moving on, and, where you can, compare candidates on a single dimension across all of them rather than forming one global impression per person. Structure, in other words, is not only the more accurate way to hire; it is also the more equitable one. That is a rare two-for-one, and worth saying plainly: the move that makes hiring fairer is the same move that makes it work better.
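Comparing candidates "on a single dimension across all of them" is, mechanically, reading the score sheet by question rather than by person. A minimal Python sketch, assuming each answer has already been scored on a 1–5 scale; the `by_question` helper, the candidate labels and the scores are hypothetical.

```python
def by_question(scores):
    """Regroup {candidate: {question: score}} as {question: [(candidate, score), ...]}.

    Each question's list is sorted best-first, so the panel compares every
    candidate's answer to the same question side by side instead of forming
    one overall impression per person.
    """
    grid = {}
    for candidate, answers in scores.items():
        for question, score in answers.items():
            grid.setdefault(question, []).append((candidate, score))
    for question in grid:
        # Highest score first; ties broken alphabetically so the order is stable.
        grid[question].sort(key=lambda pair: (-pair[1], pair[0]))
    return grid


if __name__ == "__main__":
    # Invented scores for three hypothetical candidates.
    scores = {
        "Candidate A": {"Q1": 4, "Q2": 2},
        "Candidate B": {"Q1": 3, "Q2": 5},
        "Candidate C": {"Q1": 4, "Q2": 3},
    }
    for question, ranking in by_question(scores).items():
        print(question, ranking)
```

In this toy data Candidate A shares the lead on Q1 but comes last on Q2, exactly the kind of split that a single overall impression of each person tends to blur.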

In practice that means designing the bias out at the points where it enters: strip identifying detail from the first screen where the role allows it, fix the questions and the rubric before anyone walks in, and have interviewers score independently before they confer, so the most senior or most confident voice in the debrief doesn't quietly anchor everyone else.

A worked example

Take a 40-person software company hiring its first dedicated customer-success manager. (Illustrative throughout: a teaching example, not a real company.) The founder's instinct is the usual one: line up four candidates, "have a proper conversation with each," and go with whoever clicks. She has made three hires this way before, and two of them didn't last six months.

Instead she runs it through the lens above. First, the job: she writes down the three competencies the role actually lives or dies on (defusing an angry customer, spotting a renewal at risk early, and translating product gaps back to engineering). Then the questions: the same four for everyone, a mix of behavioural and situational, each mapped to one of those competencies. Then, before meeting a soul, the rubric: for "defusing an angry customer," a weak answer blames the customer, an adequate one calms the call, and a strong one calms the call and extracts the root cause so it doesn't recur.

flowchart LR
  A(["4 candidates,<br>same 4 questions"]) --> B(["2 interviewers score<br>each answer 1–5,<br>independently"])
  B --> C{"Scores agree?"}
  C -->|"Yes"| D(["Strong evidence,<br>compare totals across<br>candidates"])
  C -->|"No, wide gap"| E(["Discuss the specific<br>answer, not the person"])
  E --> D
  D --> F(["Hire on the rubric,<br>not on who 'clicked'"])
The same four questions turn four unlike conversations into one comparable decision. Leaders Loop

The candidate the founder "clicked with" most turns out to score in the middle: warm, fluent, but every answer stops at calming the call and never reaches the root cause. The candidate she found a little flat scores highest, because her examples consistently close the loop. Without the rubric, charm wins and the company makes its third six-month mistake. With it, the founder has something better than a feeling: a like-for-like comparison she can defend, and a hire chosen on what the job actually needs. The interview still happened; it just stopped being the verdict and became the evidence.
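The scoring loop in this example is simple arithmetic, and writing it down forces the rules to be fixed before anyone is interviewed. The Python below is a minimal sketch under stated assumptions: two interviewers score each answer 1–5 independently, a gap of 2 or more points counts as a "wide gap" (our threshold, not a standard one), and a candidate's total is the sum of the two interviewers' averages per question. The names `score_candidates` and `GAP_THRESHOLD`, and every score shown, are hypothetical.

```python
# Assumed threshold: a gap this large means "discuss the answer before totalling".
GAP_THRESHOLD = 2


def score_candidates(panel):
    """Total each candidate's independently scored answers and flag disagreements.

    panel maps candidate -> {question: (interviewer_1_score, interviewer_2_score)},
    with every score on a 1-5 scale. Returns (totals, to_discuss), where totals
    sums the two interviewers' average per question and to_discuss lists the
    (candidate, question) pairs whose scores differ by GAP_THRESHOLD or more.
    """
    totals, to_discuss = {}, []
    for candidate, answers in panel.items():
        total = 0.0
        for question, (first, second) in answers.items():
            for score in (first, second):
                if not 1 <= score <= 5:
                    raise ValueError(f"score {score} is outside the 1-5 scale")
            if abs(first - second) >= GAP_THRESHOLD:
                to_discuss.append((candidate, question))
            total += (first + second) / 2
        totals[candidate] = total
    return totals, to_discuss


if __name__ == "__main__":
    # Invented scores echoing the story: two of the four candidates, four questions.
    panel = {
        "Candidate A (clicked)": {"Q1": (3, 3), "Q2": (3, 4), "Q3": (3, 3), "Q4": (2, 3)},
        "Candidate B (seemed flat)": {"Q1": (5, 4), "Q2": (4, 4), "Q3": (5, 5), "Q4": (4, 2)},
    }
    totals, to_discuss = score_candidates(panel)
    for name, total in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {total}")
    print("Discuss before deciding:", to_discuss)
```

In the flow above, a flagged answer would be discussed and, if the interviewers agree, re-scored before totals are compared; the code only surfaces which answers need that conversation, it does not settle them.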

Frequently asked questions

Aren't great managers just good at reading people?

Some are better than others, but the research consistently finds that everyone, including confident, experienced interviewers, predicts performance far better with structure than without it. Skill at reading people is real; it is also exactly the faculty that overweights a vivid first impression and underweights the candidate who was nervous for twenty minutes. Structure doesn't replace your judgement; it stops your judgement being hijacked.

Doesn't a rigid script kill rapport and scare off good candidates?

Structured doesn't mean robotic. You can be warm, follow up naturally, and put people at ease while still asking everyone the same core questions and scoring against the same rubric. Candidates generally experience a fair, consistent process as more respectful, not less: they were assessed on the job, not on whether they happened to share your taste in football.

What about cognitive-ability or personality tests? Should we use them?

Cognitive-ability tests are among the stronger predictors in the research and can add real signal, but they carry fairness, legal and candidate-experience considerations that vary by role and jurisdiction. Use them as one input alongside structured interviews and work samples, validate them for your actual roles, and check local employment law before you rely on them. Personality tests are popular but generally weaker predictors on their own; treat their output as a conversation-starter, not a score.

We're a small team with no budget. What's the minimum viable version?

Pick your next role, list three to five competencies it genuinely requires, write one question per competency (the same for every candidate), and jot down what a weak, adequate and strong answer looks like before you interview. Have two people score independently, then compare. That is the entire method. It costs an afternoon, and it outperforms most expensive hiring stacks built on unstructured chats.

What's the single biggest mistake hiring teams make?

Deciding in the debrief instead of in the design. When scoring criteria are invented after the interviews, or never written down at all, the loudest, most senior or most recently impressed voice sets the bar, and the "decision" is really a negotiation of impressions. Fix what "good" means before you meet anyone, and the debrief becomes a comparison of evidence rather than a contest of opinions.

Related in the Toolkit

Recruiting and assessing sit at the front of the talent lifecycle: who you attract (employer brand & talent attraction) determines the quality of the pool you assess, and the assessment mechanics here are detailed module-by-module in interviewing & selection.

Where to go next