Outliers, cleaning, normalisation & taxonomies

A team brings you a dashboard. Average deal size is up 30% this quarter, champagne, surely? Then someone notices a single enterprise contract, ten times any other, dragging the average skyward. The "trend" was one number. The dashboard wasn't lying; nobody had decided what to do with the outlier, so it decided for them.

The quick version

Outliers are unusually extreme values. The job isn't to delete them, it's to notice them and ask whether they're a glitch or a signal.
Cleaning fixes the obvious wrongness: duplicates, blanks, typos, "N/A" sitting in a number column. Unglamorous, and where most analyst time actually goes.
Normalisation puts numbers on a common scale so you can compare fairly, per customer, per pound spent, 0-to-1, instead of comparing apples to elephants.
Taxonomies are the agreed list of categories and what each one means. When two teams define "active customer" differently, the disagreement is really a taxonomy problem.

The idea in depth

These four words sound like data-engineering plumbing. They aren't. Each one is a place where a human quietly decides what counts, and that decision shapes every number that later lands on your desk. You don't need to become a statistician. You need to be the person in the room who asks the awkward question before the quarter is spent on a mirage.

Outliers: notice first, judge second

The discipline of looking at data before modelling it has a name and an author. The statistician John Tukey laid it out in Exploratory Data Analysis (Addison-Wesley, 1977), the book that gave us the box plot and a simple, durable rule for flagging extreme values. Tukey's "fences" mark a point as a potential outlier if it sits more than 1.5 times the interquartile range below the 25th percentile or above the 75th, and "far out" if it's beyond 3 times. The maths matters less than the spirit: have a consistent, boring rule for what gets flagged, so the flagging isn't done by whoever has the strongest opinion.

The rule has lasted because it makes no assumption about a tidy bell curve; it reads the shape of the data you actually have. Its honest limitation: 1.5 is a convention, not a law of nature, in a genuinely skewed distribution, like incomes or deal sizes, it will flag perfectly real values as "outliers." A flag is a prompt to look, never a verdict.

So the move is: when a metric lurches, ask "is this the average, or is this one row?" before you ask "what changed?" An outlier is sometimes an error to remove and sometimes the most important fact in the dataset, the fraud, the runaway success, the about-to-churn whale. You cannot tell which until you look at the row itself. Deleting outliers by reflex is how organisations make their data look calm and their decisions get worse.

A flag is a prompt to look, never a verdict.

Cleaning: the unglamorous 45%

Before any of this works, the data has to be trustworthy at the row level, and getting it there is most of the job. In Anaconda's 2020 State of Data Science survey of 2,360 practitioners across more than 100 countries, respondents reported spending, on average, 45% of their time loading and cleansing data before they could model or visualise anything. (You'll often hear "80%." That figure comes from older surveys and usually lumps in collecting the data too; treat the precise number as soft and the direction as solid, cleaning dominates.) Tidy, well-structured data is the foundation everything else stands on.

What does cleaning actually fix? Duplicates that double-count a customer. Blank cells silently read as zero. The same supplier spelled four ways. A date stored as text so it sorts "11 January" before "2 January." None of it is clever; all of it is consequential, because every downstream average and chart inherits the mess.

Hadley Wickham gave the cleaning craft its most useful organising idea in "Tidy Data" (Journal of Statistical Software, 2014, 59:10): each variable is a column, each observation a row, each type of thing its own table. Messy data is messy in endless ways; tidy data is all alike, and that sameness is what lets one set of tools work on everything.

So the move is: when a report surprises you, ask to see ten raw rows, not just the summary. Summaries hide the duplicates and the blanks. Raw rows confess them. Ten minutes reading rows will catch more bad analysis than an hour reviewing the conclusions.

flowchart LR
    A(["Raw data"]) --> B("Clean: dedupe, fix blanks & types")
    B --> C("Spot outliers: glitch or signal?")
    C --> D("Normalise: put on a common scale")
    D --> E("Classify: agreed taxonomy")
    E --> F(["A number you can trust"])

The four moves, in the order they protect you. Each one is a place a human decides what counts. Leaders Loop

Normalisation: comparing fairly

Normalisation is the move that lets you compare things that aren't on the same footing. Two flavours cover most of what a leader meets. Min-max scaling squeezes every value into a fixed range, usually 0 to 1, by subtracting the minimum and dividing by the spread. Standardisation (the z-score) re-expresses each value as "how many standard deviations from the average", recentering the data so its mean is 0. Both exist for the same reason: so a large-magnitude number can't dominate a comparison just by being measured in bigger units. DataCamp's guide to normalisation makes the point plainly, without rescaling, the feature with the biggest raw numbers quietly drowns out the rest.

You don't need the formulas to use the instinct. "Normalising" in plain management language means dividing by the thing that makes the comparison fair: revenue per employee, cost per acquisition, complaints per thousand orders. A region with more sales isn't more efficient if it also has triple the headcount. The raw total flatters the big; the per-unit figure tells the truth.

So the move is: whenever someone ranks teams, regions or campaigns on a raw total, ask "per what?" The honest comparison is almost always a rate, not a sum. This connects directly to the discipline of descriptive statistics, the average you quote is only as fair as the denominator underneath it.

Taxonomies: agreeing what the words mean

The most expensive data problems aren't arithmetic. They're definitional. A taxonomy is simply the agreed set of categories and the rule for what belongs in each, what counts as an "active" customer, a "qualified" lead, a "resolved" ticket. Get the taxonomy wrong and every clean, normalised number built on top of it is precisely wrong.

Geoffrey Bowker and Susan Leigh Star made the deep case for this in Sorting Things Out: Classification and Its Consequences (MIT Press, 1999). Drawing on real classification systems, diseases, occupations, even race under apartheid, they argue that categories are never neutral: they encode choices about what matters, become invisible once embedded, and "torque" the lives and decisions that flow through them. A category, once everyone stops questioning it, starts to feel like a fact of nature. It isn't.

The everyday version of this is two dashboards that disagree because Sales counts a customer "active" at first purchase and Finance counts them active only after payment clears. Neither is wrong; they're answering different questions with the same word. The fix is never a better chart, it's a shared definition, written down, that both teams point to.

So the move is: before reconciling two conflicting numbers, reconcile the two definitions. Ask each side to state, in one sentence, exactly what their category includes and excludes. Most "data discrepancies" dissolve the moment the taxonomy is spoken aloud. This is close kin to getting your data types right, a category is only as clean as the rule that fills it.

flowchart TD
    A(["'Active customers' = 4,000?"]) --> B{"Whose definition?"}
    B --> C("Sales: counts at first purchase")
    B --> D("Finance: counts after payment clears")
    C --> E(["Same word, two taxonomies"])
    D --> E
    E --> F("Agree one written definition")
    F --> G(["Now the numbers can reconcile"])

A "data discrepancy" is often a taxonomy disagreement wearing a numbers costume. Leaders Loop

A worked example

Picture a regional retail business. The Tuesday meeting opens on a slide: "Average basket size up 18%, pricing strategy working." The room is ready to celebrate. (Figures here are illustrative.)

A leader who knows these four moves asks four small questions instead.

Outliers. "Is that the typical basket, or the average?" A single B2B bulk order of £40,000 had landed in the consumer data. Strip that one row and basket size is up 3%, not 18%.

Cleaning. "Are returns netted out?" They aren't; a data-load glitch dropped the refunds table that week, so every gross sale counts and no credit does. The 3% is really closer to flat, and ten raw rows would have shown it.

Normalisation. "Up 3% against what?" Week-on-week it's up; against the same week last year, adjusted per store (three new stores opened since), it's down 2%. The fair comparison is per-store and year-on-year, not the flattering raw total.

Taxonomy. "Does 'basket' include click-and-collect?" Marketing's number does; finance's doesn't. Same word, two categories, and until they agree one definition, the two dashboards will never match.

None of these questions needs a statistics degree. Each needs only the habit of asking, before you act on a number, which of the four quiet decisions might have gone sideways.

Frequently asked questions

Should I just delete outliers to clean up my charts?

No, that's how you delete the most important data point in the set. Flag outliers, then investigate the actual row. Remove it only if you can show it's an error (a duplicate, a test transaction, a mis-keyed value). If it's real, it stays, and it might be the story.

What's the difference between cleaning and normalising?

Cleaning makes data correct, no duplicates, no blanks, no text where numbers should be. Normalising makes data comparable, on a common scale or per-unit basis. Clean first; you can't fairly compare numbers you can't yet trust.

Do I need to understand the statistics to use any of this?

No. The value is in the four questions, "one row or the average?", "can I see raw rows?", "per what?", "whose definition?", not the formulas. Your analysts handle the maths. You supply the scepticism.

Why do two teams keep reporting different numbers for the same thing?

Almost always a taxonomy gap: the same word ("active," "churned," "qualified") means different things to each team. The fix is a single written definition both sides agree to, not a reconciliation spreadsheet that papers over the disagreement every month.

What's the single most useful habit here?

Asking to see ten raw rows behind any surprising summary. It costs ten minutes and catches outliers, dirty data and definition mismatches in one move, the cheapest insurance against acting on a number that was never true.

Related in the Toolkit

Data types (discrete/continuous, categorical/ordinal), a taxonomy is only as clean as the data type underneath each category.
Descriptive statistics (mean, median, mode, variance, SD), outliers are exactly why the median often beats the mean.
Distributions, percentiles & quartiles, the quartiles behind Tukey's outlier fences live here.
Correlation vs causation, uncleaned data and stray outliers manufacture correlations that vanish on inspection.
Regression (linear, non-linear, logistic), a single outlier can tilt an entire regression line; normalisation keeps features on equal footing.
First principles vs heuristics vs analogical reasoning, "is this one row or the average?" is first-principles thinking applied to a metric.
Reversible vs irreversible decisions, knowing your data is shaky tells you how hard to lean on it before an irreversible call.
Jobs-to-be-Done & needs research, clean taxonomies of customer needs are what keep qualitative research analysable.

Where to go next

John Tukey, Exploratory Data Analysis (1977), the seminal source for box plots and the look-before-you-model discipline; the origin of the outlier "fences."
Hadley Wickham, "Tidy Data" (Journal of Statistical Software, 2014), the clearest single statement of what "clean, well-structured data" actually means, free to read.
Geoffrey Bowker & Susan Leigh Star, Sorting Things Out (MIT Press, 1999), why categories are never neutral and what happens when an organisation forgets that.
Hadley Wickham, "Tidy Data" talk (ViennaR Meetup, 2019), a short, watchable walk through tidy data from the person who defined it.
DataCamp, "What Is Normalization in Machine Learning?", a plain, worked explainer of min-max scaling versus the z-score, with examples.