Data governance, quality, lineage & stewardship

Two teams pull "active customers" for the same Monday meeting and arrive with numbers eleven thousand apart. Neither is lying. One counted anyone who logged in this quarter; the other counted anyone with a paid plan. The argument that follows isn't about data; it's about the absence of an agreement on what the data means. That agreement, and the machinery that keeps it true, is what these four words are for.

The quick version

Governance is the rulebook: who gets to decide what a number means, and who is accountable when it's wrong.
Quality is whether the data is fit for the decision you're about to make: not "is it accurate?" but "is it accurate enough for this?"
Lineage is the receipt: where a number came from and every step it passed through, so you can trust it or trace the break.
Stewardship is the people part: named humans who own a domain's data the way someone owns a P&L.

The idea in depth: governance is decision rights, not a tool you buy

The reflex when "data governance" lands on a leadership agenda is to procure a platform. That gets the order backwards. The discipline's reference text, DAMA International's Data Management Body of Knowledge (DMBOK, 2nd ed., 2017), frames governance as the exercise of authority and control over the management of data assets. In plainer terms: who holds the decision rights, what the policies are, and how accountability is enforced. John Ladley, in Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program (2nd ed., 2019), makes the same point from the practitioner side: governance is a program of decision-making and accountability, not a one-off software install. A tool can record a definition; it cannot decide which definition wins.

So the move is: before you buy anything, pick three numbers your leadership team argues about and, for each, name one accountable owner and write down one definition. You have just done more governance than most "rollouts" achieve in a quarter.
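One lightweight way to record that decision is a small registry kept next to the metric. The sketch below is illustrative only: `MetricDefinition`, `MetricRegistry`, and the example entry are hypothetical names, not something DMBOK or Ladley prescribe. It assumes nothing more than that each metric name may carry exactly one definition and one named owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """One governed metric: a single written definition and a single accountable owner."""
    name: str
    definition: str
    owner: str  # a named person or role, not a team


class MetricRegistry:
    """Hypothetical in-memory registry that refuses a second definition for the same name."""

    def __init__(self):
        self._metrics = {}  # metric name -> MetricDefinition

    def register(self, metric):
        if not metric.owner.strip():
            raise ValueError(f"{metric.name!r} needs a named accountable owner")
        if metric.name in self._metrics:
            current_owner = self._metrics[metric.name].owner
            raise ValueError(
                f"{metric.name!r} is already defined; changes go through {current_owner}"
            )
        self._metrics[metric.name] = metric

    def get(self, name):
        return self._metrics[name]


registry = MetricRegistry()
registry.register(MetricDefinition(
    name="active_customers",
    definition="Paying account with activity in the last 30 days",
    owner="VP of Revenue",
))
```

The code is not the point; the refusal is. A second team cannot quietly attach a different meaning to the same name, which is the decision-rights idea in miniature. A shared spreadsheet with the same three columns does the job equally well.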

A limitation worth naming honestly: the centralised model, in which a single governance council approves every definition, scales badly. As organisations grew and analytics spread into every team, the bottleneck became obvious. Zhamak Dehghani's data mesh work (Martin Fowler's site, 3 December 2020) argues for federated governance instead: global rules everyone follows for interoperability, with the domain teams closest to the data owning its definitions and quality. There is real debate about how far to decentralise, and mesh is more philosophy than settled method. But the underlying instinct (push ownership to where the knowledge lives, keep only the shared rules central) is sound, and it pairs naturally with treating reversible decisions differently from irreversible ones.

The idea in depth: quality is fitness for use, not perfection

The most useful thing ever written about data quality is that it is not one thing. In their 1996 paper "Beyond Accuracy: What Data Quality Means to Data Consumers" (Journal of Management Information Systems, 12:4), Richard Wang and Diane Strong did something unusual: they asked the people who actually use data what "quality" meant to them, then sorted the answers empirically. They found it splits into four families: intrinsic (accuracy, believability), contextual (relevance, timeliness, completeness for the task), representational (is it understandable and consistent?), and accessibility (can you get it, securely?). Their headline, three decades on, still corrects the common mistake: teams obsess over accuracy and neglect the rest. A perfectly accurate report that arrives a week late, or in units nobody understands, is low-quality data.

Quality is defined by the consumer and the task, not by the database.

So the move is: stop asking "is this data good?" and start asking "good enough for what?" The churn figure you'd accept for a blog headline is not the one you'd accept before cutting a product line. Set the quality bar per decision, and write it down next to the metric. This is also why knowing whether you're in a probabilistic or a deterministic system matters: payroll needs near-perfect accuracy; a recommendation engine can tolerate noise and still earn its keep.
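"Good enough for what?" can be written down as a per-decision bar. The sketch below is a minimal illustration, not a standard: `QualityBar`, `fit_for_use`, and every threshold in it are made up for the example. It covers only two of the dimensions discussed above (accuracy and timeliness) to show one figure passing one bar and failing another.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class QualityBar:
    """How good a figure must be for one specific decision (illustrative fields)."""
    decision: str
    max_relative_error: float  # accuracy tolerance; 0.01 means within 1%
    max_staleness: timedelta   # timeliness tolerance


def fit_for_use(relative_error, staleness, bar):
    """A figure is fit for a decision only if it meets every dimension of that decision's bar."""
    return relative_error <= bar.max_relative_error and staleness <= bar.max_staleness


# Hypothetical bars: the thresholds show the contrast and are not recommendations.
blog_headline = QualityBar("blog headline", max_relative_error=0.10, max_staleness=timedelta(days=30))
cut_product_line = QualityBar("cut a product line", max_relative_error=0.01, max_staleness=timedelta(days=1))

# The same churn figure: believed to be within 3% of the truth, refreshed two days ago.
error, age = 0.03, timedelta(days=2)
headline_ok = fit_for_use(error, age, blog_headline)
product_ok = fit_for_use(error, age, cut_product_line)
print(headline_ok, product_ok)  # True False
```

Nothing about the data changed between the two calls; only the decision did. That is the sense in which quality belongs to the consumer and the task.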

The idea in depth: lineage is the receipt that makes trust auditable

When a number looks wrong, the slow, expensive question is "where did this come from?" Data lineage answers it by recording the data's journey: its origin, every transformation applied, and where it ends up. It is closely related to provenance, which focuses on origin and authenticity: who created this, when, and from what. Lineage is the fuller end-to-end map of the trip. The point of either is the same: trust you can check, not trust you have to assume.

Lineage earns its keep in two moments. The first is the fire drill: a dashboard reads zero on a Monday, and lineage lets you walk backwards from the chart to the broken upstream job in minutes instead of days. The second is the audit: when a regulator or a board asks how a figure was produced, lineage is the answer. This is the same reasoning-from-source instinct that first-principles thinking applies to arguments, turned on data.

So the move is: for your top handful of board metrics, commission a one-page lineage sketch showing the source system, the transformations in between, and the final report. You don't need a platform to start; a diagram a steward can defend is worth more than an automated tool nobody reads.

flowchart LR
  A(["Source system<br/>(CRM, billing, app)"]) --> B("Ingest & clean")
  B --> C("Transform:<br/>define 'active customer'")
  C --> D("Model & aggregate")
  D --> E(["Board dashboard"])
  E -.->|"reads wrong?"| C
  E -.->|"trace back"| A

Lineage as a receipt: each step is a place a number can be defined, broken, or traced. Leaders Loop
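The fire-drill walk described above, from the chart back to the broken upstream job, amounts to a traversal of a small graph. The sketch below assumes the lineage has been written down as a plain dictionary whose step names mirror the diagram; `LINEAGE`, `trace_upstream`, and the pretend failure are all hypothetical and stand in for whatever record a steward actually keeps.

```python
from collections import deque

# Hypothetical lineage for one board metric: each step maps to the steps it reads from.
LINEAGE = {
    "board_dashboard": ["model_and_aggregate"],
    "model_and_aggregate": ["define_active_customer"],
    "define_active_customer": ["ingest_and_clean"],
    "ingest_and_clean": ["crm", "billing", "app"],
    "crm": [],
    "billing": [],
    "app": [],
}


def trace_upstream(node, lineage):
    """Return every step upstream of `node`, nearest first (breadth-first walk)."""
    seen = []
    queue = deque(lineage.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a step shared by two branches is listed only once
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


# Pretend the ingest job failed overnight and the dashboard now reads zero.
failed_steps = {"ingest_and_clean"}
path = trace_upstream("board_dashboard", LINEAGE)
suspects = [step for step in path if step in failed_steps]
print(path)
print(suspects)  # ['ingest_and_clean']
```

The same dictionary serves the audit case: printing `path` for a board metric is the written answer to "how was this figure produced?"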

The idea in depth: stewardship is the human accountability layer

Governance writes the rules; stewardship is who lives by them day to day. The DMBOK describes a data steward as the person charged with keeping data content and metadata consistent with the organisation's policies and business rules. In practice that is usually a domain expert embedded in the business, not a central IT role. The distinction that trips leaders up: an owner is accountable (often a senior leader who can resolve disputes and fund fixes); a steward is responsible for the hands-on work of definitions, quality rules and exceptions. Confusing the two is how "everyone owns the data" quietly becomes "no one does."

So the move is: for each critical data domain (customers, revenue, headcount), name one steward by person, not by team. Give them a thin mandate (they arbitrate the definition and the quality bar) and a seat where those decisions get made. Stewardship fails when it's a title with no authority and no time.

flowchart TB
  G(["Governance council<br/>sets shared rules"]) --> O("Data owner<br/>accountable, funds fixes")
  O --> S("Data steward<br/>defines, monitors quality")
  S --> Q(["Fit-for-use data"])
  Q --> U(["Decision-makers<br/>trust the number"])

Who answers for what: rules flow down, accountability has names, trust is the output.

A worked example: the two "active customer" numbers

Return to that Monday meeting. Marketing's dashboard says 47,000 active customers; finance's says 36,000. (Figures illustrative.) The instinct is to declare one "right." The governed answer is better: both are correct for different questions, and the organisation never decided which question the board metric answers.

Here's the sequence a steward runs. First, governance: convene the owner (say, the VP of Revenue) to rule that the official "active customer" means a paying account with activity in the last 30 days. One definition, written down, owned. Second, quality: set the bar for this decision. The figure must reconcile to billing within 1% and be refreshed daily; "roughly right next week" won't do for a board number. Third, lineage: trace both numbers and discover that marketing's pulls from the product login table (no payment join) and finance's from billing. The gap was never an error; it was two definitions wearing the same name. Fourth, stewardship: the revenue steward publishes the canonical metric, retires the duplicate, and documents the 30-day, paying, billing-reconciled rule so the argument doesn't return in Q3. The eleven-thousand-customer mystery dissolves into a one-line definition, which is what these four disciplines are quietly for.
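The sequence above can be sketched in a few lines to show why both teams were "right". Everything here is an assumption made for illustration: the `Account` fields, the five toy accounts, and the reading of "reconcile to billing within 1%" as a relative difference against the billing figure. The real tables and counts would come from the lineage trace.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    account_id: int
    logged_in_this_quarter: bool
    has_paid_plan: bool
    days_since_last_activity: int


# Toy data, invented for illustration only.
ACCOUNTS = [
    Account(1, True, True, 3),
    Account(2, True, False, 10),
    Account(3, True, True, 45),
    Account(4, False, True, 120),
    Account(5, True, False, 2),
]


def marketing_active(account):
    """Marketing's implicit definition: anyone who logged in this quarter."""
    return account.logged_in_this_quarter


def finance_active(account):
    """Finance's implicit definition: anyone with a paid plan."""
    return account.has_paid_plan


def canonical_active(account):
    """The governed definition: a paying account with activity in the last 30 days."""
    return account.has_paid_plan and account.days_since_last_activity <= 30


def count(definition):
    return sum(1 for account in ACCOUNTS if definition(account))


def reconciles(reported, billing_total, tolerance=0.01):
    """True if `reported` is within `tolerance` (1% by default) of the billing figure."""
    return abs(reported - billing_total) <= tolerance * billing_total


counts = {
    "marketing": count(marketing_active),
    "finance": count(finance_active),
    "canonical": count(canonical_active),
}
print(counts)  # {'marketing': 4, 'finance': 3, 'canonical': 1}
```

Three definitions, three answers, no bug anywhere. The `reconciles` helper is the quality bar from the second step: with the article's illustrative figures, `reconciles(47_000, 36_000)` is false, which is the signal to go and trace the lineage.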

Frequently asked questions

Isn't this just bureaucracy that slows teams down?

It can be, if you govern everything. The fix is proportionality: apply the full machinery to the handful of numbers that move money or carry risk, and leave exploratory analysis loosely governed. Federated governance (the data-mesh idea) exists precisely to avoid a central committee throttling every team. Govern the few things that must agree; let the rest breathe.

Do we need to buy a data catalog or lineage tool first?

No. Tools accelerate governance; they don't create it. A definition no human has agreed to is just a field in software. Start with named owners, written definitions for your top metrics, and a hand-drawn lineage sketch. Buy tooling once you've outgrown the spreadsheet, not as a substitute for the decisions.

What's the difference between a data owner and a data steward?

The owner is accountable: usually a senior leader who can settle disputes and fund fixes. The steward is responsible: the domain expert who does the daily work of definitions, quality checks and exceptions. One owner and one steward per critical domain is a workable default. Vague collective ownership is the failure mode.

How is data quality different from data accuracy?

Accuracy is one ingredient. Wang and Strong's 1996 research showed that data consumers judge quality across four families: intrinsic, contextual, representational and accessibility. A figure can be perfectly accurate and still low-quality if it's late, unintelligible, or you can't get at it. Quality is fitness for the task, not correctness in isolation.

Where does this connect to AI and machine learning?

Directly. Models inherit the quality and bias of their training data, and an ungoverned pipeline produces ungovernable model behaviour. Lineage is how you answer "what data trained this?" when it matters. See "Algorithmic bias, explainability & model risk" for the downstream story.

Related in the Toolkit

Data strategy & data as an asset: governance is how you protect an asset; this is the case for treating data as one.
Algorithmic bias, explainability & model risk: what ungoverned, low-quality data does once it reaches a model.
Machine learning concepts & utility: why "garbage in" is not a cliché but a governance failure.
AI capabilities & limits (LLMs, generative AI, agents): why grounded, traceable data is the harness generative systems need.
Probabilistic vs deterministic systems: how much quality and accuracy a given system genuinely requires.
First principles vs heuristics vs analogical reasoning: lineage is first-principles thinking applied to a number.
Reversible vs irreversible decisions: how hard to govern a metric depends on the weight of the decision it feeds.
Jobs-to-be-Done & needs research: "fitness for use" begins with knowing the job the data is hired to do.

Where to go next

DAMA-DMBOK (2nd ed., 2017): the field's standard reference; dense, but the definitive map of governance, quality and stewardship as a single discipline.
John Ladley, Data Governance (2nd ed., 2019): the most practical book on actually standing up a program, covering roles, operating model, and how to keep it alive after launch.
Wang & Strong, "Beyond Accuracy" (1996): the seminal, freely readable paper that reframed quality as fitness for use; short and still the clearest thing on the subject.
Zhamak Dehghani, "Data Mesh Principles and Logical Architecture" (2020): the case for federated governance and domain ownership; read it for the limits of the centralised model.
"Keynote, Data Mesh" by Zhamak Dehghani (YouTube): a clear talk on why the old centralised data model breaks and what decentralised ownership looks like.