Product & data risk: the exposure you build in before you ship

A product team's instinct is to collect: more fields, more events, more history. Data feels like pure upside, the fuel for the next feature and the next model. But the moment you store someone's personal data, you have also taken on a liability: something that can leak, be misused, attract a regulator, or simply sit there as a target. Product and data risk is the discipline of seeing both sides of that ledger before you ship, and deciding on purpose how much exposure you are willing to build in.

The quick version

Data is a liability as well as an asset. The data you hold can be breached, misused or over-collected, so the safest data is the data you chose not to keep. "Don't collect it" is the strongest control there is.
Privacy and security are design decisions, not bolt-ons. Deciding what to collect, how long to keep it, and who can see it is cheap at the wireframe stage and ruinously expensive after launch.
Risk = how likely something goes wrong × how badly it hurts people. The job isn't to eliminate risk; it's to know which data carries the most and put your effort there.
Two anchors make it concrete: the NIST Privacy Framework (a structured way to find and manage privacy risk) and GDPR's principle of data minimisation (collect only what you need, keep it only as long as you need).

The idea in depth: data is a liability, not just an asset

Start with the reframe, because it changes every decision downstream. Most product cultures treat data as inventory you accumulate, and the accounting only ever shows the asset column. The discipline begins when you put a number, even a rough one, in the liability column. The clearest industry benchmark is IBM's annual Cost of a Data Breach Report, researched by the Ponemon Institute, which put the global average cost of a breach at USD 4.88 million in 2024, up about 10% on the prior year's USD 4.45 million, the largest jump since the pandemic. That figure is an average across hundreds of organisations and varies hugely by sector and size, so don't treat it as your number. Treat it as proof that the liability column is real and rising.

Make the cost visible at the point of decision. When a team proposes a new field (a phone number, a precise location, a date of birth), the design review should ask not only "what does this unlock?" but "what does it cost us to hold, and what happens to a customer if it leaks?" The strongest answer is often the simplest: don't collect it. Data you never stored cannot be breached, subpoenaed, or used against the person who gave it to you. This is the single cheapest security control in existence, and it is available only at design time.

Why is over-collection so dangerous? Look at how little it takes to identify someone. In a foundational Carnegie Mellon study, "Simple Demographics Often Identify People Uniquely" (Latanya Sweeney, 2000), Sweeney showed that roughly 87% of the US population could be uniquely identified from just three innocuous-looking fields: five-digit ZIP code, date of birth, and sex. None is "sensitive" alone; combined, they are a fingerprint. Risk hides not only in the obvious fields (passwords, card numbers) but in the accumulation of ordinary ones. So treat combinations as the unit of risk, and ask whether you need full precision at all. A birth year instead of a full date, or a region instead of a ZIP, often does the job with a fraction of the exposure.
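To see the combination effect on a small scale, here is a minimal sketch. The six records are made up for illustration (they are not Sweeney's data), and the generalisation rule (three-digit ZIP prefix, birth year only) is an assumption chosen to mirror the "birth year instead of a full date" idea.

```python
from collections import Counter

# Hypothetical, made-up records: (zip_code, date_of_birth, sex).
records = [
    ("15213", "1984-03-07", "F"),
    ("15213", "1984-11-21", "F"),
    ("15217", "1984-03-30", "F"),
    ("15217", "1990-06-02", "M"),
    ("15232", "1990-01-15", "M"),
    ("15232", "1990-09-09", "M"),
]

def unique_fraction(rows):
    """Share of rows whose combination of fields appears exactly once."""
    counts = Counter(rows)
    return sum(1 for row in rows if counts[row] == 1) / len(rows)

def generalise(row):
    """Reduce precision: keep a 3-digit ZIP prefix and the birth year only."""
    zip_code, dob, sex = row
    return (zip_code[:3], dob[:4], sex)

full_precision = unique_fraction(records)
reduced = unique_fraction([generalise(r) for r in records])

print(f"unique at full precision: {full_precision:.0%}")
print(f"unique after generalising: {reduced:.0%}")
```

On this toy data every record is unique at full precision and none is unique after generalising. Real datasets will not fall so neatly, but the same count is a cheap first check on whether a set of "harmless" fields singles people out.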

The safest data is the data you chose never to collect.

An honest limitation. Minimisation is powerful but it is not free, and it is not always right. The data you decline to collect is also data you cannot use later: to debug a rare failure, to train a model, to honour a future feature your customers will ask for. Minimisation trades optionality for safety, and reasonable teams will draw that line differently depending on what they're building. The point is not to collect nothing; it is to make the trade deliberately, field by field, rather than defaulting to "grab everything, decide later". "Later" rarely comes, and the data quietly compounds into a liability nobody chose.

Make privacy and security design decisions, not clean-up

If data is a liability, the cheapest time to manage it is before the schema exists. This is the core of Ann Cavoukian's Privacy by Design, whose seven foundational principles argue for being proactive not reactive and for privacy as the default setting: the user shouldn't have to do anything to be protected. It is fairly criticised as aspirational rather than prescriptive: it gives you the destination, not the route. But as a design stance it earns its place, because retrofitting privacy into a shipped product means migrating data, rewriting permissions, and renegotiating with users who already trusted you with too much.

The route, the prescriptive part, comes from threat modelling. Security practitioner Adam Shostack reduces it to four questions any team can ask of a design: What are we working on? What can go wrong? What are we going to do about it? Did we do a good job? Its strength is that it needs no specialist tooling: a product manager, an engineer and a whiteboard can run it on a feature in an afternoon. Put those four questions into your design-review template, with one privacy-specific twist: when you ask "what can go wrong?", ask it about the person whose data this is, not only about the system. A breach is one failure mode; so is using the data for a purpose the customer never agreed to.

flowchart LR
  A(["What are we<br/>working on?"]) --> B(["What can<br/>go wrong?"])
  B --> C(["What are we<br/>doing about it?"])
  C --> D(["Did we do<br/>a good job?"])
  D -.->|"revisit each release"| A

Shostack's four-question frame, a threat-modelling loop a product team can run on a feature without specialist tooling.

To make this auditable, lean on a structured framework. The NIST Privacy Framework v1.0 (January 2020) organises the work into five functions (Identify-P, Govern-P, Control-P, Communicate-P and Protect-P) and defines privacy risk as the likelihood that data processing causes a problem for an individual, weighed against the impact if it does. That framing points the risk lens at the person, not just the asset, and it is distinct from cybersecurity, because a perfectly secure system can still harm people through how it uses their data legitimately. So keep a living data inventory (Identify-P) of what you hold, why, and where it flows, because you cannot govern, control or protect data you cannot see. Most teams, when they finally draw the map, find they hold far more than anyone remembers deciding to.
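A data inventory does not need special tooling to start. The sketch below is one possible shape, not something the NIST framework prescribes: the `InventoryEntry` type, the field names and the system names are all hypothetical. It records what is held, why, and where it flows, then lists the entries nobody wrote a purpose for.

```python
from dataclasses import dataclass, field

@dataclass
class InventoryEntry:
    name: str       # what you hold
    purpose: str    # why you hold it ("" means nobody wrote a reason down)
    flows_to: list[str] = field(default_factory=list)  # where it goes

# Hypothetical entries for illustration only.
inventory = [
    InventoryEntry("email", "account login and receipts",
                   ["auth-service", "email-provider"]),
    InventoryEntry("date_of_birth", "", ["analytics-warehouse"]),
    InventoryEntry("precise_location", "",
                   ["analytics-warehouse", "ml-training"]),
]

def without_purpose(entries):
    """Names of entries that are held but have no recorded reason."""
    return [e.name for e in entries if not e.purpose.strip()]

print(without_purpose(inventory))
```

The entries with an empty purpose are the natural first agenda items for a review: either someone can state why the data is held, or it is a candidate for deletion.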

The metric that keeps you honest: only collect what you need

If one principle survives from all of this, it is data minimisation, and unlike most of the above, in much of the world it is law, not advice. Article 5 of the GDPR sets out the principles for processing personal data, and three of them read like a product checklist: data minimisation (personal data must be "adequate, relevant and limited to what is necessary"), purpose limitation (collected for specified, explicit and legitimate purposes and not further processed in ways incompatible with them), and storage limitation (kept in identifiable form no longer than necessary). Equivalent principles appear in laws well beyond Europe, from the UK GDPR to a growing patchwork of US state privacy laws, so the specifics depend on where your users are. Check the rules in your jurisdiction, or ask a qualified professional, rather than assume one rulebook fits all.

The teeth behind GDPR are worth stating plainly, because they move the liability from abstract to budgeted. Article 83(5) sets the upper tier of fines at up to €20 million or, for an undertaking, up to 4% of total worldwide annual turnover in the preceding financial year, whichever is higher, for infringing the basic processing principles, including minimisation. For a large company the percentage dwarfs the fixed figure. The number is a ceiling, rarely the actual penalty, but it reframes "should we collect this?" as a question with a price tag attached.
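The "whichever is higher" rule is simple arithmetic, and a few lines make the crossover visible. This is a sketch of the ceiling only, under the assumption that turnover is given in euros; the function name is hypothetical and the result says nothing about what a regulator would actually impose.

```python
FIXED_CEILING_EUR = 20_000_000
TURNOVER_PERCENT = 4

def fine_ceiling_eur(annual_turnover_eur):
    """Upper bound of the higher fine tier: the greater of EUR 20m or 4% of turnover."""
    percentage_ceiling = annual_turnover_eur * TURNOVER_PERCENT / 100
    return max(FIXED_CEILING_EUR, percentage_ceiling)

# Illustrative turnovers, not real companies.
for turnover in (100_000_000, 500_000_000, 2_000_000_000):
    print(f"turnover EUR {turnover:,}: ceiling EUR {fine_ceiling_eur(turnover):,.0f}")
```

The two figures meet at EUR 500 million of turnover. Below that the fixed EUR 20 million is the ceiling; above it the percentage takes over.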

Make minimisation a recurring habit, not a one-off audit. Put a retention period on every data type at the moment you design it, and a deletion job that enforces it, so data expires by default instead of accumulating forever. And run a periodic "data diet": review what you collect and ask, field by field, whether you still need it. The honest answer is usually that some of it was collected because it was easy, not because anything depends on it.
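Here is a minimal sketch of "a retention period on every data type, and a deletion job that enforces it." The data types, periods and record shape are hypothetical. One design choice is an assumption worth debating in your own system: a type with no retention entry is treated as already expired, so forgetting to set a period removes data instead of hoarding it.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention schedule, set when each data type is designed.
RETENTION = {
    "location_history": timedelta(days=30),
    "support_ticket": timedelta(days=365),
}

def purge_expired(records, now):
    """Return only the records still inside their retention window.

    A type missing from RETENTION gets a zero-length window (an assumption
    of this sketch), so unscheduled data does not survive the job.
    """
    kept = []
    for record in records:
        limit = RETENTION.get(record["type"], timedelta(0))
        if now - record["created_at"] < limit:
            kept.append(record)
    return kept

now = datetime(2025, 1, 31, tzinfo=timezone.utc)
records = [
    {"type": "location_history", "created_at": now - timedelta(days=5)},
    {"type": "location_history", "created_at": now - timedelta(days=45)},
    {"type": "support_ticket", "created_at": now - timedelta(days=45)},
    {"type": "debug_dump", "created_at": now - timedelta(days=1)},
]

remaining = purge_expired(records, now)
print([r["type"] for r in remaining])
```

In a real system the same rule would run as a scheduled job against the datastore and its backups. The part that matters is that the schedule lives next to the schema, so a new data type cannot ship without one.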

A worked example

Take a fictional fitness app, call it Stride. (Illustrative throughout; a teaching example, not a real product.) The growth team wants a "find friends nearby" feature, and the proposed design collects continuous precise GPS location, stored indefinitely, available to data science for "future use." On the asset ledger it looks great: a social hook and a model-training goldmine.

Run it through the lens above and the liability ledger lights up. What can go wrong, for the person? A continuous precise-location history is among the most sensitive data a phone can produce: it reveals where someone sleeps, works and meets. Combined with the account's date of birth and city, it is trivially re-identifiable: the Sweeney problem at industrial scale. "Stored indefinitely, for future use" fails purpose limitation and storage limitation in one line. And the breach math is unforgiving: if this leaks, the harm isn't a reset password, it's physical safety.

flowchart TD
  A(["Proposal: continuous precise GPS,<br/>kept forever, 'future use'"]) --> B{"Do we need this<br/>precision & history<br/>for the feature?"}
  B -->|"No, the feature only needs<br/>'are we near each other now?'"| C(["Minimise: coarse location,<br/>computed live, never stored"])
  B -->|"Yes, for some users"| D(["Purpose-limit + retention:<br/>opt-in, 30-day expiry,<br/>access logged"])
  C --> E(["Same feature,<br/>a fraction of the liability"])
  D --> E

The same feature, redesigned around what it actually needs, minimisation as a product decision, not a compliance tax.

The redesign doesn't kill the feature; it right-sizes the data. "Find friends nearby" needs to answer one question (are two people near each other right now?), which you can compute from coarse, live location and never store. The precise history was requested for a "future use" nobody had specified. So the design becomes: coarse location, computed in the moment, opt-in, and, for any history you genuinely justify, a 30-day expiry with access logged. Same user-facing feature, a fraction of the exposure. That is the discipline in one decision: the team didn't accept the risk or reject the feature; they asked what it actually needed, and declined to hold the rest.
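One way the "coarse, live, never stored" check could look is sketched below. Stride is fictional, so everything here is an assumption: the grid size, the function names, and the rule that "nearby" means the same or an adjacent grid cell. The coordinates are arbitrary city-centre points used only to exercise the code.

```python
import math

# Hypothetical grid size: 0.05 degrees is roughly 5.5 km north-south;
# east-west width shrinks as latitude increases.
CELL_DEGREES = 0.05

def coarse_cell(lat, lon):
    """Snap a precise position to a coarse grid cell; only the cell is passed on."""
    return (math.floor(lat / CELL_DEGREES), math.floor(lon / CELL_DEGREES))

def near_each_other(cell_a, cell_b):
    """True when two cells are the same cell or direct neighbours."""
    return abs(cell_a[0] - cell_b[0]) <= 1 and abs(cell_a[1] - cell_b[1]) <= 1

# Each position is reduced to a cell and compared; nothing is written to storage.
user_a = coarse_cell(51.5074, -0.1278)
user_b = coarse_cell(51.5155, -0.0922)
user_c = coarse_cell(48.8566, 2.3522)

print(near_each_other(user_a, user_b))
print(near_each_other(user_a, user_c))
```

A coarse cell is still personal data, so this reduces exposure without removing it. The sketch only shows that the feature's one question can be answered without a precise, persistent location history.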

Frequently asked questions

Isn't this just the security or legal team's job?

They own controls and compliance, but the decisions that create the risk are product decisions: what to collect, how long to keep it, who can see it. By the time data reaches security or legal, the liability is already designed in. The cheapest place to manage data risk is the wireframe and the schema, which is where product leaders work. Security and legal are partners you bring in early, not a clean-up crew you hand the mess to.

What's the difference between privacy and security?

Security is about stopping unauthorised access: keeping the wrong people out. Privacy is about what you do with data even when access is perfectly authorised: whether you collect it, why, and whether that use respects the person. A system can be flawlessly secure and still violate privacy by using legitimately-held data in ways the customer never agreed to. The NIST Privacy Framework draws this line deliberately: the two overlap, but they are not the same risk.

We're pre-launch with no users. Does this matter yet?

It matters most now. Every choice you make about data at the schema stage is cheap to change and expensive to reverse once real user data is flowing and features depend on it. Pre-launch is the one moment you can choose not to collect something at no cost. Teams that defer data-risk thinking until they "have time" inherit a data estate they didn't deliberately design, and then pay to migrate out of it.

How do I decide which data is risky without a privacy expert?

Start with a simple inventory and rank by two questions: how badly would a person be harmed if this leaked or were misused, and how likely is that given how the data flows? That is the NIST likelihood-times-impact framing, and you can do a first pass on a whiteboard. Location, health, financial, children's data, and any combination that re-identifies someone sit at the top. You don't need a specialist to find the worst exposure; you need to look honestly and put your effort there.
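The whiteboard pass translates directly into a few lines of code. In this sketch the 1-to-5 scale and every score are hypothetical, the kind of rough numbers a team might agree in a first session, and multiplying them is just one simple way to combine likelihood and impact into a ranking.

```python
# Hypothetical 1-5 scores from a first whiteboard pass; illustrative only.
fields = {
    "precise_location": {"likelihood": 3, "impact": 5},
    "date_of_birth": {"likelihood": 3, "impact": 3},
    "email": {"likelihood": 4, "impact": 2},
    "ui_theme": {"likelihood": 2, "impact": 1},
}

def risk_score(scores):
    """Likelihood of a problem multiplied by its impact on the person."""
    return scores["likelihood"] * scores["impact"]

# Highest-risk fields first: this is where review effort should go.
ranked = sorted(fields, key=lambda name: risk_score(fields[name]), reverse=True)

for name in ranked:
    print(f"{name}: {risk_score(fields[name])}")
```

The exact numbers matter less than the ordering they force. If the team argues about whether location is a 4 or a 5, it has already agreed that location belongs at the top.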

Doesn't minimising data hurt the product and the models?

Sometimes, and that's a real trade-off, not a trick question. Data you don't keep is data you can't analyse later. The discipline isn't "collect nothing"; it's "collect on purpose." Often you can keep the value while shedding the risk: aggregate instead of storing raw records, reduce precision (birth year, not full date), or set a retention window so you get the recent signal without an indefinite liability. Decide the trade field by field, rather than defaulting to hoarding because deleting feels like loss.

Related in the Toolkit

Product and data risk sits on top of the security and privacy foundations: the structured way to ask "what can go wrong?" lives in security fundamentals & threat modelling, and the legal backbone of minimisation is covered in data privacy & PII handling.

Security fundamentals & threat modelling: the four-question frame and the CIA triad that this article applies to data-collection decisions.
Identity & access management: controlling who can see the data you do decide to keep is half of "Protect-P."
Data privacy & PII handling (GDPR and equivalents): the legal detail behind minimisation, purpose limitation and storage limitation.
Data retention, residency & sovereignty: how long you keep data and where it lives, the operational side of storage limitation.
Cyber risk & incident response: what to do when a control fails and the liability you designed in becomes a live event.
Financial statements (P&L, balance sheet, cash flow): the literal ledger metaphor; data risk is a contingent liability, and breaches hit real lines.
Lean, Six Sigma, Kaizen & continuous improvement: the "data diet" is a continuous-improvement habit, not a one-off audit.
Hosting & cloud architecture: where data physically rests shapes both its security posture and its residency risk.

Where to go next

NIST Privacy Framework v1.0 (PDF): the structured, free, vendor-neutral way to identify and manage privacy risk; start with the five Functions and the risk definition.
GDPR Article 5, Principles relating to processing: the plain-language source for data minimisation, purpose limitation and storage limitation; short enough to read in five minutes.
"Simple Demographics Often Identify People Uniquely", Latanya Sweeney (2000): the study behind the 87% figure; the clearest argument for why "non-sensitive" fields combine into risk.
"Game On: Adding Privacy to Threat Modeling", Adam Shostack & Mark Vinkovits (YouTube): a practical talk on running threat modelling with privacy harms in scope, not just security ones.
Privacy by Design: The 7 Foundational Principles, Ann Cavoukian (PDF): the original two-page statement of "proactive, not reactive" and "privacy as the default."