Here is the uncomfortable fact most launch celebrations skip over: the day a model goes live is the day it starts to decay. The world it learned from keeps moving (prices, customers, fraud tactics, the words people type), and the model does not move with it. MLOps (machine-learning operations) is the practice of running models as living systems rather than finished products: deploying them safely, watching them in production, and catching the slow rot before it reaches a customer or a balance sheet.

The quick version

  • A model is not a feature you ship and forget. Its accuracy depends on the world matching its training data, and the world won't hold still.
  • Three things keep it honest: a model registry (a versioned record of what's running and how it was built), monitoring (watching live behaviour, not just server uptime), and drift detection (alarms for when the world moves away from the training data).
  • Most ML failure is operational, not mathematical. The hard part isn't building the model; it's keeping an integrated system reliable in production.
  • So the leadership move is to fund the lifecycle, not the launch: budget for monitoring and retraining the same way you budget for the build.

The idea in depth: why a "finished" model isn't

Traditional software is broadly deterministic: the same input gives the same output until someone changes the code. A machine-learning system breaks that contract. Its behaviour depends on two things: the code and the data it learned from. The data is a moving target. (If that distinction is new to you, our Toolkit piece on probabilistic vs deterministic systems is the foundation under this one.)

The canonical warning came from Google. In Hidden Technical Debt in Machine Learning Systems (D. Sculley and colleagues, NeurIPS 2015), the authors put it bluntly: "it is dangerous to think of these quick wins as coming for free… we find it is common to incur massive ongoing maintenance costs in real-world ML systems." Their memorable point is that the trained model is a tiny box in the middle of a sprawling diagram of data pipelines, configuration, and monitoring, and the surrounding plumbing is where systems rot. Budget around that diagram, then, not the box. If a vendor or an internal team pitches "the model" as the deliverable, ask what happens to it in month seven.

The second idea names the rot. In supervised learning, a model assumes the relationship between inputs and the thing you're predicting stays stable. When it doesn't, you have concept drift. The reference survey here is Gama, Žliobaitė, Bifet, Pechenizkiy and Bouchachia, "A survey on concept drift adaptation" (ACM Computing Surveys, 2014), which defines drift as the case where "the relation between the input data and the target variable changes over time." A fraud model trained on last year's scams. A demand forecast trained before a competitor cut prices. A hiring screen trained on a workforce that no longer reflects who applies. None of these throw an error. They just get quietly, expensively wrong. The question to keep live, then, is "is the world still the world we trained on?" Treat it as something you measure, not an assumption you make once.
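To make "no error, just wrong" concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the usage rule, the 5-hour cut-off, and both tiny datasets are assumptions, not figures from any real system. Note that the autumn customers have exactly the same usage values as the spring ones; only the link between usage and cancelling has changed.

```python
# Hypothetical rule a churn model might have learned in spring:
# customers who barely use the product are the ones who cancel.
def spring_model(usage_hours: float) -> bool:
    """Predict churn (True) when monthly usage is below 5 hours."""
    return usage_hours < 5.0


def accuracy(rows) -> float:
    """Share of (usage_hours, actually_churned) rows the model gets right."""
    hits = sum(spring_model(usage) == churned for usage, churned in rows)
    return hits / len(rows)


# Illustrative data only. In spring, low usage really did precede cancelling.
spring = [(1, True), (2, True), (3, True), (8, False), (9, False), (12, False)]

# In autumn the inputs look identical, but heavy users now leave for a
# cheaper rival while some light users stay. The relationship has moved.
autumn = [(1, True), (2, False), (3, False), (8, True), (9, True), (12, False)]

print(accuracy(spring))  # 1.0: every spring row is predicted correctly
print(accuracy(autumn))  # only 2 of the 6 autumn rows are predicted correctly
```

The code runs cleanly in both seasons. Nothing crashes and nothing logs a warning; the only evidence of trouble is a number someone has to choose to compute.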

flowchart LR
  A(["Real world changes"]) --> B(["Live data drifts from training data"])
  B --> C(["Model accuracy silently degrades"])
  C --> D(["Bad decisions reach customers"])
  D -. "monitoring catches it here" .-> E(["Alert & retrain"])
  E --> F(["Refreshed model back in service"])
  F --> A
Drift is a loop, not an event: the world moves, accuracy slips, and the only question is whether you notice before your customers do. Leaders Loop

The idea in depth: the three tools that keep a model honest

MLOps is sometimes described as "DevOps for machine learning," and Google's widely-cited practitioner guide, MLOps: Continuous delivery and automation pipelines in machine learning, defines it as "an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops)." That guide sketches a maturity ladder, from Level 0 (a data scientist deploys a model by hand and walks away) to Level 1 (automated pipelines that retrain on fresh data) to Level 2 (full CI/CD, with model registries and monitoring wired in). You don't need Level 2 tomorrow. You need to know which rung you're on and whether it matches the stakes of the decision the model makes. Three capabilities define the climb.

The model registry is the least glamorous and most important. It's a versioned catalogue of every model: which one is live, what data and code produced it, how it scored before release, and who approved it. Without it, "which model made this decision in March?" has no answer, a problem that becomes acute the moment a regulator, an auditor, or an angry customer asks. Insist that every model in production traces back to its training data and its sign-off, the same way you'd expect any financial number to trace to a source. This is also where governance lives; the registry is the spine of the model-risk and explainability work that increasingly sits on a leader's desk.
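As a rough sketch of what a registry entry holds, the Python below models one as a plain record and answers the "which model made this decision in March?" question. The field names, dataset labels, commit references, approver names, and dates are all hypothetical; real registry products have their own schemas.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    training_data: str        # hypothetical dataset identifier
    code_commit: str          # hypothetical source-control reference
    validation_score: float   # score recorded before release
    approved_by: str
    live_from: date
    live_until: Optional[date] = None  # None means still in production


def version_live_on(registry, name: str, day: date) -> Optional[ModelVersion]:
    """Return the version of `name` that was serving on `day`, or None."""
    for entry in registry:
        still_live = entry.live_until is None or day < entry.live_until
        if entry.name == name and entry.live_from <= day and still_live:
            return entry
    return None


# Illustrative entries only.
registry = [
    ModelVersion("churn", "1.0", "customers-2023Q4", "a1b2c3", 0.80,
                 "j.doe", date(2024, 1, 10), date(2024, 4, 2)),
    ModelVersion("churn", "1.1", "customers-2024Q1", "d4e5f6", 0.82,
                 "a.khan", date(2024, 4, 2)),
]

march = version_live_on(registry, "churn", date(2024, 3, 15))
print(march.version, march.training_data, march.approved_by)
```

The point of the sketch is the shape of the record, not the lookup: every live model carries its data, its code, its pre-release score, and a named approver.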

Monitoring for ML is not the same as monitoring for software. Your dashboards may show the service is up, fast, and error-free while the model is confidently wrong. Real ML monitoring watches the inputs (has the data coming in changed shape?), the outputs (is the model suddenly predicting "approve" twice as often?), and, when you can get it, the outcomes (were the predictions actually right?). The catch is honesty about the lag: for a loan model, you may not know a prediction was wrong until the loan defaults months later. Watch the leading signals you can see now, the input and output distributions, instead of waiting for a ground truth that arrives too late to help.
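One way to watch an output signal without waiting for ground truth is to compare how often the model says "approve" now against how often it did at launch. The sketch below is a simplified assumption of how that check might look; the function names and the 1.5x tolerance are invented for illustration, and a real threshold would be set per model.

```python
def approval_rate(predictions) -> float:
    """Fraction of predictions that are 'approve'."""
    return sum(1 for p in predictions if p == "approve") / len(predictions)


def output_shift(baseline, live, max_ratio: float = 1.5) -> bool:
    """True when the live approval rate has moved by more than `max_ratio`
    in either direction relative to the baseline rate."""
    base, now = approval_rate(baseline), approval_rate(live)
    if base == 0 or now == 0:
        # A ratio is undefined; any difference at all counts as a shift.
        return base != now
    ratio = now / base
    return ratio > max_ratio or ratio < 1 / max_ratio


# Illustrative data: 3 approvals in 10 at launch, 6 in 10 this week.
launch_week = ["approve"] * 3 + ["decline"] * 7
this_week = ["approve"] * 6 + ["decline"] * 4

print(output_shift(launch_week, this_week))    # True: the rate has doubled
print(output_shift(launch_week, launch_week))  # False: nothing has moved
```

This check needs no outcome data at all, which is exactly why it is useful for a loan model whose real results arrive months later.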

Drift detection is monitoring with a tripwire. It compares today's live data against the training baseline and raises an alarm when they diverge far enough to matter. Crucially, it tells you something changed, not why, which is why a drift alert should trigger a human investigation, not an automatic retrain. Sometimes the fix is a new model; sometimes it's discovering an upstream data pipeline broke and is feeding the model garbage.
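A tripwire needs a number to trip on. One common choice, used here purely as an example and not something this article prescribes, is the Population Stability Index (PSI), which compares the share of values falling into each bin today against the training baseline. The bin edges and spend figures below are invented, and the 0.25 alert level is a widely quoted rule of thumb rather than a law.

```python
import math


def psi(baseline, live, edges) -> float:
    """Population Stability Index between two samples.

    `edges` are ascending bin boundaries shared by both samples.
    """
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below the value.
            counts[sum(v >= e for e in edges)] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    base_share, live_share = shares(baseline), shares(live)
    return sum((lv - bv) * math.log(lv / bv)
               for bv, lv in zip(base_share, live_share))


# Illustrative monthly spend per customer, binned at 20 and 40.
training_spend = [10, 15, 25, 30, 35, 45, 50, 12, 28, 33]
live_spend = [10, 45, 50, 55, 60, 42, 48, 30, 41, 52]
edges = [20, 40]

score = psi(training_spend, live_spend, edges)
print(round(score, 2))
print("investigate" if score > 0.25 else "no action")  # rule-of-thumb cut-off
```

The score says the mix of spend has moved a long way from training. It says nothing about why, which is the human's job.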

The idea in depth: name the limitation honestly

It would be neat to say "monitor for drift and you're safe." You aren't, and pretending otherwise sets a trap. Distinguishing a genuine, harmful shift from ordinary noise is hard, and the tools generate false alarms. The Gama survey is candid that adaptation strategies trade off against each other: react too fast and you chase noise; react too slowly and you miss real change. There's no setting that's right for every case.
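The trade-off can be seen in a few lines. The sketch below assumes a daily drift score and fires an alert only after a run of consecutive readings above a threshold; the `patience` parameter, the scores, and the 0.25 threshold are all hypothetical. It is one simple way to damp false alarms, not a recommendation from the survey.

```python
def first_alert(scores, threshold: float, patience: int):
    """Index of the reading at which an alert fires, or None.

    An alert fires only after `patience` consecutive readings
    strictly above `threshold`.
    """
    streak = 0
    for i, score in enumerate(scores):
        streak = streak + 1 if score > threshold else 0
        if streak >= patience:
            return i
    return None


# Illustrative daily drift scores: a one-day blip at index 1,
# then a sustained shift starting at index 4.
daily = [0.05, 0.30, 0.04, 0.06, 0.28, 0.31, 0.35, 0.40]

print(first_alert(daily, 0.25, patience=1))  # 1: fires on the blip
print(first_alert(daily, 0.25, patience=3))  # 6: skips the blip, but reacts two days late
```

Neither setting is wrong. A patient alert wastes fewer investigations and costs two extra days of bad predictions; which price is acceptable depends on what the model decides.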

A second limit worth saying out loud: drift on the inputs doesn't always mean the model got worse, and stable inputs don't guarantee it's still good. The link the model relies on can rot even when the data looks familiar. The honest takeaway is that MLOps reduces the time you spend wrong; it does not deliver a model that polices itself. The point isn't to remove human judgement; it's to make sure a human is looking at the right signal at the right time. A model that decides who gets a loan deserves tighter watch than one that recommends a playlist. That sorting, matching oversight to stakes, is a leadership call, and it leans on whether a decision is reversible or irreversible.

A worked example: the churn model that aged badly

A subscription business builds a model to flag customers likely to cancel, so the retention team can call them first. It launches in spring with strong numbers and everyone moves on. (Figures below are illustrative.)

By autumn, the retention team is quietly grumbling that the call list feels "off." Nobody escalates, because nothing is broken: the model runs nightly, the dashboard is green, predictions arrive on time. What no one is watching is that a competitor launched a cheaper plan in July. The reason customers leave has shifted from "too expensive for what I use" to "there's a better deal next door." That is textbook concept drift. The model, trained on spring behaviour, keeps confidently flagging the old pattern. Say its useful accuracy has slid from a launch figure of roughly 80% to the low 60s; the team burns calls on customers who were never leaving while missing the ones who are.

An MLOps-equipped version of this team catches it in week three, not month five. Input monitoring shows the mix of cancellation reasons has shifted sharply from the training baseline; a drift alert fires; an analyst investigates, confirms the competitor effect, and the model is retrained on recent data. The model registry means they can see exactly which version was live during the bad stretch, quantify what it cost, and roll forward with a clean audit trail. The Monday-morning version of this for a non-technical leader: when you sign off an ML project, ask three questions: "How will we know if it stops working? Who gets the alert? And what's our budget to retrain it?" If the answers are vague, you've funded a launch, not a system.

flowchart TB
  subgraph BUILD ["Build (one-off)"]
    T(["Train & validate model"]) --> R(["Register version in model registry"])
  end
  subgraph RUN ["Run (forever)"]
    D(["Deploy to production"]) --> M(["Monitor inputs, outputs, outcomes"])
    M --> W{"Drift or decay detected?"}
    W -- "No" --> M
    W -- "Yes" --> I(["Human investigates the cause"])
  end
  R --> D
  I --> T
The model lifecycle: building is a project with an end; running is a loop without one. Most of the cost lives in the bottom half.

Frequently asked questions

Isn't MLOps just DevOps with a new label?

It borrows the culture (automation, version control, continuous delivery), but it adds problems DevOps never had. In normal software, code is the only thing that changes. In ML, the data changes too, and it changes on its own. That's why MLOps adds data and model versioning, drift detection, and retraining pipelines on top of the usual deployment machinery.

How often should we retrain a model?

There's no universal cadence, and a fixed schedule is a weak substitute for watching the signal. Fast-moving domains (fraud, ad pricing, anything adversarial) may need frequent refreshes; a stable industrial sensor model might run for a year untouched. The better posture is to retrain in response to monitored drift and decay rather than on the calendar: let the system tell you when it's stale.

We're not a tech company. Do we really need this?

If a model influences a real decision (pricing, credit, hiring, demand planning), then yes, because the failure mode is silent. You can run a lightweight version: agree what "working" means, watch a couple of input and output metrics, and name a person who owns the alert. The discipline matters more than the tooling.

What's the single cheapest thing to start with?

A definition of "working" and an owner for it. Most ML disasters aren't a missing platform; they're that nobody was assigned to notice the model had drifted. Write down the metric, the threshold, and the name. That alone moves you off Google's "Level 0."
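Written down, that definition can be as small as the record below. This is a sketch under assumptions: the class name, the metric wording, the 0.70 floor, and the owner title are all invented, and it assumes a metric where higher is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkingDefinition:
    model: str
    metric: str       # what "working" is measured by
    threshold: float  # minimum acceptable value (higher is better)
    owner: str        # the named person who gets the alert

    def breached(self, observed: float) -> bool:
        """True when the observed metric has fallen below the agreed floor."""
        return observed < self.threshold


# Illustrative values only.
churn_rule = WorkingDefinition(
    model="churn",
    metric="share of flagged customers who actually cancel",
    threshold=0.70,
    owner="retention analytics lead",
)

if churn_rule.breached(0.62):
    print(f"Alert {churn_rule.owner}: {churn_rule.metric} is below {churn_rule.threshold}")
```

Four fields and one comparison. The hard part is agreeing what goes in them, not writing the code.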

Can't we just automate the whole loop and remove the humans?

You can automate detection and retraining, but auto-retraining on a bad signal can make things worse: if a broken pipeline is feeding the model garbage, retraining bakes the garbage in. Keep a human between "something changed" and "ship a new model," especially when the decision is hard to reverse.
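A human gate does not have to be elaborate. The sketch below assumes three possible findings from an investigation and maps each to a next step; the category names and the wording of the actions are hypothetical. The detail that matters is that a drift alert with no finding yet leads to a hold, never to a retrain.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    WORLD_CHANGED = "world changed"
    PIPELINE_BROKEN = "pipeline broken"
    NOISE = "noise"


def next_step(drift_alert: bool, finding: Optional[Cause]) -> str:
    """Decide what happens after a drift check.

    `finding` is what a human concluded, or None if nobody has looked yet.
    """
    if not drift_alert:
        return "keep serving"
    if finding is None:
        return "hold: awaiting human investigation"
    if finding is Cause.PIPELINE_BROKEN:
        return "fix the pipeline; do not retrain on bad data"
    if finding is Cause.WORLD_CHANGED:
        return "retrain on recent data"
    return "keep serving; review the alert threshold"


print(next_step(True, None))
print(next_step(True, Cause.PIPELINE_BROKEN))
print(next_step(True, Cause.WORLD_CHANGED))
```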
