IT service management & operations (ITIL): keeping the lights on

Every organisation that depends on software eventually discovers that the hard part isn't shipping it but the unglamorous business of running it. Someone reports that the login is broken, the payroll batch fails on the one night it can't, and a question lands on a leader's desk that no roadmap answers: who fixes this, how fast, and how do we stop it happening again? IT service management (ITSM) is the practice of answering that well, and ITIL is the most widely adopted set of words and habits for doing so.

The quick version

IT service management (ITSM) is the work of designing, delivering, supporting and improving the IT services people rely on (email, the payroll system, the website), treating "keeping it running" as a managed discipline, not luck.
ITIL (the IT Infrastructure Library) is the best-known ITSM framework: a shared vocabulary and a catalogue of good practices. It is descriptive guidance, not a law: you adopt the bits that fit.
Two ideas do most of the daily work: an incident (restore service fast: the login is down, get it back) versus a problem (find and remove the root cause so it stops recurring). Confusing the two is the classic mistake.
The risk is ritual: heavy process that slows good teams down without making services more reliable. ITIL 4 explicitly tries to fix this with principles like "keep it simple and practical."

The idea in depth: a "service" is a promise, not a server

The word that unlocks all of this is service. ITIL defines a service as a means of enabling value for customers by helping them get outcomes they want, without their having to own the costs and risks. Read that twice: the service is not the server, the database, or the code. It is the outcome ("I can pay my staff on time"), delivered so reliably that the person using it never has to think about the machinery underneath. Operations is the work of keeping that promise.

The practical move is to stop describing your technology as a pile of components and start describing it as a short list of services with owners. Most IT estates have no agreed answer to "what services do we provide, and who is accountable for each?" Write that list (payroll, email, the customer portal), name a human owner per service, and you have done more for reliability than another monitoring tool will.
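A service list with owners can be as small as a single file. The sketch below is one hedged way to hold it in Python; the `Service` class, the `services_without_owner` helper, the outcome wording and the owner names are all hypothetical and exist only to illustrate the idea, not to prescribe a tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Service:
    """One catalogue entry: an outcome people rely on, not a component."""
    name: str
    outcome: str
    owner: Optional[str] = None  # a named person, not a team alias


def services_without_owner(catalogue: list[Service]) -> list[str]:
    """Return the names of services nobody is accountable for yet."""
    return [s.name for s in catalogue if not (s.owner and s.owner.strip())]


# Hypothetical entries, for illustration only.
catalogue = [
    Service("payroll", "Staff are paid on time", owner="Priya"),
    Service("email", "People can send and receive mail", owner="Tom"),
    Service("customer portal", "Customers can log in and view orders"),
]

print(services_without_owner(catalogue))
```

Here the check flags the customer portal, the one service with no named owner. The point of the sketch is the question it forces, not the code: every row needs an outcome and a person.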

ITIL itself has a long, quietly British history. The acronym dates to the late 1980s, when the UK government's Central Computer and Telecommunications Agency documented good practice so public-sector IT stopped reinventing the same mistakes; the first version landed around 1989 (ITIL, Wikipedia; IBM, "What is ITIL?"). It passed through the Office of Government Commerce (v2 in 2000, v3 in 2007), then to AXELOS in 2014, which published the current ITIL 4 in 2019; the trademark is now owned by PeopleCert. The honest takeaway: ITIL is curated, commercial best practice with certification attached. It is battle-tested, but it is a product with an exam, not a scientific result.

Incident vs problem: the distinction that runs the day

If you learn one thing from ITIL, learn this. An incident is an unplanned interruption to a service or a drop in its quality: the site is down, the report won't generate. The goal of incident management is singular and ruthless: restore service as fast as possible. A restart, a failover, a rollback all count as success. You are explicitly not required to find the root cause to close an incident.

A problem is the underlying cause of one or more incidents. Problem management is the detective work that happens once the fire is out: find why it broke, record a workaround, and drive a permanent fix. When the root cause is identified but not yet fixed, ITIL calls it a known error, often logged in a known-error database so the next person who hits it gets the workaround instantly instead of re-diagnosing from scratch (Atlassian, "Problem vs incident management"; "Known error," Wikipedia).
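The three record types relate to each other in a way that is easy to show as data. The sketch below is a minimal, assumed model and not ITIL's official schema or any ticketing tool's API: the class names, fields, the `find_workaround` helper and the sample queue fault are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Incident:
    summary: str
    restored: bool = False  # closing needs restoration, not a root cause


@dataclass
class Problem:
    """The underlying cause behind one or more incidents."""
    title: str
    incidents: list[Incident] = field(default_factory=list)
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    fixed: bool = False

    @property
    def is_known_error(self) -> bool:
        # Cause identified, permanent fix not yet shipped.
        return self.root_cause is not None and not self.fixed


def find_workaround(symptom: str, problems: list[Problem]) -> Optional[str]:
    """Naive known-error lookup: match the symptom text against problem titles."""
    for p in problems:
        if p.is_known_error and symptom.lower() in p.title.lower():
            return p.workaround
    return None


# Hypothetical example: the incident closes as soon as service is back.
outage = Incident("Queue stuck", restored=True)
problem = Problem("Queue stuck when the nightly job overruns", incidents=[outage])
problem.root_cause = "Nightly job overruns and backs up the queue"
problem.workaround = "Restart the queue"

print(find_workaround("queue stuck", [problem]))
```

Note that `Incident` has no root-cause field at all, which mirrors the rule that an incident can close without one. Once `fixed` is set to `True` the problem stops being a known error and the lookup returns nothing.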

flowchart TD
  A(["Something breaks<br/>service degraded or down"]) --> B(["INCIDENT<br/>goal: restore service fast"])
  B --> C(["Restart / failover /<br/>rollback / workaround"])
  C --> D(["Service back<br/>incident closed"])
  D --> E{"Will it recur?"}
  E -->|"Yes, find the cause"| F(["PROBLEM<br/>root-cause investigation"])
  F --> G(["KNOWN ERROR<br/>cause found, workaround logged"])
  G --> H(["CHANGE<br/>permanent fix shipped"])

Restore first, diagnose second: incident management ends when service is back; problem management is the separate job of stopping it recurring. Leaders Loop

So separate the two clocks. During an outage, nobody hunts root cause: you restore, communicate, and protect the customer. Afterwards, a calmer review turns the incident into a problem record and, ideally, a permanent fix. Teams that blur these waste the outage arguing about why while users sit broken, then never circle back to prevent the next one. The same instinct underpins the blameless post-mortem popularised by Google's Site Reliability Engineering practice: the review asks how the system and process let the failure happen, not whose fault it was, because fear is what stops people reporting the near-misses that prevent the big one (Google SRE, incident management).

An incident asks "is it back yet?" A problem asks "why did it break, and how do we stop it?" Different questions, different clocks, different days.

ITIL 4: four dimensions, seven principles, and the anti-bureaucracy turn

The 2019 rewrite was partly an apology for what ITIL had become in some shops: a thicket of forms and approval boards that made IT feel like the department of "no." ITIL 4 reframes everything around a service value system and insists you balance four dimensions of service management: (1) organisations and people, (2) information and technology, (3) partners and suppliers, and (4) value streams and processes. The point of naming four is to stop teams from "solving" reliability by buying a tool (dimension two) while ignoring that the on-call rota is burning people out (dimension one) (ITSM.tools, "Four dimensions").

Sitting on top are seven guiding principles: focus on value; start where you are; progress iteratively with feedback; collaborate and promote visibility; think and work holistically; keep it simple and practical; and optimise and automate (University of Utah IT, "ITIL 4"). Treat ITIL as a menu, not a mandate: "start where you are" and "keep it simple and practical" are real principles. Find the one gap that hurts most (say, no agreed severity scale, so every outage is chaos) and adopt just the practice that closes it. If a step in your change-approval flow doesn't reduce real risk, ITIL 4 itself tells you to delete it. Resist the certification-driven urge to stand up a dozen "processes" at once.

An honest limitation. ITIL's biggest weakness is the one its critics never tire of: it can ossify into ceremony. A change-advisory board that meets weekly to rubber-stamp routine deployments adds latency and a false sense of safety, and the modern evidence runs the other way. The DORA research programme, summarised in Accelerate (Forsgren, Humble & Kim, 2018), found that high-performing teams deploy more often with lower change-failure rates, and that requiring approval from an external body (a manager or a change-advisory board) did nothing to improve stability: it had no measurable effect on change-failure rate, and it slowed delivery down. The book's blunt summary: external change approval is worse than having no approval process at all. ITIL and modern delivery aren't enemies, but implement ITIL as gatekeeping and you get the bureaucracy while losing the reliability. Treat it as shared language for "who owns what, and what we do when it breaks," and it earns its keep; treat it as a compliance regime, and good engineers route around you.

A worked example

Take a mid-sized retailer (call it Harbour & Co) whose online checkout went down for ninety minutes on a Friday. (Illustrative scenario; figures are for teaching, not a real company.) The first time, there was no plan: six engineers piled onto a call, three of them debugging the database while the on-call lead tried to work out who was even in charge. Service came back when someone restarted a stuck queue, but nobody could say why it stuck, and two weeks later it happened again.

Now run it through the ITIL lens. Incident management first: Harbour & Co agrees a severity scale (Sev-1 = revenue-impacting outage), names a single incident commander whose only job is coordination, and writes a one-line restoration playbook ("restart the queue, then page the data team"). The next outage lasts twelve minutes, not ninety, because the goal is restoration and someone owns the call. Problem management second: the next day, a blameless review finds that the queue backs up whenever a nightly job overruns. That is the problem. The workaround (an alert plus a documented restart) becomes a known error; the permanent fix (a queue limit) ships through change management the following sprint.
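The incident-management half of that example fits in a few lines of code. The sketch below is a hedged illustration: only the Sev-1 definition and the playbook text come from the scenario, while the Sev-2 and Sev-3 meanings, the `classify` and `open_incident` functions, the `"checkout"` key and the commander's name are assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # revenue-impacting outage (the definition used in the example)
    SEV2 = 2  # assumption: customer-facing but not revenue-impacting
    SEV3 = 3  # assumption: internal or minor fault


# One-line restoration playbooks keyed by service name (hypothetical key).
PLAYBOOKS = {"checkout": "restart the queue, then page the data team"}


@dataclass(frozen=True)
class IncidentRecord:
    service: str
    severity: Severity
    commander: str
    first_step: str


def classify(revenue_impacting: bool, customer_facing: bool) -> Severity:
    """Map two yes/no questions onto the agreed severity scale."""
    if revenue_impacting:
        return Severity.SEV1
    if customer_facing:
        return Severity.SEV2
    return Severity.SEV3


def open_incident(service: str, revenue_impacting: bool,
                  customer_facing: bool, commander: str) -> IncidentRecord:
    """Refuse to open an incident unless one person owns the call."""
    if not commander.strip():
        raise ValueError("an incident needs exactly one named commander")
    return IncidentRecord(
        service=service,
        severity=classify(revenue_impacting, customer_facing),
        commander=commander,
        first_step=PLAYBOOKS.get(service, "no playbook yet: escalate to the service owner"),
    )


record = open_incident("checkout", revenue_impacting=True,
                       customer_facing=True, commander="Sam")
print(record.severity.name, "|", record.first_step)
```

The design choice worth noticing is the `ValueError`: the sketch makes "who is in charge?" a precondition of the incident rather than something to work out mid-outage.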

flowchart LR
  A(["Before<br/>90-min outage, 6 people,<br/>no owner, recurs"]) --> B(["Add a severity scale<br/>+ one incident commander"])
  B --> C(["Next outage: 12 min<br/>restore-first, clear ownership"])
  C --> D(["Blameless review<br/>finds the real cause"])
  D --> E(["Known error logged<br/>then permanent fix shipped"])
  E --> F(["After<br/>faster recovery,<br/>stops recurring"])

The same outage, run as a managed service: clear roles shrink recovery time, and the post-incident review stops the repeat.

Nothing here required buying software or passing an exam. It required two ITIL ideas (restore-first incident handling, then problem management) and the discipline of a calm review. That is ITSM doing exactly what it's for: turning a recurring panic into a managed, improving service.

Frequently asked questions

Is ITIL the same as ITSM?

No. ITSM is the broad discipline of managing IT as a set of services; ITIL is one popular framework for doing ITSM (the most widely adopted, but not the only one). You can practise excellent service management while using only a handful of ITIL's ideas, or none of its terminology. Think of ITSM as the goal and ITIL as one well-stocked toolbox for reaching it.

Does ITIL clash with DevOps, Agile or SRE?

It needn't, but a heavyweight ITIL implementation can. The friction is almost always about change control: if ITIL becomes a slow approval board standing between developers and production, it fights fast delivery. ITIL 4 was rewritten partly to make peace here, and its "value streams" language overlaps with DevOps thinking. The DORA research points to automating and lightening change approval rather than abolishing the discipline: keep the shared vocabulary, drop the gatekeeping that buys you nothing.

Do we need to get certified?

Usually not as an organisation, and rarely as a precondition to benefiting. ITIL certification (now run by PeopleCert) helps individuals learn the vocabulary and is sometimes required by procurement or in managed-service contracts. But the ideas (incident versus problem, naming service owners, blameless reviews) deliver value whether or not anyone holds a certificate. Adopt the practices first; pursue certificates only where there's a specific reason.

What's the difference between an SLA and an SLO?

A service-level agreement (SLA) is a promise to a customer with consequences if you miss it: "99.9% uptime or you get a credit." A service-level objective (SLO) is an internal target you aim for, usually stricter than the SLA, so you get warning before you breach the contract. SRE practice adds the "error budget": the gap between perfect and your SLO is how much unreliability you're allowed to spend on shipping change.
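The arithmetic behind an error budget is short enough to write down. This is a minimal sketch under stated assumptions: the 99.9% SLA comes from the example above, while the 99.95% SLO, the 30-day window, the 12-minute outage and both function names are illustrative choices, not values any standard prescribes.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Downtime a given availability target permits over the period."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return period_days * MINUTES_PER_DAY * (1 - target_percent / 100)


def budget_remaining(slo_percent: float, downtime_so_far_minutes: float,
                     period_days: int = 30) -> float:
    """Error budget left after the downtime already spent this period."""
    return allowed_downtime_minutes(slo_percent, period_days) - downtime_so_far_minutes


sla_minutes = allowed_downtime_minutes(99.9)   # contractual promise
slo_minutes = allowed_downtime_minutes(99.95)  # assumed stricter internal target

print(f"SLA allows {sla_minutes:.1f} min, SLO allows {slo_minutes:.1f} min")
print(f"Budget left after a 12-minute outage: {budget_remaining(99.95, 12):.1f} min")
```

Over 30 days, 99.9% permits about 43.2 minutes of downtime and 99.95% about 21.6, so a single 12-minute outage spends more than half of the internal budget while the contract is still comfortably met. That gap is the early warning the SLO exists to provide.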

Where do we even start?

List your services, name an owner for each, and agree a severity scale and a single incident-commander role before the next outage forces you to. Those three moves cost nothing and deliver most of the early value. Resist standing up a dozen formal "processes" at once: ITIL 4's own "start where you are" and "keep it simple and practical" principles tell you to add process only where it removes real pain.

Related in the Toolkit

Service management sits on top of how the underlying technology works. You can't run a service well without knowing what's under it, from how the web works to the server-side systems your services depend on, and the improvement loop at its heart is the same one behind continuous improvement.

How the web works (browsers, DNS, HTTP, status codes): the failure modes your incident playbooks have to cover start here.
Client-side (HTML, CSS, DOM, cookies): many user-reported incidents are really front-end faults, so knowing this layer speeds triage.
Server-side (databases, APIs, services): the systems behind a "service" that operations actually keeps running.
Programming & query language literacy: enough fluency to read a log, write a query, and understand a root-cause finding.
Hosting & cloud architecture: where your services live shapes their reliability, failover and recovery options.
Financial statements (P&L, balance sheet, cash flow): service costs, SLAs and downtime all land on the numbers a leader is accountable for.
Lean, Six Sigma, Kaizen & continuous improvement: ITIL's "continual improvement" is the same improvement discipline applied to services.
CI/CD pipelines: automating change is how you keep ITIL's change control without the bureaucracy DORA warns against.

Where to go next

"ITIL 4 Explained", ITSM.tools: a free, plain-English tour of the service value system, four dimensions and guiding principles if you don't want to buy the official book first.
Accelerate, Forsgren, Humble & Kim (2018): the DORA research on what actually makes software delivery and operations reliable; read it alongside ITIL to avoid the gatekeeping trap.
Site Reliability Engineering, Google (free online): Google's classic on running services at scale, covering SLOs, error budgets, on-call and blameless post-mortems; the modern complement to ITIL operations.
"What's the Difference Between DevOps and SRE?", Google Cloud Tech (YouTube): a short, clear talk from the "class SRE implements DevOps" series on how reliability practice relates to delivery culture.