When an engineer tells you a feature is "done," they mean one of two very different things: either it works on their laptop, or it has been written in a way that proves it works, deploys itself safely, and tells you the moment it stops working. The gap between those two definitions of "done" is exactly what TDD, BDD and DevOps are about, and it is one of the few engineering topics where the right leadership question changes the outcome.

The quick version

  • TDD (Test-Driven Development) is a coding discipline: write a small failing test first, write just enough code to pass it, then tidy up and repeat. The rhythm is "red, green, refactor."
  • BDD (Behaviour-Driven Development) is TDD aimed at the business: describe what the system should do in plain "Given / When / Then" sentences everyone can read, so the spec, the test and the conversation are the same thing.
  • DevOps is a way of working that tears down the wall between the people who write software and the people who run it, automating the path from a developer's keyboard to live customers.
  • The research finding that ties them together: high-performing teams ship more often and fail less often. Speed and stability rise together, measured by four numbers known as the DORA metrics.

TDD: write the test before the code

The oldest of the three is the most counter-intuitive. In Test-Driven Development, the developer writes an automated test before the code that makes it pass. The practice was codified by Kent Beck in his 2003 book Test-Driven Development: By Example, and the loop has a name engineers still chant: red, green, refactor. Red: write a small test that fails, because the feature doesn't exist yet. Green: write the simplest code that makes it pass, sins and all. Refactor: clean up the mess now that a passing test will catch you if you break something. Repeat, one tiny behaviour at a time. Martin Fowler, who has watched the practice spread for two decades, describes those three steps as "the heart of the process."
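One turn of the loop can be sketched in Python. This is a minimal illustration, not a prescribed method: the function `loyalty_discount` and its "10% after two full years" rule are hypothetical, invented for the example, and the tests use plain `assert` statements so that a runner such as pytest could collect them.

```python
# RED: these two tests are written first. Run before the function below
# exists, they fail, which proves they are actually checking something.
def test_new_customer_gets_no_discount():
    assert loyalty_discount(total_pence=5000, years_as_customer=0) == 0


def test_two_year_customer_gets_ten_percent():
    assert loyalty_discount(total_pence=5000, years_as_customer=2) == 500


# GREEN, then REFACTOR: the simplest code that passes, tidied once the
# tests were green. The rule itself is a hypothetical business rule.
DISCOUNT_AFTER_YEARS = 2


def loyalty_discount(total_pence: int, years_as_customer: int) -> int:
    """Return the discount in pence: 10% once the customer has stayed long enough."""
    if years_as_customer >= DISCOUNT_AFTER_YEARS:
        return total_pence // 10
    return 0
```

The file shows the end state of the loop. The order of work is what matters: both tests existed, and were seen to fail, before the function body was written.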

Writing the test first sounds like extra work, and in the moment it is. But it forces a decision most teams skip: what does "working" actually mean for this piece of code? You cannot write the test until you have answered that. For a leader, the useful instinct is to treat tests as the specification rather than a clean-up chore, and to notice that a team with no automated tests is not moving fast; it is moving blind. The test suite is what lets you change code later without holding your breath.

An honest limitation. TDD is not free, and it is not religion. Tests are code too: a suite of brittle, over-specified tests can slow a team as much as none at all. The empirical record is genuinely mixed: controlled studies find smaller, less consistent effects than enthusiasts claim, and the benefit depends heavily on the team and the type of system. Treat TDD as a strong default for code that matters, not a mandate for every throwaway script.

flowchart LR
  A(["RED<br/>write a small<br/>failing test"]) --> B(["GREEN<br/>write just enough<br/>code to pass"])
  B --> C(["REFACTOR<br/>tidy up, tests<br/>still green"])
  C -->|"next tiny behaviour"| A

The TDD loop: one small behaviour at a time, with a passing test as the safety net before you tidy. Leaders Loop

BDD: tests a non-engineer can read

TDD has a quiet problem: its tests are written by developers, for developers, in the language of code. The business cannot read them, so the spec it agreed to and the tests the engineers wrote can drift apart without anyone noticing. Behaviour-Driven Development is the fix. It came out of Dan North's work in the mid-2000s, set out in his 2006 article "Introducing BDD" (first published in Better Software magazine). North's move was almost linguistic: replace the word "test" with the word "behaviour." As he put it, "I started using the word 'behaviour' in place of 'test' in my dealings with TDD and found that not only did it seem to fit but also that a whole category of coaching questions magically dissolved."

The practical artefact is the Given / When / Then template for capturing a scenario: Given some starting situation, When an event happens, Then expect some outcome. "Given an account with £100, when the customer withdraws £30, then the balance is £70" is a sentence a product manager, a tester and a developer can all read, argue about and agree on, and it can be wired up to run as an automatic test. Drawing on Eric Evans' idea of a ubiquitous language, BDD's real product is not the test; it is one shared vocabulary across business and engineering. So the question to ask, when a feature is specified, is a simple one: can a non-engineer read the acceptance criteria and recognise what they asked for? If the spec only exists in someone's head or in code, BDD drags it into the open.
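The withdrawal scenario above can be written as an automated check. What follows is a rough sketch in plain Python: the `Account` class is hypothetical, and the Given / When / Then steps appear only as comments, whereas a dedicated BDD tool would bind the plain-English sentences to code directly.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical account model, just enough to express the scenario."""
    balance_pounds: int

    def withdraw(self, amount_pounds: int) -> None:
        self.balance_pounds -= amount_pounds


def test_withdrawal_reduces_balance():
    # Given an account with £100
    account = Account(balance_pounds=100)

    # When the customer withdraws £30
    account.withdraw(30)

    # Then the balance is £70
    assert account.balance_pounds == 70
```

The comments carry the business sentence and the code beneath each one carries its meaning, so a disagreement about the rule shows up as a disagreement about three readable lines.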

BDD's trick was to stop calling them tests and start calling them behaviours, because a behaviour is something the business can argue with.

The honest limitation: BDD's plain-English scenarios are seductive, and teams routinely over-invest in them. Maintaining hundreds of Given/When/Then scenarios in tools like Cucumber can become its own slow, fragile burden, and the "anyone can write tests" promise rarely survives contact with reality. BDD earns its keep where business rules are genuinely contested and need a shared language; it is overkill for plumbing only engineers will ever touch.

DevOps: closing the gap between "built" and "running"

TDD and BDD make the code trustworthy. DevOps is about everything that happens after: the journey from a finished change to live software in front of customers, and the people who own each end of it. The name is a portmanteau of development and operations, coined by Patrick Debois around the first DevOpsDays conference in Ghent in 2009. The problem it names is organisational: developers were rewarded for shipping change, operations for keeping things stable, and the two goals pulled against each other across a wall, with releases thrown over it and blame thrown back.

The technical backbone is the deployment pipeline, set out by Jez Humble and David Farley in their 2010 book Continuous Delivery: an automated path that takes every change through build, test and release, so getting code to production is a routine, low-drama event rather than a quarterly act of courage. This is where the server-side services a team builds actually reach customers, riding on the same cloud infrastructure covered elsewhere in the Toolkit. Here, one question does most of the work: how long does it take, and how many people does it involve, to get a one-line change to a live customer? If the answer is "weeks" and "a release committee," you have found your bottleneck, and it is not the developers' typing speed.

flowchart LR
  A(["Developer commits<br/>a change"]) --> B(["Build<br/>(compile, package)"])
  B --> C(["Automated tests<br/>(TDD + BDD run here)"])
  C --> D{"All green?"}
  D -->|"No"| A
  D -->|"Yes"| E(["Deploy to<br/>production"])
  E --> F(["Monitor & alert<br/>(ops feedback)"])
  F -.->|"problem found"| A

The deployment pipeline: every change is built, tested and released automatically, and live monitoring feeds problems straight back to the developer.
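The gating logic of such a pipeline is simple enough to sketch. The Python below is an illustration only, under the assumption that each stage can be modelled as a function returning pass or fail; `run_pipeline` and the stage names are hypothetical, and a real pipeline would be configured in a CI/CD tool rather than hand-written like this.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a step that reports success (True) or failure (False).
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Because later stages never run after a failure, a change that fails
    its tests can never reach the deploy stage.
    """
    log: List[str] = []
    for name, step in stages:
        passed = step()
        log.append(f"{name}: {'ok' if passed else 'FAILED'}")
        if not passed:
            return False, log
    return True, log


# Hypothetical stages standing in for real build, test and deploy steps.
def build() -> bool:
    return True


def automated_tests() -> bool:
    return True


def deploy() -> bool:
    return True


succeeded, log = run_pipeline(
    [("build", build), ("automated tests", automated_tests), ("deploy", deploy)]
)
```

The design choice worth noticing is the early return: the "All green?" decision in the diagram is not a meeting, it is one line of code that runs on every change.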

The finding that ties them together: speed and stability rise as one

Here is the idea most worth carrying out of this piece. The instinct is that you choose: move fast and break things, or move carefully and stay stable. The largest body of evidence in software delivery says the opposite. The DORA programme (DevOps Research and Assessment) was founded by Nicole Forsgren, Jez Humble and Gene Kim, and its findings are gathered in their 2018 book Accelerate and the annual State of DevOps reports. It measures delivery performance with four metrics, two for speed and two for stability:

  • Deployment frequency: how often you successfully release to production.
  • Lead time for changes: how long it takes a commit to reach live.
  • Change failure rate: what share of deployments cause a failure.
  • Time to restore service: how quickly you recover when something breaks.

The first two measure velocity; the last two measure stability. The counter-intuitive result, repeated across years of data and summarised by Google Cloud's own "Four Keys" guidance, is that the best teams score well on all four at once: they ship more often, fail less often and recover faster. The practices in this article are how they do it: when releases are small, tested and frequent, each one is low-risk and easy to undo, so velocity and reliability stop fighting. So measure your team against these four numbers rather than asking "are we fast or are we safe?", a question the data says is the wrong one to begin with.
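To make the four numbers concrete, here is a rough Python sketch that derives them from a list of deployment records. The `Deployment` record and `dora_metrics` function are hypothetical, and the definitions are simplified assumptions: medians are used as the summary, and time to restore is measured from the failing deployment rather than from when the incident was detected.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None  # set only if a failure was fixed


def dora_metrics(deployments: List[Deployment], period_days: int) -> dict:
    """Summarise the four DORA metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    failures = [d for d in deployments if d.caused_failure]
    # Assumption: restore time runs from the failing deployment to the fix.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        # Speed
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deployments),
        # Stability
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Nothing here needs special tooling: commit timestamps, deployment timestamps and an incident log are enough to get a first reading.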

The honest limitation. The DORA four are correlational, not a magic formula, and they are easy to game: a team told to raise deployment frequency can split one release into ten meaningless ones. They measure delivery throughput, not whether you built the right thing. Use them as a system health check, never as a stick; the moment a metric becomes a target, people optimise the number instead of the outcome. The companion piece on engineering productivity & delivery metrics goes deeper on that trap.

A worked example

Picture a mid-sized insurance company, call it Northwind, shipping a customer portal. (Illustrative throughout; the scenario is a teaching example, not a real company.) Today, releases go out once a quarter, take six people and a printed runbook, and roughly one in three introduces a bug that takes a day to chase down. The business has concluded that software is just slow and risky, and plans around it.

Now trace the three disciplines through one small change: a "save quote for later" button. Under BDD, the product manager and an engineer write the acceptance criteria together, "Given a customer viewing a quote, when they tap Save, then it appears in their account for 30 days", and that sentence becomes an automated check. Under TDD, the engineer builds the feature one failing test at a time, so the moment the logic is wrong a test goes red on their own screen, not in front of a customer three months later. Under DevOps, the change flows through an automated pipeline that runs those tests and, if they pass, deploys it that afternoon, with live monitoring that would page the team within minutes if it misbehaved.
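Northwind's acceptance criterion might turn into a check along these lines. This is a sketch under stated assumptions: `CustomerAccount` and its methods are hypothetical, and the reading of "for 30 days" (still visible on day 30, gone on day 31) is an assumption that the product manager and engineer would have to settle when writing the scenario together.

```python
from datetime import date, timedelta
from typing import Dict, List

RETENTION_DAYS = 30  # taken from the acceptance criterion


class CustomerAccount:
    """Hypothetical account that remembers saved quotes for a limited time."""

    def __init__(self) -> None:
        self._saved_on: Dict[str, date] = {}  # quote id -> date it was saved

    def save_quote(self, quote_id: str, today: date) -> None:
        self._saved_on[quote_id] = today

    def saved_quotes(self, today: date) -> List[str]:
        # Assumption: a quote is still visible on day 30 and gone on day 31.
        limit = timedelta(days=RETENTION_DAYS)
        return [q for q, saved in self._saved_on.items() if today - saved <= limit]


def test_saved_quote_appears_for_30_days():
    # Given a customer viewing a quote
    account = CustomerAccount()
    viewed_on = date(2024, 1, 1)

    # When they tap Save
    account.save_quote("Q-1", today=viewed_on)

    # Then it appears in their account for 30 days
    assert account.saved_quotes(today=viewed_on + timedelta(days=30)) == ["Q-1"]
    assert account.saved_quotes(today=viewed_on + timedelta(days=31)) == []
```

Passing the date in as a parameter, rather than reading the system clock, is what lets the test check day 30 and day 31 without waiting a month.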

Map Northwind onto the DORA four and the shift is visible without a line of code in front of you: deployment frequency moves from quarterly to daily, lead time from months to hours, change failure rate falls because each release is tiny and well-tested, and time-to-restore drops because rolling back one small change is trivial. None of it required heroics, only making "done" mean tested, automated and observable instead of "works on my laptop." The win was organisational, not technical: the leader who asked "why does a one-line change take a quarter?" changed more than any new framework would have.

Frequently asked questions

What's the actual difference between TDD and BDD?

They are the same loop pointed at different audiences. TDD writes tests in code, for developers, to drive the design of a small piece of software. BDD writes the same idea as plain "Given / When / Then" scenarios a product manager or tester can read, so the test doubles as a shared specification. Many teams use both: BDD for the business-facing behaviour, TDD underneath for the code.

Is DevOps a job title, a team, or a way of working?

Properly, a way of working: a culture and a set of automated practices that bring development and operations together. The industry has muddied this by hiring "DevOps engineers" and standing up "DevOps teams," which can quietly rebuild the wall the idea was meant to remove. The useful test is the outcome: can a developer get a change safely to production without a hand-off to a separate, differently-incentivised team?

Do these only matter for big tech companies?

No. The smaller you are, the more they tend to pay off, because the cost of a slow, scary release process falls on a team that can least afford the lost time. A two-person startup with automated tests and a one-click deploy can out-ship a hundred-person company that releases by committee.

If my team isn't doing any of this, where do I start?

Measure, don't mandate. Ask how long a one-line change takes to reach customers and how often releases cause problems: these are the lead-time and change-failure ends of the DORA four. That conversation usually surfaces the real bottleneck (often a manual release process, not a lack of tests) and lets the team choose which discipline to adopt first.

Won't writing all these tests slow us down?

In the first weeks, yes; over any meaningful horizon, the evidence points the other way. The DORA data is the strongest counter to the instinct: the teams with the most discipline are also the fastest. The cost is real but front-loaded; the saving compounds.

Related in the Toolkit

These disciplines sit on top of the rest of the engineering stack: the tests and pipelines run on the server side and deploy onto cloud infrastructure, and the DORA four are best read alongside the wider delivery-metrics picture.