Hosting & cloud architecture: a leader's guide

Somewhere right now, a vendor is telling your team their product is "in the cloud," and a slide is implying that this makes it safer, cheaper and someone else's problem. Two of those three are negotiable; the third is simply wrong. The cloud doesn't make outages or security risks disappear; it moves the line between what the provider handles and what you handle, and a surprising number of expensive failures come from a leader misreading exactly where that line sits.

The quick version

Hosting is just where your software runs so other people can reach it. "The cloud" means renting that capacity on demand from a provider's data centres instead of buying and racking your own servers.
You rent the stack in layers, from raw machines (IaaS) up to a finished app you just log into (SaaS). The higher you rent, the less you manage and the less you control.
Cloud is a shared-responsibility deal: the provider secures the cloud, you secure what you put in it. Most cloud breaches are the customer's half, not the provider's.
Reliability doesn't come from buying a perfect machine; it comes from redundancy, which means assuming every part will eventually fail and building so the whole keeps running when one does.

The idea in depth

Start with the plain meaning. To "host" an application is to keep it running on a computer that's reachable over a network, so a browser or another service can use it. For decades that meant a company bought physical servers, put them in a room with cooling and a generator, and paid people to keep them alive. Cloud computing replaced the buying with renting. The US National Institute of Standards and Technology, in its short, widely-cited SP 800-145 (2011), defines it as "on-demand network access to a shared pool of configurable computing resources … that can be rapidly provisioned and released with minimal management effort." In plainer terms: capacity you turn on in minutes, pay for by the hour, and turn off when you're done.

So stop treating "are we on the cloud?" as the decision. Almost everything is, somewhere. The real decisions are how much of the stack you rent, and how you've wired it together, which is what the rest of this piece is about. For where that capacity actually lives and how requests reach it, our note on how the web works is the companion to this one.

You rent the stack in layers (IaaS, PaaS, SaaS)

NIST's SP 800-145 also names the three classic service models, and they're best understood as how high up the stack you start renting. At the bottom, Infrastructure as a Service (IaaS) rents you raw virtual machines, storage and networking; you still install the operating system, the database and your code, and you patch all of it. In the middle, Platform as a Service (PaaS) rents you a managed environment: you push your code and the provider runs the servers and operating system beneath it. At the top, Software as a Service (SaaS) rents you a finished application (your payroll tool, your CRM) where you control little more than your settings and your data.

The trade-off is the same one all the way up: the higher you rent, the less you have to manage and the less you can change. A common analogy is travel (own a car, lease a car, or take a taxi), but the cleaner mental model is a kitchen. IaaS is renting a kitchen and cooking from scratch; PaaS is a meal-kit where the prep is done and you just cook; SaaS is the restaurant.

flowchart BT
    subgraph SaaS["SaaS, you manage: your data & settings"]
      direction LR
      S1("Finished app, just log in")
    end
    subgraph PaaS["PaaS, you manage: your code"]
      direction LR
      P1("Push code; platform runs the rest")
    end
    subgraph IaaS["IaaS, you manage: OS, runtime, code, patching"]
      direction LR
      I1("Rent raw machines & storage")
    end
    IaaS --> PaaS --> SaaS

The higher up you rent, the less you manage, and the less you control. (Figure: Leaders Loop)
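The layers can also be written down as a small lookup of who manages what. The sketch below is illustrative only: the four component names and the exact split are simplifying assumptions based on the descriptions above, not any provider's official matrix.

```python
# Who manages each part of the stack at each rental layer.
# The component list and the split are simplified assumptions for illustration.
COMPONENTS = ["hardware", "operating system", "application code", "data and settings"]

# For each layer, the position in COMPONENTS where the customer's share begins.
CUSTOMER_STARTS_AT = {"IaaS": 1, "PaaS": 2, "SaaS": 3}


def customer_manages(layer: str) -> list[str]:
    """Return the components the customer still manages at this layer."""
    return COMPONENTS[CUSTOMER_STARTS_AT[layer]:]


def provider_manages(layer: str) -> list[str]:
    """Return the components the provider runs at this layer."""
    return COMPONENTS[:CUSTOMER_STARTS_AT[layer]]


if __name__ == "__main__":
    for layer in ("IaaS", "PaaS", "SaaS"):
        print(f"{layer}: you manage {', '.join(customer_manages(layer))}")
```

Whichever layer you pick, the last item never leaves the customer's list.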

"The cloud" is a shared-responsibility deal, and the limit of it

The most consequential misunderstanding in this whole area is who keeps things secure. Cloud providers publish a shared-responsibility model precisely because customers keep getting it wrong. AWS's version draws the line as security of the cloud versus security in the cloud: the provider is responsible for the infrastructure (the hardware, the data-centre buildings, the underlying software), while the customer is responsible for what they put on top: their data, their access controls, their configuration, who can see what.

That line moves depending on which layer you rented. Buy IaaS and you own patching the operating system; buy SaaS and the provider handles far more of it. But one thing never transfers: your data and who can access it stay yours to protect. The uncomfortable evidence is that the headline cloud breaches are overwhelmingly failures on the customer's half of the bargain (a storage bucket left public, a credential committed to code, an over-permissive role), not a provider failing to lock the data centre. Here's the question to put to any vendor or team, bluntly: in this setup, who is responsible for patching, and who is responsible for making sure our data isn't exposed by a misconfiguration? If the answer is "the cloud handles it," that answer is wrong, and you've found a risk.
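To show what checking the customer's half can look like, here is a minimal sketch that scans an inventory for two of the mistakes named above. The `Resource` record, its field names and the sample inventory are all hypothetical; real providers expose this information through their own tools and APIs, which are not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A hypothetical inventory record; field names are assumptions."""
    name: str
    kind: str                          # "bucket" or "role" in this sketch
    public: bool = False
    permissions: tuple[str, ...] = ()


def find_risks(resources: list[Resource]) -> list[str]:
    """Flag two customer-side mistakes: public buckets and wildcard roles."""
    risks = []
    for r in resources:
        if r.kind == "bucket" and r.public:
            risks.append(f"{r.name}: storage bucket is publicly readable")
        if r.kind == "role" and "*" in r.permissions:
            risks.append(f"{r.name}: role grants every permission")
    return risks


# Made-up sample data for illustration.
inventory = [
    Resource("customer-exports", "bucket", public=True),
    Resource("invoices", "bucket"),
    Resource("ci-deployer", "role", permissions=("*",)),
    Resource("report-reader", "role", permissions=("storage:read",)),
]

for risk in find_risks(inventory):
    print(risk)
```

Nothing in this check touches the provider's side of the line; every finding is something only the customer can fix.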

Here's the honest limitation. Drawing this line cleanly is genuinely hard, and providers have a commercial incentive to draw it generously in their own favour. The model is a real engineering framework, but it's also a liability boundary. Treat it as a prompt for a conversation with your own people, not as a contract that absolves you. The deeper "who decides and who answers for it" questions live in decision rights & escalation, and they apply to infrastructure as much as to strategy.

Reliability is redundancy, not perfection

The instinct of most non-technical leaders is to want a system that "doesn't go down." That target is the wrong one, and the people who run the largest systems on earth say so plainly. Amazon's long-serving CTO Werner Vogels built a whole design philosophy on a single line, "everything fails, all the time", meaning individual machines, disks, even whole data centres will break, so you design expecting it rather than hoping against it. Google's Site Reliability Engineering book is blunter still: "100% is probably never the right reliability target: not only is it impossible to achieve, it's typically more reliability than a service's users want or notice." And the cost curve is brutal: the book's "Embracing Risk" chapter notes that "an incremental improvement in reliability may cost 100x more than the previous increment."

So reliable systems don't chase a flawless component. They use redundancy: more than one of everything that matters, spread across independent failure zones, so that when one part dies, and it will, another carries the load. Cloud providers sell this directly: a "region" is a geographic area, and its "availability zones" are physically separate data centres within it, so you can spread your system across them and a fire or flood in one doesn't take you down.
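The arithmetic behind redundancy is short enough to sketch. The function below assumes the copies fail independently of each other (which is what separate zones are meant to approximate) and that traffic switches over instantly; real systems fall short of both, so treat the result as an upper bound.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` independent replicas is up.

    `single` is the availability of one replica as a fraction (0.99 = 99%).
    Assumes independent failures and instant failover.
    """
    all_down = (1 - single) ** copies   # every replica is down at once
    return 1 - all_down


if __name__ == "__main__":
    for copies in (1, 2, 3):
        print(f"{copies} copies of a 99% machine: {combined_availability(0.99, copies):.4%}")
```

Under these assumptions, two ordinary 99% machines together reach 99.99%, which is why adding a second modest server usually beats hunting for one better server.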

The reliable systems aren't the ones that never fail. They're the ones built so that a failure of any single part doesn't matter.

The practical move is to reframe the conversation with your engineers. Don't ask "will it ever go down?"; ask "what's our target, in plain numbers, and what does the next nine cost?" Three nines of availability (99.9%) allows about nine hours of downtime a year; four nines (99.99%) allows under an hour, and the gap between them can multiply your bill. That's a business trade-off about how much downtime your customers will actually tolerate, which is your call to make, not a purely technical one. It's the same expected-value logic as any other investment; see decision theory & expected value.
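The downtime figures for each "nine" come from one line of arithmetic, sketched here. The function name is illustrative, and the calculation assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down and still meet its target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}% allows {hours:.2f} hours ({hours * 60:.0f} minutes) a year")
```

Each extra nine cuts the allowance by a factor of ten: roughly 8.8 hours, then about 53 minutes, then about 5 minutes.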

flowchart TD
    U(["User request"]) --> LB("Load balancer: sends traffic to a healthy server")
    LB --> Z1("Zone A: app server + database copy")
    LB --> Z2("Zone B: app server + database copy")
    Z1 -. "Zone A fails" .-> X(["Traffic shifts to Zone B; service stays up"])
    Z2 --> X

Redundancy in practice: spread across independent zones so one failure doesn't reach the user. (Figure: Leaders Loop)

A worked example

Picture a fast-growing retailer whose online store runs on a single rented virtual machine: one server, one database, in one data centre. It's cheap and it works, right up until the Friday of a big sale, when that one machine falls over under load. The site is dark for three hours. The post-mortem blames "a cloud outage." It wasn't. The cloud did exactly what was asked of it; the architecture simply had one of everything, so one failure was a total failure.

The redesign doesn't buy a bigger, "more reliable" server; that's chasing perfection up the 100x cost curve. Instead it adds redundancy. The app now runs on two modest servers in two separate availability zones, behind a load balancer, a traffic cop that checks which servers are healthy and routes around any that aren't. The database is mirrored to a standby copy in the second zone. Now when a server dies mid-sale, the load balancer simply stops sending it traffic; customers never notice. The store survives the loss of an entire data centre.
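The load balancer's job in this redesign can be sketched in a few lines. This is a toy model, not how any particular product works: the `LoadBalancer` class, its method names and the zone names are hypothetical, and health is set by hand here, whereas a real load balancer probes its servers on a timer.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips servers marked unhealthy."""

    def __init__(self, servers: list[str]) -> None:
        self.health = {name: True for name in servers}
        self._rotation = cycle(servers)

    def mark_down(self, name: str) -> None:
        self.health[name] = False

    def mark_up(self, name: str) -> None:
        self.health[name] = True

    def route(self) -> str:
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.health)):
            candidate = next(self._rotation)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy servers left")


if __name__ == "__main__":
    lb = LoadBalancer(["zone-a", "zone-b"])
    print([lb.route() for _ in range(4)])   # traffic alternates between zones
    lb.mark_down("zone-a")                  # a server dies mid-sale
    print([lb.route() for _ in range(4)])   # every request now goes to zone-b
```

The error at the end of `route` is the single-server story from the start of this example: with only one machine in the list, the first failure leaves nowhere to send traffic.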

The numbers make the trade-off concrete (illustrative figures, to show the shape). Say the single server costs about $200 a month and the sale-day outage costs $80,000 in lost orders and refunds. The redundant setup might cost $550 a month, roughly $4,200 a year more. Weighed against even one avoided outage, the maths isn't close. But push it further: a fully multi-region, five-nines design might run $3,000 a month and shave the theoretical downtime from about an hour a year to about five minutes. For a retailer whose customers happily retry a page, paying more than five times as much to save under fifty minutes a year is the wrong call. Same principle, opposite answer, because the right amount of reliability is set by what failure actually costs you, not by how much you can buy.
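The same comparison can be run as expected annual cost: hosting for the year plus the outage cost weighted by how often you expect an outage. The monthly prices and the $80,000 outage cost come from the example above; the outage frequencies are assumptions invented for this sketch, and changing them changes the answer.

```python
OUTAGE_COST = 80_000  # lost orders and refunds per outage, from the example


def expected_annual_cost(monthly_hosting: float, outages_per_year: float) -> float:
    """Yearly hosting bill plus the expected cost of outages."""
    return 12 * monthly_hosting + outages_per_year * OUTAGE_COST


# name: (monthly hosting from the example, ASSUMED outages per year)
designs = {
    "single server": (200, 1.0),
    "two zones": (550, 0.05),
    "multi-region": (3_000, 0.01),
}

for name, (monthly, outages) in designs.items():
    print(f"{name}: ${expected_annual_cost(monthly, outages):,.0f} a year")
```

With these assumed frequencies the two-zone design comes out cheapest overall: the single server is sunk by its outage, and the multi-region design by its hosting bill.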

Frequently asked questions

Is the cloud always cheaper than running our own servers?

No, and assuming so is a classic budgeting error. Cloud is cheaper for spiky, unpredictable, or fast-growing workloads, because you pay only for what you use and can scale in minutes. For a steady, predictable, heavy workload running 24/7, owning hardware can work out cheaper over a few years. Several well-known companies, Dropbox and 37signals among them, have moved big workloads off the cloud to cut costs. The win isn't always price; it's flexibility, speed, and not tying up capital in machines.

What's the difference between a "region" and an "availability zone"?

A region is a geographic area (say, Sydney, or Frankfurt). Within it are several availability zones: physically separate data centres, far enough apart that a single fire, flood or power cut won't hit them all, but close enough to talk fast. Spreading across zones protects you from a local failure; spreading across regions protects you from a whole-region disaster, at higher cost and complexity. Most businesses start with multi-zone.

If it's all the cloud's servers, why do I still need engineers?

Because the cloud rents you capability, not decisions. Someone still has to choose the right layers, wire the redundancy, set up access controls, watch the costs, and respond when something breaks. The cloud removed the work of racking physical machines; it added the work of architecting and governing what runs on them. The shared-responsibility model is exactly the part that stays with your team.

Is "serverless" really running with no servers?

No, there are still servers; you just don't manage or even see them. "Serverless" means you hand the provider a small piece of code and it runs on demand, scaling up and down automatically, and you pay per execution rather than for a machine sitting idle. It's the rental-layer idea pushed to its limit: maximum convenience, minimum control. Great for unpredictable or bursty work; less suited to long-running, heavy jobs.
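The pay-per-execution trade-off has a simple break-even point, sketched below. Both prices are hypothetical round numbers chosen for illustration, and the model ignores how long each execution runs, which real serverless bills usually also depend on.

```python
# Hypothetical prices for illustration only; real pricing varies by provider.
ALWAYS_ON_MONTHLY = 60.00        # a small server, billed whether busy or idle
PRICE_PER_MILLION_CALLS = 2.00   # serverless, billed per execution


def serverless_monthly(calls: int) -> float:
    """Monthly serverless bill for a given number of executions."""
    return calls / 1_000_000 * PRICE_PER_MILLION_CALLS


def break_even_calls() -> int:
    """Monthly executions at which serverless costs the same as the server."""
    return int(ALWAYS_ON_MONTHLY / PRICE_PER_MILLION_CALLS * 1_000_000)


if __name__ == "__main__":
    for calls in (1_000_000, 30_000_000, 100_000_000):
        print(f"{calls:,} calls: serverless ${serverless_monthly(calls):,.2f} "
              f"vs always-on ${ALWAYS_ON_MONTHLY:,.2f}")
```

Below the break-even volume the idle server is the waste; above it, steady heavy traffic makes the per-execution pricing the expensive choice.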

Should we put everything with one provider, or spread across several?

Mostly, concentrate. Running well across multiple cloud providers is genuinely hard and expensive, and the redundancy you actually need (multiple zones and regions) is available within one provider. The real risk to weigh is lock-in: how painful would it be to leave? Keep your data portable and your core logic provider-neutral where you can, but resist multi-cloud as a default; treat it as a deliberate choice with a real reason, not an insurance policy you'll never collect on.

Related in the Toolkit

How the web works (browsers, DNS, HTTP, status codes): how a user's request actually finds and reaches the servers you've hosted.
Client-side (HTML, CSS, DOM, cookies): what runs in the user's browser, versus what runs on the hosting you've just read about.
Server-side (databases, APIs, services): the application logic and data that lives on the servers in your cloud architecture.
Programming & query language literacy: the code and queries your team deploys onto the layers you rent.
Monoliths vs microservices: how you split an application changes how you host and scale it.
Financial statements (P&L, balance sheet, cash flow): cloud shifts spend from up-front capital to ongoing operating cost; it shows up here.
Lean, Six Sigma, Kaizen & continuous improvement: reliability targets are a quality-level discipline, measured the way a factory measures defects.
Engineering productivity & delivery metrics (DORA): the metrics for whether your architecture lets teams ship and recover fast.

Where to go next

NIST, "The NIST Definition of Cloud Computing" (SP 800-145): two pages, vendor-neutral, the canonical definitions of IaaS, PaaS and SaaS. The clearest short reference there is.
Werner Vogels, "The Frugal Architect" keynote (AWS re:Invent 2023): Amazon's CTO on designing cloud systems for cost and sustainability, from someone who runs this at planetary scale. A practical complement to the reliability ideas here.
Google, Site Reliability Engineering (free, online): the industry's reference on running reliable systems; the "Embracing Risk" chapter is the one to read on why 100% is the wrong target.
AWS, the Shared Responsibility Model: the source for the "security of vs in the cloud" split; read it once and you'll never misjudge who's accountable again.