AI capabilities & limits: what LLMs and agents actually do

Ask a modern AI to draft a board paper and it will hand you something fluent, structured and plausible in seconds. Ask it which of last quarter's deals actually closed, and it may hand you a number with exactly the same confidence, and no idea whether it's true. That gap, between fluency and accuracy, is the whole story.

The quick version

A large language model (LLM) predicts the next most likely word. It is a probability engine, not a fact database: fluency and truth are separate things.
Its strengths (drafting, summarising, translating, reformatting, brainstorming) and its weaknesses (arithmetic, fresh facts, "I don't know") both fall out of that one design choice.
An agent is an LLM given tools and a goal so it can act in steps. That adds reach, and multiplies the places a wrong guess can do damage.
The leader's job isn't to predict where AI is good. It's to design the work so a confident wrong answer is cheap to catch.

The idea in depth: it's predicting words, not knowing things

Almost every system people mean by "AI" today (ChatGPT, Claude, Gemini, Copilot) sits on one architecture: the Transformer, introduced by Vaswani and colleagues at Google in their 2017 paper Attention Is All You Need. The mechanism is narrower than the marketing suggests. The model reads a stretch of text and predicts the next chunk (a token, roughly a word or part of a word), then the next, one piece at a time, each choice shaped by the statistical patterns it absorbed from a vast pile of training text.

That sounds too simple to produce something that can pass a bar exam. The surprise of the last few years is that, at enough scale, next-word prediction alone yields summarising, translating, coding and a convincing imitation of reasoning. But the mechanism never changed, and neither did its blind spot. In 2021, linguist Emily Bender, Timnit Gebru and co-authors named it in their influential paper On the Dangers of Stochastic Parrots: a language model is "a system for haphazardly stitching together sequences of linguistic forms ... according to probabilistic information about how they combine, but without any reference to meaning." The phrase is contested (plenty of researchers think it undersells what large models do), but as a warning label it has aged well. The practical upshot: treat AI output as a confident first draft to verify, never as a lookup you can trust unread.
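The next-word mechanism can be sketched with a toy model that simply counts which word followed which. This is an illustration under heavy assumptions, not how a Transformer is built: the three-sentence training text is made up, and the function names `train_bigrams` and `generate` are hypothetical.

```python
from collections import Counter, defaultdict

# Tiny made-up "training text"; a real model trains on vastly more data.
CORPUS = (
    "the deal closed last quarter . "
    "the deal closed last week . "
    "the deal slipped last quarter ."
)

def train_bigrams(text):
    """Count, for each word, which words followed it in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, max_words=6):
    """Greedily append the most frequent next word, one word at a time."""
    out = [start]
    for _ in range(max_words - 1):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # the model never saw this word, so it has nothing to continue with
        out.append(candidates.most_common(1)[0][0])
        if out[-1] == ".":
            break
    return " ".join(out)

model = train_bigrams(CORPUS)
print(generate(model, "the"))
```

The sentence it prints is fluent and sounds like a fact about a deal, yet nothing in the code ever checked a deal. It only reports which words most often came next.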

flowchart LR
  A(["Your prompt"]) --> B("Model predicts the next word, then the next")
  B --> C("Fluent, plausible text")
  C --> D{"Was it ever checked against reality?"}
  D -->|"No"| E(["Trust at your peril"])
  D -->|"Yes: tools, data, a human"| F(["Safe to rely on"])

Fluency is produced by design; truth has to be added on. Leaders Loop

Why it hallucinates, and why that's structural, not a bug to be patched

When a model states a false fact in a confident tone, the industry calls it a hallucination. It is tempting to assume the next version will simply fix this. The evidence says otherwise. In their 2025 paper Why Language Models Hallucinate, OpenAI researchers Adam Kalai, Ofir Nachum and colleagues argue that hallucination is baked into how these systems are trained and graded. Pre-training rewards plausibility, so a fluent guess scores well even when it's wrong. Worse, most benchmarks mark "I don't know" the same as a wrong answer, so a model that guesses outscores one that admits uncertainty. We have, in effect, trained these systems to bluff.

This is the honest limitation to sit with: a system optimised to always produce a fluent continuation has no native way to say "I'm not sure." The fix isn't a smarter model so much as a different scoreboard, and a workflow that makes abstention safe. So build "show your sources" and "say if you're unsure" into how you ask, and reward the people on your team who flag a confident wrong answer over the ones who just ship fast.
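The scoreboard argument comes down to a few lines of arithmetic. The sketch below compares the expected score of guessing with that of saying "I don't know" under two grading schemes. The penalty value and the function names are illustrative assumptions, not the paper's exact proposal.

```python
def expected_score(p_correct, wrong_penalty, abstain):
    """Expected score on one question: +1 if right, -wrong_penalty if wrong, 0 for "I don't know"."""
    if abstain:
        return 0.0
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

def best_policy(p_correct, wrong_penalty):
    """Pick whichever option has the higher expected score."""
    guess = expected_score(p_correct, wrong_penalty, abstain=False)
    return "guess" if guess > 0.0 else "abstain"

# A model that is only 30% sure of its answer.
print(best_policy(0.3, wrong_penalty=0.0))  # binary grading: wrong and "I don't know" both score 0
print(best_policy(0.3, wrong_penalty=1.0))  # a wrong answer costs as much as a right one earns
```

Under binary grading a guess can only help, so bluffing is the rational policy. Once a wrong answer costs something, a model that is 30% sure does better by abstaining.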

The same fragility shows up in reasoning. Apple researchers, in their 2024 study GSM-Symbolic, took grade-school maths problems and changed only the names and numbers. Performance dropped. Adding one irrelevant-but-plausible sentence to a word problem cut accuracy by as much as 65% across leading models, a tell that they are pattern-matching to training examples more than reasoning from first principles. It's worth holding the picture honestly: these tools are genuinely useful and improving fast, and the same study would probably look different on the newest models. But the shape of the weakness (brittleness under small, irrelevant changes) is the part to design around. (This is exactly why probabilistic vs deterministic systems is a distinction worth keeping straight: use the guessing machine for the guessing-shaped work, and a deterministic system, such as a calculator, a database query or a rules engine, for anything that has to be exactly right.)
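A team can run a rough version of this robustness check on its own use case: regenerate a problem with different names and numbers, compute the right answer deterministically, and compare accuracy. The sketch below is in the spirit of that study, not a reproduction of it. The `memorising_model` function is a hypothetical stand-in for a brittle model, and the template is invented for illustration.

```python
import random

TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have?"

def make_variant(rng):
    """Fill the template with fresh names and numbers; the correct answer is computed, not guessed."""
    name = rng.choice(["Maya", "Omar", "Lin"])
    a, b = rng.randint(10, 50), rng.randint(10, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b

def memorising_model(question):
    """Hypothetical stand-in for a brittle model: it only 'knows' one memorised example."""
    memorised = TEMPLATE.format(name="Maya", a=3, b=4)
    return 7 if question == memorised else 0

def accuracy(model, cases):
    """Share of (question, answer) pairs the model gets exactly right."""
    return sum(model(q) == answer for q, answer in cases) / len(cases)

original = [(TEMPLATE.format(name="Maya", a=3, b=4), 7)]
rng = random.Random(0)  # fixed seed so the variants are repeatable
variants = [make_variant(rng) for _ in range(20)]

print("original:", accuracy(memorising_model, original))
print("variants:", accuracy(memorising_model, variants))
```

In practice, `memorising_model` would be replaced by a call to the system under test. A large gap between the two numbers is the signal to design around.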

Agents: more reach, more places to be wrong

The current frontier is the agent: an LLM handed tools (a web browser, your calendar, a code runner, a database) and a goal, left to plan and act over multiple steps. This is a real capability jump: an agent can do, not just say. It is also where a single confident wrong guess stops being a bad sentence and becomes a wrong email sent, a wrong record updated, a wrong refund issued.

Anthropic's engineering guide Building Effective Agents (2024) makes a point leaders should borrow wholesale: most problems don't need an autonomous agent at all. A workflow (fixed steps, with the AI doing one well-scoped task at each) is more predictable, cheaper and easier to debug. Reserve true agents for tasks that genuinely need flexible, multi-step decision-making, and use the simplest thing that works. So the move is to ask, before automating anything: would a checklist with AI at one step be safer than turning a model loose? Usually, yes.

flowchart TD
  A(["A task to automate"]) --> B{"Are the steps known in advance?"}
  B -->|"Yes"| C(["Workflow: AI does one scoped step"])
  B -->|"No, needs judgement each step"| D{"Is a wrong action reversible and cheap?"}
  D -->|"Yes"| E(["Agent with a human checkpoint"])
  D -->|"No"| F(["Keep a human in the loop"])

Match autonomy to reversibility, not to ambition.

A worked example

Picture Maya, who runs customer operations for a mid-sized SaaS firm. Her team drowns in support tickets, and a vendor pitches an AI agent that reads each ticket, looks up the account, and issues refunds automatically. The promised numbers are tempting (the figures here are illustrative): 70% of tickets resolved with no human, hours saved every day.

The "better, not faster" version of Maya doesn't ask "is the AI good enough?" The jagged answer is that it's brilliant at some tickets and quietly wrong on others. She asks instead: where is a wrong answer expensive, and where is it cheap? A friendly reply the AI drafts badly costs a few seconds of a support rep's time to fix, so let it run. A refund it issues wrongly costs real money and is hard to claw back; that one is irreversible enough to keep a human in the loop. So she ships a workflow: the AI drafts every reply and proposes the action, auto-sends the low-stakes ones, and routes anything that moves money or closes an account to a person with the AI's reasoning attached. Same tool, a fraction of the risk, and the time saved is real because it's spent where mistakes are forgivable.
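Maya's routing rule is simple enough to sketch in a few lines. The action names, the `Proposal` fields and the `route` function are all hypothetical; a real helpdesk would define its own.

```python
from dataclasses import dataclass

# Assumed list of actions that are cheap to undo; everything else goes to a person.
LOW_STAKES_ACTIONS = {"send_reply"}

@dataclass
class Proposal:
    ticket_id: str
    draft_reply: str
    action: str      # what the AI proposes to do, e.g. "send_reply" or "issue_refund"
    reasoning: str   # the AI's stated reasoning, shown to the reviewer

def route(proposal):
    """Auto-send only explicitly low-stakes proposals; queue the rest for human review."""
    if proposal.action in LOW_STAKES_ACTIONS:
        return {"ticket": proposal.ticket_id, "route": "auto_send"}
    return {
        "ticket": proposal.ticket_id,
        "route": "human_review",
        "reasoning": proposal.reasoning,
    }

print(route(Proposal("T-1", "Here is how to reset your password.", "send_reply", "FAQ match")))
print(route(Proposal("T-2", "Your refund is on its way.", "issue_refund", "Duplicate charge")))
```

The sketch lists the safe actions rather than the dangerous ones, so an action nobody anticipated lands with a person by default instead of being sent automatically.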

The skill isn't predicting where AI is good. It's designing the work so a confident wrong answer is cheap to catch.

Frequently asked questions

Will the next model just fix the hallucinations?

Newer models hallucinate less, but the OpenAI calibration work suggests some rate of confident guessing is inherent to systems trained to always produce fluent output and graded in ways that punish "I don't know." Treat it as a property to manage, not a bug awaiting a patch.

Can I trust AI with numbers and data?

Not with the raw model alone: arithmetic and fresh facts are exactly where next-word prediction is weakest. Trust it when it's wired to a deterministic tool (a calculator, a SQL query, a verified document) and shows its working. Fluent-looking maths from an unaided chatbot is the classic trap.
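As a minimal sketch of what a deterministic tool means here, the calculator below evaluates plain arithmetic the same way every time and refuses anything else. How a given product hands expressions to such a tool varies and is not shown; the `calculate` function is a hypothetical example.

```python
import ast
import operator

# The only operations this calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculate(expression):
    """Evaluate +, -, *, / on plain numbers; raise ValueError for anything else."""
    def evaluate(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

print(calculate("1249 * 37 + 12"))
```

The model's job shrinks to choosing the expression. The number itself comes from code that cannot bluff.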

What's the difference between generative AI, an LLM and an agent?

Generative AI is the broad family of models that produce new content (text, images, code). An LLM is the text-and-reasoning member of that family. An agent is an LLM given tools and a goal so it can take actions in steps rather than just answering.

Are AI agents safe to deploy on real systems?

They can be, with guardrails: scope their tools tightly, log every action, and keep a human checkpoint on anything irreversible or expensive. The risk scales with autonomy times consequence: lower either and you lower the risk.
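Those three guardrails can be sketched as a thin wrapper around whatever tools an agent is given. The class, tool names and approval callback are hypothetical and not tied to any particular agent framework.

```python
class GuardedTools:
    """Wrap an agent's tools with an allowlist, an action log and a human checkpoint."""

    def __init__(self, tools, needs_approval, approve):
        self.tools = tools                    # name -> function the agent may call
        self.needs_approval = needs_approval  # names treated as irreversible or expensive
        self.approve = approve                # callback that asks a person; returns True or False
        self.log = []                         # every attempted action is recorded here

    def call(self, name, **kwargs):
        if name not in self.tools:
            self.log.append((name, kwargs, "blocked: not in scope"))
            raise PermissionError(f"tool {name!r} is not allowed")
        if name in self.needs_approval and not self.approve(name, kwargs):
            self.log.append((name, kwargs, "blocked: approval denied"))
            return None
        result = self.tools[name](**kwargs)
        self.log.append((name, kwargs, "done"))
        return result

# Hypothetical tools for illustration.
def lookup_account(account_id):
    return {"account_id": account_id, "plan": "pro"}

def issue_refund(account_id, amount):
    return f"refunded {amount} to {account_id}"

tools = GuardedTools(
    tools={"lookup_account": lookup_account, "issue_refund": issue_refund},
    needs_approval={"issue_refund"},
    approve=lambda name, kwargs: False,  # stand-in for a real review step that says no
)
print(tools.call("lookup_account", account_id="A-1"))
print(tools.call("issue_refund", account_id="A-1", amount=40))
print(tools.log)
```

The lookup runs freely, the refund waits on a person, and the log records both attempts, including the one that was blocked.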

Do I need to understand the technology to lead with it?

You don't need the maths. You do need the one mental model in this piece (fluency is not truth), because every good policy about where to trust AI, and where to keep a person, follows from it.

Related in the Toolkit

Machine learning concepts & utility: the broader family LLMs belong to, and where each kind of model earns its keep.
Probabilistic vs deterministic systems: the distinction that explains why a guessing machine can't be trusted with anything that must be exact.
Algorithmic bias, explainability & model risk: what to govern once these systems touch real decisions about real people.
Data strategy & data as an asset: AI is only as good as what you feed it; the data underneath is the moat.
Data governance, quality, lineage & stewardship: the controls that decide whether an AI answer is traceable or just plausible.
First principles vs heuristics vs analogical reasoning: why models pattern-match where humans should reason, and when that matters.
Reversible vs irreversible decisions: the test for how much autonomy to hand an agent.
Jobs-to-be-Done & needs research: staying honest about the problem before reaching for an AI solution.

Where to go next

Andrej Karpathy, Intro to Large Language Models (1hr talk): the clearest general-audience explanation of how LLMs work and where they break, from a leading researcher; no maths required.
Anthropic, Building Effective Agents: the practitioner guide to workflows vs agents; read it before anyone on your team proposes "let's just give it an agent."
Kalai et al., Why Language Models Hallucinate (2025): the readable case for why confident guessing is structural, and what a better scoreboard looks like.
Ethan Mollick, Co-Intelligence: Living and Working with AI (2024): the best book for a working leader; it popularised the "jagged frontier" and gives practical principles for using AI without being fooled by it.