Agent gotchas: how a demo dies in production

An AI agent demo is a magic trick. The inputs are scripted, the path is friendly, the room wants it to work, and it does, once. A production agent is an engineering problem: it has to be reliable on inputs nobody wrote, bounded in cost, correct in the tools it calls, safe in the actions it takes, and accountable to a human when it's wrong. The gap between the trick and the system is the entire job.

These are ten gotchas that killed agent projects Cleon has been called in to rescue, synthesized from the official guidance, the corrections the practitioner community learned the hard way, and our own production experience where both leave gaps. Each is paired with the question to answer before you ship and the cost of getting it wrong. The framing isn't "what's the right answer"; it's "what's the question you have to be ready to defend." None of this is toolkit-specific: an agent built on Agentforce, on LangGraph and the Claude API, or wired together over MCP inherits every one of these.

The gotchas

1. The demo-to-production gap — a polished demo proves nothing about reliability

A great demo proves the agent can do the thing once, on an input someone chose. Production asks whether it reliably does the thing every time, on inputs nobody scripted, at the real audience size. Those are different questions, and the distance between them is where the work lives. The demo is the start of the project, not the end of it.

The cost of confusing the two is an estimate off by a quarter and a launch that fails in week one on the inputs the demo never showed. The question to answer the moment someone says "ship that": what does this do on the messy, adversarial, half-empty input the demo carefully avoided?

2. No eval harness — you can't improve or trust what you don't measure

"It looked good in three tries" is a vibe, not a test. An evaluation set — real cases with known-good outcomes you can score on every change — is the difference between improving an agent and guessing at it. Without one, every prompt edit is a coin flip you can't grade, and every regression ships silently.

The cost is a system that drifts in the dark: you can't tell a fix from a regression, and you find out which it was from a user. The question: what is your eval set, and does it run on every change before it ships? Build it from the first ten real failures — they encode how your problem actually breaks.
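
One way to picture the harness is the minimal sketch below. It assumes the agent can be called as a plain function from input string to output string and that exact-match scoring is good enough, which real eval sets often outgrow. `EvalCase`, `run_evals`, `gate`, and `toy_agent` are hypothetical names for illustration, not part of any toolkit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    input: str
    expected: str  # known-good outcome for this case


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Score the agent on every case and report which ones failed."""
    failures = []
    for case in cases:
        try:
            output = agent(case.input)
        except Exception as exc:  # a crash counts as a failure, not a skip
            output = f"<error: {exc}>"
        if output.strip() != case.expected:
            failures.append(case.name)
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures}


def gate(report: dict, min_pass_rate: float = 1.0) -> bool:
    """Return True only if the change is allowed to ship."""
    return report["pass_rate"] >= min_pass_rate


# Hypothetical stand-in for a real agent, used only to exercise the harness.
def toy_agent(text: str) -> str:
    return text.upper()


CASES = [
    EvalCase("uppercases", "refund", "REFUND"),
    EvalCase("tolerates padding", "  padded ", "PADDED"),
    EvalCase("known failure", "refund order 12", "Refund order 12"),
]

report = run_evals(toy_agent, CASES)
print(report["failures"], gate(report))
```

Running the same cases on every prompt or tool change, and blocking the release when `gate` returns False, is what turns "it looked good" into a regression check.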

3. Runaway loops and unbounded cost — an agent that re-plans forever

An agent that plans, acts, observes, and re-plans has a failure mode a single prompt doesn't: it can loop. With no step budget and no token cap, a confused agent re-plans until it fills the context window or exhausts the budget, and a single stuck run can cost more than a thousand good ones.

The cost of skipping the caps is a runaway that burns budget and latency on a single request. The question: what is the maximum number of steps and tokens one run may consume, and what happens when it hits that wall?
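
A hard cap can be as small as the sketch below. It assumes each agent step is a hypothetical `step_fn` that returns `(done, tokens_used, result)`; the function name and the result shape are illustrative, not any framework's API.

```python
from typing import Any, Callable, Tuple

StepFn = Callable[[], Tuple[bool, int, Any]]


def run_agent(step_fn: StepFn, max_steps: int, max_tokens: int) -> dict:
    """Run steps until the agent finishes or a budget is spent."""
    steps = 0
    tokens = 0
    # Budgets are checked before each step, so the token cap can be
    # overshot by at most the cost of one step.
    while steps < max_steps and tokens < max_tokens:
        done, used, result = step_fn()
        steps += 1
        tokens += used
        if done:
            return {"status": "ok", "result": result,
                    "steps": steps, "tokens": tokens}
    # Hitting the wall is an explicit outcome the caller must handle.
    return {"status": "budget_exhausted", "result": None,
            "steps": steps, "tokens": tokens}


# A step that never finishes stands in for a confused agent.
def stuck_step():
    return (False, 100, None)


outcome = run_agent(stuck_step, max_steps=5, max_tokens=10_000)
print(outcome["status"], outcome["steps"])
```

The important design choice is that exhaustion returns a distinct status instead of raising or silently returning the last partial result, so the caller has to decide what happens at the wall: hand off to a human, return a fallback, or fail the request.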

4. Hallucinated tool calls — the model invents an argument or calls the wrong tool

An agent decides which tool to call and with what arguments, and the model can get both wrong: invent a parameter that doesn't exist, pass a malformed value, or call the wrong tool entirely. If nothing validates the call against the tool's schema before it runs, a hallucinated argument executes as if it were real.

The cost is an action taken on garbage input — a record updated with a value the model made up, a query run against a field that doesn't exist. The question: is every tool call validated against a strict schema before execution, and what does the agent do when validation fails — retry, ask, or stop?
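
A rough sketch of that validation step follows, using only the standard library. It assumes tool calls arrive as a dict with `tool` and `arguments` keys; that shape, the `update_contact_email` tool, and the type-only schema are hypothetical. A production system would more likely use a full schema validator.

```python
# Hypothetical registry: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "update_contact_email": {"contact_id": str, "email": str},
}


def validate_tool_call(call: dict) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    name = call.get("tool")
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments must be an object"]
    problems = []
    # Strict mode: arguments the schema does not declare are rejected.
    for key in sorted(args.keys() - schema.keys()):
        problems.append(f"unexpected argument: {key}")
    for key, expected_type in schema.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected_type):
            problems.append(f"{key} must be {expected_type.__name__}")
    return problems


hallucinated = {
    "tool": "update_contact_email",
    "arguments": {"contact_id": "c1", "priority": "high"},
}
print(validate_tool_call(hallucinated))
```

Returning the problems as data, instead of raising, leaves the retry, ask, or stop decision to the caller, and the same messages can be fed back to the model if the policy is to retry.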

5. Over-broad tools — a tool that can delete, send, or charge is a blast radius

The moment an agent can act, the stakes change from "wrong answer" to "wrong action." A tool scoped wider than the job — one that can delete any record when it only ever needs to update one field, or send to anyone when it only ever messages the current contact — is a blast radius waiting for a bad plan to find it. Least privilege isn't optional once the agent has hands.

The cost is the worst thing the tool can do, because eventually a confused agent will try it. The question for every tool: what is the narrowest scope that still does the job, and have you actually constrained it to that — not just trusted the model to stay inside?
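
As an illustration of constraining scope in code instead of in the prompt, the sketch below wraps a broad client in a tool that can change one field on one record. `CrmClient`, `UpdateStatusTool`, and the allowed values are hypothetical stand-ins, not a real CRM API.

```python
class CrmClient:
    """Hypothetical broad client: can update or delete any record."""

    def __init__(self) -> None:
        self.records = {
            "c1": {"email": "a@example.com", "status": "active"},
            "c2": {"email": "b@example.com", "status": "active"},
        }

    def update(self, record_id: str, field: str, value: str) -> None:
        self.records[record_id][field] = value

    def delete(self, record_id: str) -> None:
        del self.records[record_id]


class UpdateStatusTool:
    """Narrow tool handed to the agent: one record, one field, fixed values."""

    ALLOWED_VALUES = frozenset({"active", "paused"})

    def __init__(self, client: CrmClient, contact_id: str) -> None:
        self._client = client
        self._contact_id = contact_id  # bound by the application, not the model

    def __call__(self, status: str) -> str:
        if status not in self.ALLOWED_VALUES:
            raise ValueError(f"status must be one of {sorted(self.ALLOWED_VALUES)}")
        self._client.update(self._contact_id, "status", status)
        return f"{self._contact_id} status set to {status}"


client = CrmClient()
tool = UpdateStatusTool(client, contact_id="c1")
print(tool("paused"))
```

The model only chooses the `status` value. The record, the field, and the absence of a delete path are fixed by the wrapper, so a bad plan has nothing wider to reach for. In a real deployment the same narrowing should also be enforced by the credentials behind the client.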

6. No kill switch or human gate on consequential actions

Reads are recoverable; sends, charges, and deletes are not. An agent that can take an irreversible action with no human in the loop and no way to halt it mid-run is one bad plan away from an incident you can't undo. The fix is two controls: an approval gate on anything consequential, and a kill switch that stops the agent now.

The cost of missing them is an irreversible action you watch happen with no way to intervene. The question: which of this agent's actions are irreversible, which of those require human approval before they fire, and can an operator stop a running agent on demand?
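
The two controls can be sketched together as below. This assumes a plan is a list of `(action_name, callable)` pairs, that `approve` is a hypothetical callback that asks a human and returns a bool, and that the kill switch is a `threading.Event` an operator can set from another thread. The action names are illustrative.

```python
import threading
from typing import Callable, List, Tuple

# Hypothetical names of actions that cannot be undone.
IRREVERSIBLE = frozenset({"send_email", "charge_card", "delete_record"})

Plan = List[Tuple[str, Callable[[], str]]]


def run_plan(plan: Plan, approve: Callable[[str], bool],
             stop: threading.Event) -> list:
    """Execute a plan with a kill switch check and an approval gate."""
    results = []
    for name, action in plan:
        if stop.is_set():  # kill switch: checked before every action
            results.append((name, "halted"))
            break
        if name in IRREVERSIBLE and not approve(name):
            results.append((name, "rejected"))  # gate: a human said no
            continue
        results.append((name, action()))
    return results


sent = []
plan = [
    ("read_record", lambda: "read ok"),
    ("send_email", lambda: sent.append("msg") or "sent"),
]
stop = threading.Event()

print(run_plan(plan, approve=lambda name: False, stop=stop))
```

Note the limit of this sketch: the switch is honored between actions, not inside one, so a long-running action still needs its own timeout or cancellation.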

7. Context overflow — stuffing everything into the prompt degrades reasoning

More context does not make an agent smarter. Past a point it makes it worse: the signal the model needs drowns in the history, tool output, and "just in case" documents you piled in — context rot. A long-running agent that accumulates every step's output in the prompt degrades the longer it runs, exactly when you need it sharpest.

The cost is an agent that reasons worse the deeper into a task it gets, for reasons that don't show up in any error. The question: what does this step actually need in context, and how do you keep the window curated — summarizing, dropping, or retrieving on demand — instead of letting it grow unbounded?
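
One simple curation policy, sketched under assumptions: keep the task and the most recent steps verbatim, and collapse everything older into a single summary entry. The `summarize` callback is hypothetical; in practice it might be a model call or a retrieval index, and the placeholder here only counts steps.

```python
from typing import Callable, List, Optional


def curate_context(task: str, history: List[str], keep_recent: int = 3,
                   summarize: Optional[Callable[[List[str]], str]] = None) -> List[str]:
    """Build a bounded context: task, one summary line, recent steps."""
    if summarize is None:
        # Placeholder summarizer; a real one would preserve key facts.
        summarize = lambda items: f"[summary of {len(items)} earlier steps]"
    split = max(len(history) - keep_recent, 0)
    older, recent = history[:split], history[split:]
    context = [task]
    if older:
        context.append(summarize(older))
    context.extend(recent)
    return context


history = [f"step {i} output" for i in range(10)]
context = curate_context("Task: reconcile the invoice", history)
print(context)
```

However long the run gets, the context built this way holds at most `keep_recent + 2` entries, so the window stops growing with every step.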

8. Grounding gaps — the agent answers confidently from no or wrong retrieval

An agent reads the same data a human would, through retrieval. If retrieval returns nothing, or the wrong chunks, the model doesn't stop — it answers from its parameters and defends the guess fluently. A clever prompt over no grounding is a confident hallucination, and it sounds exactly like a correct answer.

The cost is a confidently wrong answer, which is worse than no answer because someone acts on it. The question: what grounds this agent, what happens when retrieval comes back empty, and can the agent say "I don't know" instead of inventing?
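
A minimal sketch of an "I don't know" path follows. It assumes retrieval returns scored chunks with relevance in the range 0 to 1 and that `retrieve` and `generate` are hypothetical callables supplied by the application. The 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # retrieval relevance, assumed to be in [0, 1]


NO_ANSWER = "I don't know: no supporting source was found."


def answer_with_grounding(question: str,
                          retrieve: Callable[[str], List[Chunk]],
                          generate: Callable[[str, List[Chunk]], str],
                          min_score: float = 0.5) -> str:
    """Only call the model when retrieval produced usable evidence."""
    chunks = [c for c in retrieve(question) if c.score >= min_score]
    if not chunks:
        # Refuse instead of letting the model answer from its parameters.
        return NO_ANSWER
    return generate(question, chunks)


# Hypothetical stand-ins to exercise both paths.
def weak_retrieve(question: str) -> List[Chunk]:
    return [Chunk("unrelated release notes", 0.12)]


def fake_generate(question: str, chunks: List[Chunk]) -> str:
    return f"Based on {len(chunks)} source(s): {chunks[0].text}"


print(answer_with_grounding("What is the refund window?", weak_retrieve, fake_generate))
```

The check sits outside the model on purpose: the decision to refuse does not depend on the model noticing that its own context is empty.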

9. Non-determinism shipped without a fallback — same input, different output, no safety net

The same input can produce a different output tomorrow. That's fine for a draft and dangerous for anything a customer sees or a downstream system consumes unreviewed. An agent shipped with no validation step and no deterministic fallback will eventually render a model failure as a blank, an error string, or a hallucination in front of someone.

The cost is a visible failure with no floor under it — the bad output goes straight through. The question: what is the deterministic fallback when the agent fails or returns something invalid, and does every generation pass a validation gate before it's used? A boring correct fallback beats an exciting wrong answer every time.
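
The gate and the fallback can be sketched in a few lines. This assumes the agent is asked to return JSON with `status` and `message` fields; that contract and the fallback wording are hypothetical.

```python
import json

# Deterministic floor: what the user sees when generation is unusable.
FALLBACK = {
    "status": "unavailable",
    "message": "We could not generate a summary. A team member will follow up.",
}


def validated_or_fallback(raw) -> dict:
    """Pass model output through a validation gate, else use the fallback."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(FALLBACK)  # blank, prose, or None instead of JSON
    if not isinstance(data, dict):
        return dict(FALLBACK)
    message = data.get("message")
    if data.get("status") != "ok" or not isinstance(message, str) or not message.strip():
        return dict(FALLBACK)
    return {"status": "ok", "message": message.strip()}


print(validated_or_fallback('{"status": "ok", "message": " Order shipped. "}'))
print(validated_or_fallback("Sorry, I ran into an error."))
```

Nothing reaches the customer or the downstream system unless it passed the gate, so a model failure degrades to the boring fallback instead of a blank or an error string.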

10. The accountability gap — "the AI did it" with no human owning the outcome

An agent does not move accountability to the model. Someone owns each consequential outcome: they can explain it, defend it, and answer for it when it's wrong. Build the agent so that person exists, knows it's them, and has the controls — review, override, kill switch — to act. "The agent decided to" is not an incident report anyone wants to write.

The cost is an incident with no owner, which is how a fixable mistake becomes an organizational one. The question: for every consequential thing this agent can do, who is the human accountable for it, and do they have the visibility and controls to own that responsibility in practice?

The throughline across all ten: an agent that ships is grounded, evaluated, bounded, governed, and owned — or it's a demo that hasn't failed yet. Every gotcha above is a place the magic trick and the engineering diverge, and production finds all of them. The work that makes an agent survive Monday morning is the same work the demo skipped to look effortless.

Closing

These ten are the failures Cleon has seen kill agent projects most often, across both Salesforce-native and external builds. The shared theme is the one that runs through all of AI engineering: the demo makes the easy path look like the whole path, and production is everything the easy path left out. None of them is hard to prevent up front; all of them are expensive to discover live.

The discipline that prevents most of them is written down — see the Agent Style Guide, the bar an agent has to clear before it ships. If you want the grounding side of the story, the Data 360 agent-readiness check is where a clean model meets a safe agent.

If an agent gotcha bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.

What is an agent — the definition before the gotchas
Orchestration patterns — bounding loops and composing steps
Tools and actions — least privilege, schemas, and gates
Debugging agents — tracing a run when it goes wrong
Agent Style Guide — the bar an agent clears before it ships
Data 360 agent-readiness check — grounding the agent on a clean model
