Debugging agents: tracing a run when it goes wrong
An agent failed in production and you have to fix it. The move that makes that possible is the one most teams skip: trace first, theorize second, because you cannot fix what you cannot replay. This is a symptom-driven playbook for five ways an agent goes wrong (runaway loops, wrong-tool calls, confident-wrong answers, silent degradation, and slow or expensive runs), each with what to read in the trace, the fix, and the eval case that stops it from coming back.
An agent failed in production and someone is waiting on the fix. The instinct is to open the prompt and start guessing — and it is the wrong instinct, because an agent's path is non-deterministic and you are about to debug a run you can't see. The move that changes everything is the one most teams skip under pressure: you cannot fix what you cannot replay. Trace first, theorize second. With the full trace of the failing run in front of you — the inputs, the retrieved context, every tool call and its result, the final output — the bug almost always names itself. Without it, you are tuning a prompt against a story you made up about what happened (principle 11).
This page assumes you traced everything before you shipped — if you didn't, the first fix is to add tracing and reproduce, because nothing below works on a run you can't replay. It is symptom-driven: find the failure that matches what you're seeing, read the trace at the layer the section points to, apply the fix, and — the step that separates debugging from whack-a-mole — turn the failing run into an eval case so the regression can't return. None of this is toolkit-specific: an agent on Agentforce, one on LangGraph and the Claude API, or one wired over MCP fails in the same five shapes, and the trace is where you catch all of them.
The loop: trace → reproduce → eval-driven fix
Every debug session below runs the same three steps, in this order:
- Trace. Pull the full record of the failing run — inputs, retrieved context, the tool calls and their arguments and results, the model's reasoning where you capture it, the final output. If you can't, you have a tracing gap to fix before anything else; an agent shipped without a trace is one you debug by superstition.
- Reproduce. Replay the run from the trace until you can make the failure happen on demand. A bug you can't reproduce is a bug you can't confirm you fixed — and an agent's non-determinism means "it worked when I tried it" proves nothing.
- Eval-driven fix. Add the failing case to the eval set before you change anything, watch it fail, apply the fix, watch it pass, and run the whole set to confirm you didn't break a neighbor. The eval case is what makes the fix permanent — without it, the same failure walks back in on the next prompt edit or model update and nobody notices (principle 3).
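The three steps above can be sketched in a few lines. This is a minimal illustration, not a prescribed harness: `Trace`, `EvalCase`, `eval_case_from_trace`, `run_eval_set`, and the two stand-in agents are hypothetical names, and a real agent is non-deterministic, so in practice you would likely run each case several times rather than once.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Trace:
    """Hypothetical record of one run; the field names are illustrative."""
    run_id: str
    inputs: dict[str, Any]
    retrieved_context: list[str]
    tool_calls: list[dict[str, Any]]
    final_output: str


@dataclass
class EvalCase:
    name: str
    inputs: dict[str, Any]
    check: Callable[[str], bool]  # returns True when the output is acceptable


def eval_case_from_trace(trace: Trace, check: Callable[[str], bool]) -> EvalCase:
    """Freeze the failing run's inputs into a named regression case."""
    return EvalCase(name=f"regression-{trace.run_id}", inputs=trace.inputs, check=check)


def run_eval_set(agent: Callable[[dict[str, Any]], str],
                 cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case through the agent and report pass/fail by case name."""
    return {case.name: case.check(agent(case.inputs)) for case in cases}


# Stand-ins for the agent before and after the fix (assumptions for the demo).
def agent_before_fix(inputs: dict[str, Any]) -> str:
    return "Refund approved for order 0000"


def agent_after_fix(inputs: dict[str, Any]) -> str:
    return f"Refund approved for order {inputs['order_id']}"


failing_trace = Trace(
    run_id="run-42",
    inputs={"order_id": "A-17"},
    retrieved_context=[],
    tool_calls=[],
    final_output="Refund approved for order 0000",
)
case = eval_case_from_trace(failing_trace, check=lambda out: "A-17" in out)

before = run_eval_set(agent_before_fix, [case])  # add the case, watch it fail
after = run_eval_set(agent_after_fix, [case])    # apply the fix, watch it pass
```

The order matters: the case is built from the trace and run against the unfixed agent first, so a pass afterwards is evidence about the fix rather than about the test.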
It loops or runs away
How to recognize it. A run that never finishes, a latency spike, or a single request that costs far more than the rest. The agent plans, acts, observes, and re-plans — forever.
What to read in the trace. Walk the steps and find where it re-plans. You are looking for one of two shapes: the agent repeats the same tool call with the same arguments and gets the same result it can't use, or it oscillates between two plans without converging. Either way, the trace shows the model never reaching a state it reads as "done": nothing it observes lets it conclude the task is complete.
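Both shapes can be spotted mechanically once the steps are in hand. The sketch below assumes each traced step is a dict with `tool` and `arguments` keys; that layout, and the helper names, are assumptions for illustration rather than any tracing tool's format.

```python
import json
from collections import Counter


def call_signature(step):
    """Identify a call by tool name plus canonicalized arguments."""
    return step["tool"], json.dumps(step["arguments"], sort_keys=True)


def find_repeated_calls(steps, min_repeats=3):
    """Return signatures that appear at least min_repeats times in the run."""
    counts = Counter(call_signature(step) for step in steps)
    return {sig: n for sig, n in counts.items() if n >= min_repeats}


def ends_in_oscillation(steps, min_cycles=2):
    """True if the run's tail alternates between two distinct calls (A, B, A, B)."""
    needed = 2 * min_cycles
    sigs = [call_signature(step) for step in steps]
    if len(sigs) < needed:
        return False
    tail = sigs[-needed:]
    first, second = tail[0], tail[1]
    if first == second:
        return False
    return all(sig == (first if i % 2 == 0 else second) for i, sig in enumerate(tail))


stuck = [{"tool": "search", "arguments": {"q": "x"}} for _ in range(3)]
flip_flop = [
    {"tool": "plan_a", "arguments": {}},
    {"tool": "plan_b", "arguments": {}},
    {"tool": "plan_a", "arguments": {}},
    {"tool": "plan_b", "arguments": {}},
]
```

Here `find_repeated_calls(stuck)` flags the repeated `search` call and `ends_in_oscillation(flip_flop)` is true. Neither check says why the agent is stuck; they only point at the step to read.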
The fix. Two parts. First, a hard step cap, a token budget, and a timeout enforced in code, not requested in the prompt — an agent asked nicely to "be efficient" still spirals; an agent that hits its ceiling stops (gotcha 3). The cap is the seatbelt: it bounds the cost of the failure while you fix the cause. Second, the cause — usually a missing or unreachable stop condition. The agent loops because nothing it can observe tells it the task is complete, so give it an explicit, checkable definition of done, or a tool whose result it can read as terminal.
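A minimal sketch of the seatbelt, assuming the agent exposes a single-step function that reports whether it finished and how many tokens it used. `run_with_limits` and the `step_fn` contract are hypothetical, and the timeout here is only checked between steps, so a step that hangs internally needs its own timeout.

```python
import time


def run_with_limits(step_fn, max_steps=10, max_tokens=20_000, timeout_s=30.0,
                    clock=time.monotonic):
    """Drive an agent loop under hard ceilings enforced in code.

    step_fn(step_index) must return (done, tokens_used) for that step.
    The returned dict records why the run stopped, not just that it did.
    """
    start = clock()
    tokens = 0
    for step in range(max_steps):
        if clock() - start > timeout_s:
            return {"stopped": "timeout", "steps": step, "tokens": tokens}
        done, used = step_fn(step)
        tokens += used
        if done:
            return {"stopped": "done", "steps": step + 1, "tokens": tokens}
        if tokens > max_tokens:
            return {"stopped": "token_budget", "steps": step + 1, "tokens": tokens}
    return {"stopped": "step_cap", "steps": max_steps, "tokens": tokens}


# A step function that never finishes: the cap stops it.
runaway = run_with_limits(lambda i: (False, 100), max_steps=5)

# A step function that finishes on its third step.
finished = run_with_limits(lambda i: (i == 2, 100), max_steps=5)
```

Recording the stop reason is the design choice that matters: it lets an eval tell a run that finished apart from one that merely hit the wall.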
Make it stick. Add the looping input to the eval set and assert two things: the run terminates within the step cap, and it terminates because it finished, not because it hit the wall. A run that only ever stops at the ceiling is still broken — the cap caught it, the cause is still there.
It calls a wrong or invented tool
How to recognize it. An action taken on garbage — a record updated with a value the model made up, a query run against a field that doesn't exist, the wrong tool fired for the situation. Or a call that errors at execution because an argument is malformed.
What to read in the trace. Find the tool call and read it literally: the tool name, every argument, and the value of each. Compare it against the tool's schema. The model invents a parameter that isn't there, fills a real one with a plausible-but-wrong value, or picks a tool whose description it misread. The trace shows you exactly which — and shows whether anything validated the call before it ran.
The fix. Two layers, both covered in tools and actions. First, validate every tool call against a strict schema before execution, and decide deliberately what happens when validation fails — retry, ask, or stop — so a hallucinated argument never reaches the real action (gotcha 4). Second, if the model picked the wrong tool or filled it wrong, the fix is usually the words you wrote: the model reasons over the tool's name and description, nothing else, so a vague description — "updates the account" — invites misuse. Tighten it to say what the tool does, when to use it, when not to, and what each argument means. An enum on a field the model kept inventing values for ends that class of bug outright.
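The validation layer can be as small as the sketch below. It is hand-rolled on the standard library to stay self-contained; the `TOOLS` registry, the `update_account_status` tool, and its fields are invented for illustration, and a production system would more likely use a schema library.

```python
# Hypothetical tool registry: one tool, strict argument rules, an enum on status.
TOOLS = {
    "update_account_status": {
        "account_id": {"type": str, "required": True},
        "status": {"type": str, "required": True,
                   "enum": {"active", "suspended", "closed"}},
    },
}


def validate_tool_call(name, arguments, tools=TOOLS):
    """Return a list of validation errors; an empty list means safe to execute."""
    if name not in tools:
        return [f"unknown tool: {name}"]
    schema = tools[name]
    errors = []
    # Reject parameters the model invented.
    for param in arguments:
        if param not in schema:
            errors.append(f"unknown argument: {param}")
    # Check required parameters, types, and enum membership.
    for param, rule in schema.items():
        if param not in arguments:
            if rule.get("required"):
                errors.append(f"missing required argument: {param}")
            continue
        value = arguments[param]
        if not isinstance(value, rule["type"]):
            errors.append(f"wrong type for {param}: {type(value).__name__}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"invalid value for {param}: {value!r}")
    return errors


good = validate_tool_call("update_account_status",
                          {"account_id": "001", "status": "suspended"})
bad = validate_tool_call("update_account_status",
                         {"account_id": "001", "status": "paused", "reason": "x"})
```

`good` is empty and `bad` lists both the invented `reason` argument and the out-of-enum `status`. What to do with a non-empty list (retry, ask, or stop) remains the deliberate decision the text describes; the validator only guarantees the call never runs unchecked.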
Make it stick. Add the input that produced the bad call to the eval set and assert the agent now calls the right tool with valid arguments — or correctly declines. If you tightened a description, the eval is what proves the new wording fixed the real case and didn't quietly break an adjacent one.
It's confidently wrong
How to recognize it. The agent states something false, fluently, and acts as if it's certain. This is the most dangerous failure because it looks exactly like a correct answer — the only tell is that the content is wrong, and someone may act on it before anyone notices.
What to read in the trace. Suspect retrieval before the model. Find what the agent was actually given to read this run — the retrieved chunks, the grounded records, the tool results it reasoned over — and check whether the right answer was even in there. A model handed the wrong three chunks will defend the wrong answer beautifully; the failure is upstream of the reasoning, in what grounding returned (gotcha 8, principle 2). Read the retrieved context before you read the prompt.
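A first-pass check for "was the right answer even in there" can be automated. This sketch uses case-insensitive substring matching, which is a crude stand-in for a real relevance check; `grounding_contains` and the sample data are hypothetical.

```python
def grounding_contains(retrieved_chunks, required_facts):
    """Report, per required fact, whether it appears anywhere in the retrieved text."""
    haystack = " ".join(retrieved_chunks).lower()
    return {fact: fact.lower() in haystack for fact in required_facts}


# Chunks copied from the failing run's trace (invented here for the demo).
chunks = [
    "Standard plan: 5 seats, email support.",
    "Billing cycles renew on the first of the month.",
]
coverage = grounding_contains(chunks, ["Enterprise plan", "5 seats"])
```

Here `coverage` shows "5 seats" present and "Enterprise plan" absent. If the question was about the Enterprise plan, the model never had the answer, and the fix belongs in retrieval.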
The fix. If retrieval came back empty or wrong, fix retrieval, not the prompt — the query, the index, the grounding source, the data model under it. A clever prompt over no grounding is a confident guess; tuning the prompt to sound less wrong leaves the agent still wrong. And make "I don't have that" a path the agent is allowed to take: an agent that can't say "I don't know" will invent, every time retrieval fails it. If a human analyst couldn't answer the question from the same retrieved context, the bug is the grounding, not the agent — the agent inherited the gap.
Make it stick. Add the case to the eval set with the known-good outcome, and score the answer against the facts — not against how confident it sounds. Include at least one case where the right behavior is "I don't know," so a regression that brings the confident hallucination back fails loudly.
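One way to score against facts rather than tone, including the case where abstaining is the correct answer. The marker phrases and `score_answer` are assumptions for illustration; a real scorer would likely need a more robust abstention check than phrase matching.

```python
# Hypothetical phrases that count as the agent declining to answer.
ABSTAIN_MARKERS = ("i don't know", "i don't have that")


def is_abstention(answer):
    """True if the answer contains one of the abstention phrases."""
    normalized = answer.lower().replace("\u2019", "'")  # tolerate curly apostrophes
    return any(marker in normalized for marker in ABSTAIN_MARKERS)


def score_answer(answer, expected_facts=(), should_abstain=False):
    """Pass only if the content is right; confident wording earns nothing."""
    if should_abstain:
        return is_abstention(answer)
    if is_abstention(answer):
        return False  # an answer was expected and the agent declined
    lowered = answer.lower()
    return all(fact.lower() in lowered for fact in expected_facts)


# A fluent, confident, wrong answer fails the factual case.
confident_wrong = score_answer(
    "The Enterprise plan definitely includes 50 seats.",
    expected_facts=["200 seats"],
)
# A hallucination where "I don't know" was the right behavior also fails.
invented = score_answer("The limit is 10 GB.", should_abstain=True)
declined = score_answer("I don't have that information.", should_abstain=True)
```

Only `declined` passes. The abstain case is the one that fails loudly if a later change brings the confident hallucination back.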
It degrades silently
How to recognize it. Same code, worse output — over days or weeks, with nothing in the logs to point at. Quality slips and no error fires. The usual causes are a model update underneath you, drift in the data the agent grounds on, or a slow change in the inputs users actually send.
What to read in the trace. This is the failure tracing exists for, because there is no error to catch it (principle 11). Compare a recent failing run against an older passing one on a similar input: same retrieved context? same tool results? same model version? The diff between a run that worked and one that doesn't, both captured in full, is what localizes a degrade that no exception ever announced.
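The comparison can be sketched as a layer-by-layer diff of two captured runs. The trace layout (a dict with `model_version`, `retrieved_context`, and `tool_results` keys) and the version strings are assumptions; substitute whatever your tracing actually records.

```python
def diff_traces(passing, failing,
                layers=("model_version", "retrieved_context", "tool_results")):
    """Return only the layers whose captured values differ between the two runs."""
    return {
        layer: {"passing": passing.get(layer), "failing": failing.get(layer)}
        for layer in layers
        if passing.get(layer) != failing.get(layer)
    }


# Two runs on a similar input, both captured in full (values invented).
old_run = {
    "model_version": "model-v1",
    "retrieved_context": ["Returns accepted within 30 days."],
    "tool_results": [{"tool": "lookup_policy", "result": "30 days"}],
}
new_run = {
    "model_version": "model-v1",
    "retrieved_context": ["Returns accepted within 14 days."],
    "tool_results": [{"tool": "lookup_policy", "result": "30 days"}],
}
moved = diff_traces(old_run, new_run)
```

In this example only `retrieved_context` appears in `moved`, which points at data drift rather than a model update.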
The fix. Regression evals are the real fix, and they run before a user finds the problem. A fixed eval set scored on a schedule, and on every model or prompt change, turns a silent degrade into a failing test the day it starts (gotcha 2, principle 3). When the set's score drops, the trace diff tells you which layer moved: a model update means re-validate against the suite and pin or migrate deliberately; data drift means the grounding source changed and retrieval needs attention; input drift means real usage moved past what your eval set covers, so the set needs new cases.
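A scheduled run only helps if something compares it to a baseline. A minimal sketch of that comparison, assuming eval results are stored as a mapping from case name to pass/fail; `detect_regression` and the tolerance value are hypothetical choices.

```python
def pass_rate(results):
    """Fraction of cases passing; an empty set counts as 0.0."""
    return sum(results.values()) / len(results) if results else 0.0


def detect_regression(baseline, current, tolerance=0.02):
    """Flag a drop in pass rate, or any case that used to pass and no longer does."""
    newly_failing = sorted(
        name for name, passed in baseline.items()
        if passed and not current.get(name, False)
    )
    dropped = pass_rate(current) < pass_rate(baseline) - tolerance
    return {"regressed": dropped or bool(newly_failing),
            "newly_failing": newly_failing}


# Results from last week's scheduled run and today's (invented for the demo).
last_week = {"refund-flow": True, "plan-lookup": True, "unknown-sku": True}
today = {"refund-flow": True, "plan-lookup": False, "unknown-sku": True}
report = detect_regression(last_week, today)
```

`report` names `plan-lookup` as newly failing, which is the run to pull traces for and diff against an older passing one.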
Make it stick. The eval set is the mechanism here — the fix and the regression guard are the same artifact. Every silent degrade you find becomes a case in the set, so the next instance of it is caught by a red test instead of a user. An agent on a moving model without regression evals degrades in the dark; that is the default, not the exception.
It's slow or expensive
How to recognize it. The agent produces the right answer too slowly to be usable, or at a per-run cost that doesn't survive the real audience size. Cost and latency are features, not afterthoughts (principle 6) — an answer that's correct but unaffordable hasn't solved the problem.
What to read in the trace. Profile the run step by step: tokens in and out per step, latency per step, number of model calls. The trace shows you where the budget actually goes — and it's rarely where you'd guess. The usual culprits: a context window that grew unbounded as the run accumulated every step's output, a step calling a large model where a small one would do, or an orchestration with more model hops than the job needs.
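Profiling can start as a sort over the per-step numbers in the trace. The step layout below (`name`, `tokens_in`, `tokens_out`, `latency_s`) and the helper names are assumptions about what your tracing captures.

```python
def profile_steps(steps):
    """Rank steps by total tokens so the most expensive one comes first."""
    rows = [
        {
            "name": step["name"],
            "tokens": step["tokens_in"] + step["tokens_out"],
            "latency_s": step["latency_s"],
        }
        for step in steps
    ]
    return sorted(rows, key=lambda row: row["tokens"], reverse=True)


def input_growth(steps):
    """Input tokens per step in run order; a steady climb suggests unbounded context."""
    return [step["tokens_in"] for step in steps]


# Per-step numbers from one traced run (invented for the demo).
run_steps = [
    {"name": "plan", "tokens_in": 800, "tokens_out": 200, "latency_s": 1.1},
    {"name": "search", "tokens_in": 2400, "tokens_out": 150, "latency_s": 1.9},
    {"name": "answer", "tokens_in": 7200, "tokens_out": 400, "latency_s": 4.3},
]
ranking = profile_steps(run_steps)
growth = input_growth(run_steps)
```

In this invented run the `answer` step dominates and input tokens triple at every step, the signature of a context window accumulating every prior step's output.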
The fix. Match the fix to the step the trace indicted. If context is the cost, curate the window: context is a budget, not a bucket, so summarize, drop, or retrieve on demand instead of piling everything in "just in case" (principle 10). Curating also sharpens reasoning, since an overstuffed context degrades quality too. If a step that doesn't need a frontier model is using one, route it to a smaller tier: a small model for the routine call, a large one only where it earns it (principle 6). If a deterministic answer is recomputed every run, cache it. If the orchestration has hops the job doesn't need, collapse them; the strongest agent system is usually the smallest one that could work.
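Two of these fixes are small enough to sketch: routing by step kind and caching a deterministic lookup. The model tier names, step kinds, and `shipping_zone` function are placeholders, not real model identifiers or APIs; only `functools.lru_cache` is a real library call.

```python
from functools import lru_cache

# Hypothetical step kinds that a small model handles well enough.
ROUTINE_STEPS = {"classify_intent", "extract_fields", "format_reply"}


def choose_model(step_kind):
    """Send routine steps to the small tier; reserve the large tier for the rest."""
    return "small-model" if step_kind in ROUTINE_STEPS else "large-model"


ZONES = {"94": "west", "10": "east"}


@lru_cache(maxsize=256)
def shipping_zone(postcode_prefix):
    """A deterministic lookup: same input, same answer, so compute it once."""
    return ZONES.get(postcode_prefix, "unknown")


first = shipping_zone("94")   # computed
second = shipping_zone("94")  # served from the cache
```

Caching is only safe where the answer really is deterministic for the input; anything that depends on live data needs an expiry or no cache at all.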
Make it stick. Add a cost-and-latency assertion to the eval set: the run stays under a token budget and a latency ceiling on representative inputs. Cost regressions are as real as correctness regressions and just as silent — without the assertion, a future change that doubles the token count ships unnoticed until the invoice does the telling.
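The assertion itself is short. A sketch, assuming each eval run reports `total_tokens` and `latency_s`; the field names, `check_budget`, and the ceilings are illustrative and should come from your own traces and targets.

```python
def check_budget(run, max_tokens, max_latency_s):
    """Return a list of budget violations; an empty list means the run is within budget."""
    failures = []
    if run["total_tokens"] > max_tokens:
        failures.append(f"tokens {run['total_tokens']} > {max_tokens}")
    if run["latency_s"] > max_latency_s:
        failures.append(f"latency {run['latency_s']}s > {max_latency_s}s")
    return failures


# Measurements from a run on a representative input (invented for the demo).
within = check_budget({"total_tokens": 4200, "latency_s": 3.1},
                      max_tokens=6000, max_latency_s=5.0)
doubled = check_budget({"total_tokens": 8400, "latency_s": 3.4},
                       max_tokens=6000, max_latency_s=5.0)
```

`within` is empty and `doubled` reports the token overrun, so a change that doubles the token count fails the eval set instead of surfacing on the invoice.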
The throughline
Five failures, one method underneath: trace the run, reproduce it, fix the layer the trace actually indicts, and lock the fix behind an eval case so it can't quietly come back. The agent's non-determinism is exactly why guessing fails and tracing wins — you can't reason about a run you didn't capture, and you can't prove a fix on a system whose output changes between tries. Every fix above is a place the demo's easy path diverged from production, and the trace is where you find the divergence instead of theorizing about it.
If a failure mode bit your agent and isn't here, write to hello@wearecleon.com — we add it, with credit.
Related
- Agent gotchas — the failure modes this page operationalizes, each with the question to answer first
- What is an agent — the control loop you trace and the eval you score against
- Orchestration patterns — tracing a run that spans more than one loop or a graph
- Tools and actions — schema validation and tool descriptions, for the wrong-tool fix
- Agentforce agents — tracing a run inside the platform
- External agents — tracing a LangGraph or Claude API run
- Agent Style Guide — the bar that keeps these debug sessions from happening twice
- AI Engineering principles — trace everything (11), ground before you generate (2), evaluate before you ship (3), cost and latency are features (6)