Debugging prompts: isolate the variable, then fix

A prompt misbehaved and someone is waiting on the fix. The instinct is to open it and rewrite the whole thing — and it is the wrong instinct, because a prompt is non-deterministic, so a rewrite that "looks better on one try" has proven nothing about whether it actually fixed the bug or just rolled the dice and got a better face this time. The move that changes everything is the one most teams skip under pressure: change one thing at a time and score it against an eval set. Isolate the variable. A prompt is several instructions, an ordering, maybe some examples, all interacting — rewrite five things at once and even if the output improves you cannot say which change did it, or whether you fixed the bug while quietly breaking a neighbor. One edit, scored against real cases, tells you what moved (principle 3). The full rewrite, eyeballed on a single run, tells you nothing you can trust.

This page assumes you have an eval set — a handful of real cases with known-good outcomes — because nothing below works without one: a prompt fix you can't score is a guess wearing a green check, and on a non-deterministic system a single passing run is not a score. It is symptom-driven: find the failure that matches what you're seeing, inspect the actual output at the layer the section points to, change one variable, and — the step that separates debugging from whack-a-mole — turn the failing case into an eval case so the regression can't return. None of this is toolkit-specific: a prompt steering an Agentforce action, a Claude API call, or a step inside a LangGraph loop fails in the same five shapes, and the eval set is where you catch all of them.

The loop: read the actual output → isolate one variable → eval-driven fix

Every debug session below runs the same three steps, in this order:

Read the actual output. Pull what the prompt actually produced on the failing case — the raw text, not your memory of it, and across several runs, not one, because one run of a non-deterministic system hides whether the failure is constant or intermittent. The bug names itself differently depending on whether it fires every time or one time in five, and you only know which by reading more than a single run (principle 11).
Isolate one variable. Change exactly one thing — move an instruction, add one example, tighten one rule, drop one that fights another — and hold everything else fixed. The whole reason a prompt is hard to debug is that its parts interact; the discipline that beats that is the same one a controlled experiment runs on, one variable at a time. Rewrite the whole prompt and you have run no experiment at all — you swapped one black box for another.
Eval-driven fix. Add the failing case to the eval set before you change anything, watch it fail, make the one change, watch it pass, and run the whole set to confirm you didn't break a neighbor. The eval case is what makes the fix permanent — without it, the same failure walks back in on the next prompt edit or model update and nobody notices (principle 3).

It ignores an instruction

How to recognize it. The rule is in the prompt — you can point to it — and the model breaks it anyway. The worst kind of bug to argue about, because everyone can see the instruction is right there in the text, and the model still didn't follow it.

What to inspect. Find where the instruction sits and what it's competing with. Two shapes, both covered in system prompts and instructions and prompting gotchas (gotchas 1 and 3). First, salience: the rule is buried mid-prompt, the eleventh sentence in a dense paragraph, weighted the same as the ten around it — present but not prominent. Second, contradiction: another instruction quietly fights it — "be concise" against "be thorough," an exception that negates the rule above it — and the model is obeying the one you forgot was there. Read the prompt for both, because the fix is opposite: one needs the rule made louder, the other needs the conflicting rule removed.

The fix. If it's salience, pull the constraint out of the prose and give it structure — its own line or section, stated once and clearly, where the model actually weights it. More words is not the fix; prominence is. If it's contradiction, the fix is subtraction: remove or reconcile the rule that fights the one you want, because adding a third "always" on top of two that already disagree just deepens the hole (gotcha 3). Either way, change the one thing the inspection indicted — don't reword the whole prompt and lose track of which move mattered.

Make it stick. Add a case to the eval set that asserts the constraint holds, and run it on inputs that tempt the model to break it — not just the easy one where it already complies. The eval is what proves the restructure or the removal fixed the real case and didn't quietly loosen a rule that used to hold.

It produces the wrong or an inconsistent format

How to recognize it. The shape is wrong — a friendly preamble before the JSON, fields reordered, the output wrapped in a code fence, a key renamed — or it's right on some runs and wrong on others. The parser downstream chokes on the first output you didn't try.

What to inspect. Read the raw output across several runs, not one, because format failures are the most intermittent kind — the prompt emits the right shape ten times and breaks on the eleventh, so a single passing run tells you nothing (gotcha 2). Lay several runs side by side and look at what varies: is it the same field every time, or does the model drift somewhere new each run? The pattern tells you whether you're fighting one weak spot or relying on prose to carry structure it can't hold.

The fix. Move from prose to structured output, not "please use the format." Asking more firmly for the right shape is still asking a non-deterministic generator to behave; the fix is to stop trusting prose to carry structure and declare the shape against a schema — the full mechanic is structured output. That moves the burden of producing a valid shape from your fragile parser onto the model and the API contract, where it belongs, and turns "almost the right shape" into a constraint the platform enforces rather than a hope you re-word.

Make it stick. Add an eval case that asserts the shape, not the prose — the output parses, every required field is present, the types are right — and run it across enough cases to catch the intermittent break, since a format bug that fires one run in ten passes a single-run check every time. The eval is what proves the schema held under the inputs that used to break the parser.

Its output is inconsistent run to run

How to recognize it. Same input, different answers — not wrong exactly, just unstable, drifting in content or judgment from one run to the next. A reviewer can't trust it because it won't say the same thing twice.

What to inspect. First, the question almost nobody asks: does this task even have one right answer? Some tasks are genuinely open — a brainstorm, a draft, a summary with many valid forms — and demanding determinism from them is fighting the wrong battle; the fix there is a human gate, not a tighter prompt (principle 8). But if the task does have a stable correct answer and the model wanders anyway, inspect what's underspecified: a vague instruction the model fills differently each run, a judgment call you described in prose instead of pinning with an example, room you left that the model uses (gotcha 4, and system prompts and instructions).

The fix. Lower the variance at its source. Clearer, more direct instructions remove the room the model was wandering in; a couple of worked examples (few-shot) pin a judgment or a shape that prose only gestured at, by showing instead of telling; and where your platform exposes it, a lower temperature reduces sampling randomness — useful for a task that wants one stable answer, the wrong move for one that wants range. Reach for the cheapest lever first — clarity usually beats every other knob — and change one at a time so you can see which one actually steadied the output. The Anthropic guidance on this is in the reference below.

Make it stick. Add an eval case that scores consistency, not a single run: run the same input several times and assert the answers agree on what matters — the classification is stable, the required facts are constant, the shape holds. A consistency bug is invisible to a one-run check by definition, so the eval has to sample more than once, or the drift walks back in looking fine.

It degraded after a model change

How to recognize it. Same prompt, new model — an upgrade, a version bump, a swap — and the output is subtly worse. Nothing in the prompt changed, so the regression hides in the one variable everyone assumes is stable.

What to inspect. Re-run the whole eval set on the new model and read the diff against the old model's scores. The prompt was partly fitted to the old model — its formatting habits, the phrasings it responded to, the quirks you tuned around — and that tuning does not transfer for free (gotcha 8). The eval diff localizes it: which cases dropped, and on what kind of input, tells you where the new model reads your prompt differently than the old one did. Without the set, you have a vague sense that "it got worse" and no way to say where.

The fix. Re-tune the prompt against the eval on the new model — don't port the old prompt and hope. The instructions that carried the old model may need re-phrasing for the new one; an example that pinned a behavior before may now be redundant or even counterproductive. Make each change against the set on the new model, one variable at a time, until the scores recover. This is also why "wait for the better model" is not a free upgrade: a newer model can regress a prompt tuned to the old one's habits, and only the eval tells you whether the swap helped or hurt.

Make it stick. The eval set is the mechanism here — make it run on every model change, not just this one. A model swap that isn't scored against the set is a silent quality change shipped on faith; a model swap that is scored turns "the upgrade quietly made it worse" into a red test the day the version moves, instead of a degrade a user finds for you.

It's right sometimes, wrong on edge cases

How to recognize it. The prompt handles the common inputs fine and fails on the unusual ones — a long input, an empty field, an ambiguous phrasing, a format it rarely sees. The happy path is solid; the margins leak.

What to inspect. Don't debug the failures one at a time — collect them and inspect them as a cluster. Pull every wrong case you have and ask what they share: the same missing field, the same input length, the same ambiguity, the same category the prompt never had an example for. A single edge-case failure looks like noise; the same failures lined up reveal the one thing the prompt doesn't handle, and that shared trait is what you actually fix — not five separate patches that each fight one symptom.

The fix. Target the cluster, not the instances. Usually one of two moves: add a worked example (few-shot) that demonstrates the right handling for that case — the highest-leverage fix when the shape of "right" is easier to show than to describe (system prompts and instructions) — or add one explicit rule for the case the examples don't cover. Resist bolting on a separate caveat per failing input; that is how a prompt grows into an unmaintainable wall of exceptions (gotcha 3). One example or one rule that covers the whole cluster beats five that each cover one.

Make it stick. Add the edge case to the eval set — the actual one that failed, with its known-good outcome — so it can't regress. The edge cases that bit you in production are the most valuable cases in the set, because they encode the ways your task actually breaks; every one you add is a margin the next prompt edit can't quietly re-open without a red test.

The throughline

Five failures, one method underneath: read the actual output across several runs, isolate the one variable the symptom indicts, change only that, and lock the fix behind an eval case so it can't quietly come back. The prompt's non-determinism is exactly why the full rewrite fails and the controlled change wins — you cannot prove a fix on a system whose output moves between tries by eyeballing one run, and you cannot tell which of five simultaneous edits helped. The prompt is the most-edited and least-tested artifact in most AI systems, which is why this discipline matters more here than anywhere: every fix above is a place an unscored "it looks better" would have shipped a regression you couldn't see, and the eval set is where you catch the divergence instead of theorizing about it.

If a prompt failure bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.

Prompting gotchas — the failure modes this page operationalizes, each with the question to answer first
System prompts and instructions — where salience, ordering, examples, and the ignored-instruction fix are decided
Structured output — the fix for the wrong-or-inconsistent-format symptom
What is context engineering — the discipline these symptoms live inside
Context windows — the budget an edge-case-on-long-input failure often traces back to
Prompt caching — what an edit to a cached preamble costs, before you tune in a loop
Prompting Style Guide — the bar that keeps these debug sessions from happening twice
Debugging agents — the same method for a full agent run, not just the prompt
Debugging grounding — the retrieval-side companion when the wrong answer is a retrieval miss, not a prompt one
AI Engineering principles — evaluate before you ship (3), non-determinism needs a human gate (8), trace everything (11)

Reference: