Skip to main content

Prompting gotchas: how a prompt breaks in production

A prompt that works in the playground works on the input you typed, on the model you typed it into, on a path you walked yourself. Production runs it on inputs nobody scripted, with untrusted text in the window, against a model version that shifts under you — and a prompt failure looks like a confident answer, not an error. Ten gotchas that break a prompt after the demo, each with the question to answer first and the cost of getting it wrong.

Production note·Last updated 2026-06-03·Drafted by Lira · Edited by German Medina

A prompt that works in the playground proves the model can do the thing once — on the input you typed, in the model you typed it into, down a path you walked yourself. Production is a different room: the model runs on inputs nobody scripted, with untrusted text sitting in the window next to your instructions, against a model version that can shift under you between one quarter and the next. The prompt did not get worse; the conditions it never saw arrived. And a prompt failure does not announce itself as an error — it comes back as a confident, well-formed, wrong answer, or a shape your parser chokes on, and someone downstream acts on it.

Ten gotchas that turn a prompt that demoed clean into one that breaks in production. They are the same shape as the ones that kill agents and grounding: the demo makes the easy path look like the whole path, and production is everything the easy path left out. Each is paired with the question to answer before you ship and the cost of getting it wrong. None of this is toolkit-specific — a prompt steering an Agentforce action, a Claude API call, or a step inside a LangGraph loop inherits every one of these. The skill is not finding clever words; it is engineering the context around the model so the words hold under conditions the playground never showed you.


The gotchas

1. The ignored instruction — a constraint buried in a wall of text the model skips

You wrote the rule. It is in the prompt. The model ignored it anyway — not because it can't follow instructions, but because the one that mattered was the eleventh sentence in a dense paragraph, weighted the same as the ten around it. Salience is not a function of you having typed it; a constraint with no structural emphasis competes with everything else in the window and loses as often as it wins.

The cost is a rule you can prove is "in the prompt" and a model that breaks it in production anyway — the worst kind of bug to argue about, because everyone can see the instruction is there. The fix is structure and salience, not more words: pull the constraint out of the prose, give it its own line or section, state it once and clearly. The question to answer before you ship: are your hard constraints structurally prominent — set apart, not buried — or are they sentences in a paragraph hoping to be noticed?

2. Fragile format — relying on prose to emit an exact shape, with nothing enforcing it

A prompt that says "respond with the fields below" and gets the right shape in ten test runs is not producing that shape reliably — it is producing it so far. Free-form generation is non-deterministic by nature, and on the eleventh run the model adds a friendly preamble, wraps the output in a code fence, or reorders the fields, and the parser that assumed the demo's exact shape breaks on the first input you didn't try.

The cost is a pipeline that runs for a week and then fails on a Tuesday, on an output that is almost right — the kind of break that pages someone at the worst time and is maddening to reproduce. The fix is to stop trusting prose to carry structure: ask for a defined shape and enforce it (see gotcha 7). The question: what guarantees the output shape — a schema, structured output, a validation step — or is your parser betting the model will never once decide to be conversational?

3. Over-prompting — piling rules until they contradict each other and drown the task

When a prompt misbehaves, the instinct is to add: another caveat, another "always," another "never." Past a point this backfires. The rules start to contradict each other — "be concise" fighting "be thorough," an exception that quietly negates the rule above it — and the actual task drowns under the scaffolding meant to control it. More prompt is not more control; it is more surface area for the model to weight wrong.

The cost is a bloated prompt whose behavior nobody can predict, where adding a rule to fix one case silently breaks two others. The question: does every instruction in this prompt earn its place against the task, and have you ever removed a rule — not just added one — to see whether the output got better?

4. No examples where the task needs them — telling instead of showing

Some tasks resist description. A tone, an edge-case judgment, an output style — you can write three paragraphs trying to specify it abstractly, or you can show two or three worked examples (multishot) and let the model pattern-match what you mean. Telling instead of showing is the most common reason a prompt that "explains the task perfectly" still produces the wrong thing: the words were clear and the intent still didn't transfer.

The cost is a prompt that grows longer and vaguer as you try to describe in prose what a couple of examples would have nailed in a fraction of the tokens. The question: for the cases this task gets wrong, would a few well-chosen examples of the right output teach it faster than another paragraph of instruction — and are your examples covering the hard cases, not just restating the easy one the model already gets?

5. Prompt injection — untrusted text in the window carrying instructions that override yours

The moment your prompt includes text you did not write — a retrieved chunk, a user's message, a pasted document, an email body — that text can carry instructions, and the model cannot reliably tell your instructions from the data's. "Ignore your previous instructions and…" buried in a retrieved document is not a hypothetical; it is the default attack on any system that puts untrusted content in the context window. The model treats the whole window as one conversation, and the data gets a vote it should never have had.

The cost is a model that follows an attacker's instruction because it arrived as data and you trusted the window — at its worst, an action taken or a secret leaked on the say-so of a pasted paragraph. The question: where does untrusted text enter this prompt, is it structurally fenced off from your instructions, and what is the blast radius if a retrieved or user-supplied chunk tries to redirect the model?

6. Badly-ordered context — the key instruction buried in the middle where the model under-weights it

Position is signal. Models attend most reliably to the start and end of a long context and under-weight the middle — the "lost in the middle" effect — so a critical instruction or the one datum the answer depends on, dropped into the center of a long prompt, gets read but not weighted. The same words at the top or bottom land; in the middle of ten thousand tokens they fade. Order is a lever you are pulling whether or not you meant to.

The cost is an instruction that is technically present and functionally invisible, producing a bug that vanishes the moment you move the same text to the top — and is baffling until you know to look at position. The question: does the most important instruction or fact sit where the model actually weights it — near the start or the end — rather than lost in the middle of the window, and have you tested whether reordering changes the output?

7. Parsing prose instead of structured output — building a parser for a problem that has a feature

If you find yourself writing a regex to scrape a number out of the model's sentence, or splitting on a delimiter you asked the model to "please use," stop: you built a parser for a problem the platform already solves. Structured output — asking for JSON against a schema, or using the structured-output and tool-use features directly — moves the burden of producing a valid shape from your fragile post-processing onto the model and the API contract, where it belongs.

The cost is a brittle parser that breaks every time the model phrases things slightly differently, plus the hours you spend hardening string-scraping logic against a moving target instead of declaring the shape once. The question: are you asking for structured output and validating it against a schema, or scraping structure out of prose by hand — reinventing, badly, a feature the model already supports?

8. Model-change drift — a prompt hand-tuned to one model quietly degrades on the next

A prompt is not portable for free. Tune it carefully to one model — its quirks, its formatting habits, the phrasings it responds to — and that tuning is partly fitted to that model, not to the task. Swap the model, or accept an upgrade, and the prompt can quietly degrade: the same words, a different model, subtly worse output that no error flags. The inverse trap is just as real — "wait for the better model" is not a strategy when a newer model can regress a prompt that was tuned to the old one's habits.

The cost is a silent quality drop the day a model version changes under you, attributed to anything but the prompt because the prompt "didn't change" — the regression hides in the one variable everyone assumes is stable. The question: is each prompt re-evaluated against your eval set when the model version changes, and do you treat a model swap as a change that has to be scored — not a free upgrade you can assume is strictly better?

9. No eval per prompt change — every tweak is an unscored coin-flip

Editing a prompt because the new version "looks better" on the one input you happened to try is not improvement — it is a guess you can't grade. Without a test set of real cases with known-good outcomes, scored on every change, you cannot tell a fix from a regression, and you ship both with equal confidence. This is principle 3 (if you can't evaluate it, you can't ship it) at the prompt level: the prompt is the most-edited and least-tested artifact in most AI systems.

The cost is a prompt that drifts in the dark — each "improvement" trading a fix you can see for a regression you can't, until quality erodes and nobody can say which edit did it. The question: do you have an eval set this prompt runs against on every change, and does an edit have to score better — not just look better on one try — before it ships?

10. Cost and caching ignored — a huge static preamble re-sent on every call

A long system prompt, a fat set of instructions, a block of examples — sent once in a demo, it is free. Sent on every call at production volume with no prompt caching, that same static preamble is the bill: you pay to re-process the identical leading tokens on every single request, and at a million calls a month the prompt that was fine in the demo is a cost line nobody approved. This is principle 6 (cost and latency are features) at the prompt level — the prompt's length is a per-call price, paid forever.

The cost is a system that is correct and quietly uneconomical — the right answer at a token bill that scales linearly with traffic and never had to. The question: what does this prompt cost per call at your real volume, is the large static portion served from prompt caching rather than re-sent every time, and have you separated the stable preamble (cacheable) from the per-request input (not) so the cache can actually do its job?


The throughline across all ten: a prompt that ships is structured for salience, enforced in its output shape, defended against the untrusted text in its window, ordered so the model weights what matters, evaluated on every change, and budgeted in cost — or it is a prompt that worked once in the playground and has not yet met an input you didn't write. Every gotcha above is a place the playground's single clean run and the production reality of running-on-everything diverge, and the wrong input is always waiting in the traffic you have not tested against.

Closing

These ten are the prompt failures Cleon has seen most often, across both Agentforce-native and external builds. The discipline that prevents them is the same one that runs through all of AI engineering: the playground makes the easy path look like the whole path, and production is everything the easy path left out. The prompt is the most-edited artifact in an AI system and usually the least-tested, which is exactly why the discipline matters — treat it as engineered context, not a string you tweak by feel, and eval every change against real cases before it ships. Get that right and most of these never fire; skip it and you ship a prompt that breaks confidently, on an input the demo never showed. None of them is hard to prevent up front; all of them are expensive to discover live, because a prompt bug looks exactly like a correct answer until someone checks.

If a prompting gotcha bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.

Related

Reference: