Evaluation gotchas: how a measurement lies to you

The whole point of evaluation is to replace "it looked good" with a number you can stand behind — the measurement that says the prompt improved, the agent didn't regress, the new model is safe to ship. That number is the gate between a system you're guessing about and one you've actually verified. Which is exactly why a bad eval is more dangerous than no eval: no eval leaves you honestly uncertain, but a broken one hands you false confidence — a green dashboard over a system that's quietly worse, and you ship on the strength of a measurement that was never measuring what you thought.

Below are ten gotchas that turn an evaluation into theater: a number that looks like proof and isn't. They are the failures that hit the eval set and the observability around it, the regression net that the agents, grounding, and prompting work all hang from. If that net has holes, every "we tested it" upstream is worth less than it sounds. Each gotcha is paired with the question to answer before you trust the number and the cost of getting it wrong. None of this is toolkit-specific: whether you score with Agentforce Testing Center and read traces from Agentforce Observability, run an offline suite in LangSmith, or grade with Claude as a judge over the Anthropic API, a measurement lies in the same ways on all three. Evaluation and observability are one discipline over three surfaces (the platform layer, the model layer, the off-platform tracing stack), composed by where the system runs, not three rivals you choose between.

The gotchas

1. Eval-set leakage — test examples that bled into dev, so you measure memory not capability

The cleanest-looking score in the world is worthless if the system already saw the answers. When a test example leaks into the prompt's few-shot block, into fine-tuning data, or into the retrieval index the agent reads from, the eval stops measuring whether the system can do the task and starts measuring whether it memorized this instance of it. The number goes up; the capability didn't. Anthropic's own guidance is built on a held-out test set for exactly this reason — held out means the system never trained or tuned on it.

The cost is the worst kind of false confidence: a high score you trust, a launch you green-light on it, and a system that collapses on the first genuinely-unseen input because all you ever proved was recall. The fix is a strict wall between the eval set and everything the system learns from — examples in the prompt, tuning data, the retrieval corpus — and a periodic check that no test case has quietly crossed it. The question to answer first: can you prove this eval set is held out — that not one of these examples appears in the prompt, the tuning data, or anything the system retrieves at run time?
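The periodic check can start as something very small. A minimal sketch, assuming the eval inputs and everything the system learns from can each be loaded as lists of strings; the function names and the source labels are hypothetical. It only catches verbatim or near-verbatim copies (case and whitespace differences), so a paraphrased leak would still need fuzzy or embedding-based matching on top.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a lowercased, whitespace-collapsed copy of the text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaks(
    eval_inputs: list[str],
    learned_sources: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Return, per source name, the eval inputs that also appear in that source.

    `learned_sources` maps a label (few-shot block, tuning data, retrieval
    corpus) to the texts it contains. An empty result means no exact leak
    was found, not that leakage is impossible.
    """
    eval_by_hash = {fingerprint(x): x for x in eval_inputs}
    leaks: dict[str, list[str]] = {}
    for source_name, texts in learned_sources.items():
        seen = {fingerprint(t) for t in texts}
        hits = [x for h, x in eval_by_hash.items() if h in seen]
        if hits:
            leaks[source_name] = hits
    return leaks
```

Run it whenever the prompt, the tuning data, or the corpus changes, and treat any non-empty result as a reason to distrust the current score.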

2. Gaming a single metric — Goodhart eats your goal

"When a measure becomes a target, it ceases to be a good measure." Pick one number — accuracy, a similarity score, average rating — make it the thing everyone optimizes, and the system will climb that number by whatever route is cheapest, including routes that abandon the goal the number was supposed to stand for. A summarizer optimized for ROUGE overlap learns to echo the source; a support bot optimized for "resolved" learns to close tickets the customer reopens. The metric rises while the thing you cared about falls.

The cost is a system that scores better every sprint and serves customers worse — a chart going up and to the right while the actual outcome erodes underneath it, invisible until someone reads the transcripts. The fix is to measure along several axes that can't all be gamed at once — Anthropic's own examples pair task fidelity with a separate non-toxicity rate and a latency bound precisely so one number can't be juiced in isolation — and to keep at least one qualitative read on real outputs as a check on the quantitative ones. The question: if the system maximized this metric by any means available, would it still be doing the job you actually want — or have you handed it a number it can win while losing the point?
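One way to keep a single number from being juiced in isolation is to make the release verdict depend on every axis at once. A minimal sketch; the axis names echo the fidelity, non-toxicity, and latency pairing mentioned above, but the thresholds and the `Axis` type are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Axis:
    name: str
    threshold: float
    higher_is_better: bool = True


def failed_axes(scores: dict[str, float], axes: list[Axis]) -> list[str]:
    """Return the names of every axis whose score misses its threshold."""
    failed = []
    for axis in axes:
        value = scores[axis.name]
        if axis.higher_is_better:
            ok = value >= axis.threshold
        else:
            ok = value <= axis.threshold
        if not ok:
            failed.append(axis.name)
    return failed


# Hypothetical thresholds, for illustration only.
AXES = [
    Axis("task_fidelity", 0.85),
    Axis("non_toxicity_rate", 0.99),
    Axis("p95_latency_s", 2.0, higher_is_better=False),
]
```

A run that lifts task fidelity while dropping below the non-toxicity bar fails the gate, which is the point: the system has to win on all axes, not on the cheapest one. This does not replace the qualitative read on real outputs.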

3. LLM-as-judge bias — an unanchored judge drifts

Using a model to grade outputs is fast, scalable, and the right call for nuanced judgments — Anthropic lists LLM-based grading as suitable exactly where code-based grading is too rigid. But a judge is a model, and models have biases: position bias (favoring whichever answer came first), verbosity bias (scoring the longer, more confident-sounding answer higher regardless of correctness), and self-preference (a model rating its own family's outputs more kindly). An unanchored judge — "rate this 1 to 5" with no rubric — drifts on all of them, and your scores measure the judge's tilt as much as the output's quality.

The cost is a scoreboard that looks rigorous and ranks by verbosity and order: you pick the "better" prompt because its outputs were longer, not better, and the bias is baked into every decision the eval drove. The fix is to anchor the judge: give it an explicit rubric, ask it to reason before it scores, and control for position and verbosity. The question: is your judge anchored to an explicit rubric, asked to reason before scoring, and controlled for position and verbosity, or is it an unconstrained "rate this" that's quietly grading on length and order?
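A sketch of what anchoring can look like for a pairwise judge, under stated assumptions: `judge` is a hypothetical callable that wraps whatever model call you use and returns "first" or "second", and the rubric text is an illustrative placeholder. The position control is a swap: each pair is judged in both orders, and a verdict only counts if it survives the swap.

```python
from typing import Callable

# Hypothetical signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

# Illustrative rubric; the criteria are placeholders for your own.
JUDGE_PROMPT = """Compare two answers to the question below.
Rubric: (1) factually correct, (2) answers what was asked,
(3) length is NOT a merit by itself.
First write your reasoning, then end with exactly 'first' or 'second'.

Question: {question}
First answer: {first}
Second answer: {second}"""


def build_prompt(question: str, first: str, second: str) -> str:
    """Fill the rubric-anchored template for one ordering of the pair."""
    return JUDGE_PROMPT.format(question=question, first=first, second=second)


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge the pair in both orders; return 'a', 'b', or 'inconsistent'."""
    a_wins_forward = judge(question, answer_a, answer_b) == "first"
    a_wins_backward = judge(question, answer_b, answer_a) == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "inconsistent"  # the verdict flipped with the order: position bias
```

The share of "inconsistent" verdicts is itself worth tracking as a rough measure of how much the judge leans on order. The swap does nothing for verbosity bias; that has to be handled in the rubric or by comparing length-matched answers.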

4. No baseline — "better" with nothing to be better than

"The new prompt is better" is not a claim you can make without a measured before. If you didn't score the old version on the same set, "better" is a feeling, and feelings about AI output are exactly what evaluation exists to replace. Anthropic frames a good success criterion as a measurable delta against a baseline — "a 5% improvement over our current baseline" — and the 5% is meaningless without the baseline number sitting next to it.

The cost is a team that ships changes it can't defend — every release justified by an improvement nobody measured, and no way to tell, three sprints later, whether the system is better or worse than where it started. The fix is to measure the current version before you change anything, store that number, and report every result as a delta against it. The question: do you have a recorded baseline score for the version you're replacing — or is "better" a claim with no number under it?
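Reporting every result as a delta is easy to enforce in code: refuse to print a score that has no baseline next to it. A minimal sketch; `ScoredRun` and `report_delta` are hypothetical names, and both runs are assumed to have been scored on the same eval set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScoredRun:
    system_version: str
    score: float  # fraction of eval cases passed, 0.0 to 1.0


def report_delta(baseline: Optional[ScoredRun], candidate: ScoredRun) -> str:
    """Format the candidate score as a delta against the recorded baseline."""
    if baseline is None:
        raise ValueError("no baseline recorded: score the current version first")
    delta = candidate.score - baseline.score
    return (
        f"{candidate.system_version}: {candidate.score:.3f} "
        f"({delta:+.3f} vs {baseline.system_version} at {baseline.score:.3f})"
    )
```

For example, `report_delta(ScoredRun("prompt-v1", 0.80), ScoredRun("prompt-v2", 0.84))` yields `prompt-v2: 0.840 (+0.040 vs prompt-v1 at 0.800)`, and calling it with no baseline fails loudly instead of producing a bare number.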

5. Too few examples — noise drowns the signal

A ten-example eval tells you almost nothing. With a set that small, one ambiguous case or one lucky guess swings the score by ten points, and you can't tell a real improvement from sampling noise. Anthropic's design principle is blunt about the trade-off — prioritize volume over quality: more questions with slightly-lower-signal automated grading beats fewer questions hand-graded to perfection, because statistical power comes from numbers, and a confident-looking percentage over a handful of cases is a confident-looking lie.

The cost is decisions made on noise dressed as signal — you "improved" the score from 70 to 80 percent and shipped, when over ten examples that's one case flipping and well inside the margin where nothing real happened. The fix is enough examples that a single case can't move the verdict, and — when hand-writing hundreds is the blocker — using the model to expand a seed set, which Anthropic explicitly recommends. The question: is this eval set large enough that the result survives a few cases going either way — or are you reading a percentage built on so few examples that it's mostly noise?
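To see how little a ten-example score pins down, put an interval around it. The sketch below uses the Wilson score interval for a pass rate, which is a standard choice for small samples but is this example's pick, not something the guidance above prescribes. Treating overlapping intervals as "no real difference" is a rough heuristic, not a formal significance test.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for passed, total in [(7, 10), (8, 10), (700, 1000), (800, 1000)]:
        low, high = wilson_interval(passed, total)
        print(f"{passed}/{total}: {low:.3f} to {high:.3f}")
```

At ten examples, 70 percent spans roughly 0.40 to 0.89 and 80 percent spans roughly 0.49 to 0.94, so the two results are almost entirely on top of each other. At a thousand examples the same two pass rates no longer overlap, which is what "enough examples that a single case can't move the verdict" buys you.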

6. Silent regression on a change — no regression set, so production tells you

Change the prompt, swap the model version, edit a tool, re-chunk the corpus — and if there's no regression set that reruns automatically on the change, the first place you learn what broke is production. AI systems regress silently: the same input that worked yesterday returns a subtly worse answer today, no error thrown, no exception logged, just quality bleeding out where nobody's watching. This is the eval set doing its highest-value job — it's the regression net under every agent, every retrieval pipeline, every prompt change — and without it each change is a bet you settle in front of customers.

The cost is finding out from a customer, a support spike, or a metric three weeks late that a change two sprints ago quietly degraded the system — and a hunt through everything that changed since, because nothing flagged it when it happened. The fix is a regression set built from real past failures that reruns on every change to the prompt, model, tools, or retrieval, with a gate that blocks the change if the score drops. The question: does a change to any part of this system automatically rerun a regression set before it ships — or do you find out it regressed when the people using it do?
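The gate itself can be a few lines in whatever runs on each change. A minimal sketch, assuming each run produces a pass/fail result per regression case keyed by a case ID; the function name and the tolerance parameter are hypothetical. It reports newly failing cases separately, because the overall score can hold steady while an individual case quietly flips.

```python
def regression_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_score_drop: float = 0.0,
) -> tuple[bool, list[str]]:
    """Compare per-case results; return (gate_passed, newly_failing_case_ids)."""
    if not baseline:
        raise ValueError("baseline has no cases")
    missing = set(baseline) - set(candidate)
    if missing:
        raise ValueError(f"candidate run skipped cases: {sorted(missing)}")
    newly_failing = sorted(
        case for case, passed in baseline.items() if passed and not candidate[case]
    )
    base_score = sum(baseline.values()) / len(baseline)
    cand_score = sum(candidate[case] for case in baseline) / len(baseline)
    gate_passed = (base_score - cand_score) <= max_score_drop
    return gate_passed, newly_failing
```

In a CI job, a failed gate would typically exit non-zero so the change cannot merge; a passed gate with a non-empty `newly_failing` list is still worth a look before shipping.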

7. Offline ≠ online — a passing offline eval that's worse live

A green offline eval is necessary and not sufficient. Offline evaluation, as LangSmith defines it, runs pre-deployment over a curated dataset with reference outputs — it proves the system handles the cases you collected. Production runs on a distribution you didn't curate: different inputs, different mix, the long tail you never thought to put in the set. The gap between the two is distribution shift, and a system that aces the offline suite can still degrade live because live is not the dataset.

Offline and online evaluation are complementary, not interchangeable — LangSmith draws the line cleanly: offline scores a curated dataset with known-good answers before you ship; online scores live production traces, where there is no reference output to compare against, so the evaluator has to be reference-free. You need both. Offline is the gate before release; online is the monitor after it — scoring real traffic, catching the distribution shift the dataset never contained. On Salesforce, the traces that feed the online side come from Agentforce Observability, which exports turns, LLM calls, actions, and metric scores in OpenTelemetry (OTLP) format into Data 360 or any OTLP collector. Offline tells you it's safe to ship; online tells you it's still working once real users arrive.

The cost is a launch greenlit on an offline pass that quietly underperforms on live traffic — the dataset said yes, the distribution said no, and the gap shows up as customer-visible quality nobody's eval caught. The question: do you score live production traffic, not just the curated offline set — and would you catch it if real-world inputs drifted away from the cases your dataset contains?
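One cheap signal for drift is to compare the mix of inputs in the offline set with the mix in live traffic. A sketch under stated assumptions: every offline case and every sampled live trace carries a category label (the labels are hypothetical and would come from your own tagging or a classifier), and drift is summarized as total variation distance between the two category distributions. It complements reference-free scoring of live traces; it does not replace it.

```python
from collections import Counter


def category_shares(labels: list[str]) -> dict[str, float]:
    """Turn a list of category labels into a share per category."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation(offline_labels: list[str], live_labels: list[str]) -> float:
    """Total variation distance: 0.0 means identical mix, 1.0 means disjoint."""
    p = category_shares(offline_labels)
    q = category_shares(live_labels)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def unseen_categories(offline_labels: list[str], live_labels: list[str]) -> list[str]:
    """Categories that appear in live traffic but nowhere in the offline set."""
    return sorted(set(live_labels) - set(offline_labels))
```

A rising distance, or any entry in `unseen_categories`, suggests the dataset no longer describes what production sees and that new cases should be pulled from live traces into the offline set.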

8. The "vibe check" — eyeballing a few outputs is not measurement

Reading three outputs, nodding, and shipping is the default failure of AI evaluation, and it's seductive precisely because the outputs are fluent — they read well, so they feel correct. But "looked good in three tries" is principle 3 stated as its own anti-pattern: a vibe, not a test. Eyeballing has no baseline, no coverage of the hard cases, no number you can compare next week, and it scales to exactly zero — you cannot eyeball a thousand outputs, and the failures live in the ones you didn't read.

The cost is a system whose quality is whatever the last person to glance at it felt that day — undefended, unrepeatable, and blind to every failure mode outside the two or three outputs that happened to get looked at. The fix is to convert the vibe into a measurement: write down what "good" means as a rubric, turn it into scored cases, and automate the grading so it runs on volume instead of vibes. The question: is "it works" backed by a score over a real set with a rubric — or by someone having read a few outputs and felt fine about them?
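Converting the vibe into a measurement can start with the criteria you were already applying in your head, written down as checks. A minimal sketch using code-based checks; the three criteria are hypothetical examples for a refund-policy answer, and anything too nuanced for a string check would go to a rubric-anchored model judge instead.

```python
from typing import Callable

Check = Callable[[str], bool]

# Hypothetical rubric for one task; replace with your own definition of "good".
RUBRIC: dict[str, Check] = {
    "states_refund_window": lambda out: "30 days" in out,
    "at_most_one_apology": lambda out: out.lower().count("sorry") <= 1,
    "under_80_words": lambda out: len(out.split()) <= 80,
}


def grade(output: str, rubric: dict[str, Check]) -> dict[str, bool]:
    """Score one output against every rubric criterion."""
    return {name: check(output) for name, check in rubric.items()}


def pass_rate(outputs: list[str], rubric: dict[str, Check]) -> float:
    """Fraction of outputs that satisfy all criteria."""
    if not outputs:
        raise ValueError("no outputs to grade")
    passed = sum(all(grade(out, rubric).values()) for out in outputs)
    return passed / len(outputs)
```

The per-criterion results from `grade` show which part of "good" an output missed, and `pass_rate` gives the number you can compare next week.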

9. Unversioned eval set — the ruler changes under the measurement

If the eval set itself shifts — examples added, edited, or quietly dropped — without a version on it, your scores stop being comparable across time. Last month's 82 percent and this month's 88 percent mean nothing next to each other if the set changed in between: you might have an improved system, an easier test, or both, and no way to separate them. An eval set is a measuring instrument, and an instrument whose markings move can't be used to detect change — which is the one thing it exists to do.

The cost is a quality history you can't trust — a trend line plotted against a ruler that kept changing length, so you can't tell real progress from a test that got easier, and every cross-version comparison is quietly meaningless. The fix is to version the eval set like code: changes are commits, every score records the set version it ran against, and a comparison across versions is flagged as not apples-to-apples. The question: is this eval set versioned, with every score tied to the version it ran against — or is the set drifting underneath you, making last month's number incomparable to this one?
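Alongside commit history, a content hash is one way to tie each score to the exact set it ran against. This is a sketch of that idea, not a prescribed mechanism: it assumes cases are JSON-serializable dicts, and the field names (`eval_set_version`, `score`) are hypothetical.

```python
import hashlib
import json


def eval_set_version(cases: list[dict]) -> str:
    """Short content hash of the eval set, independent of case order."""
    serialized = sorted(json.dumps(case, sort_keys=True, ensure_ascii=False) for case in cases)
    digest = hashlib.sha256("\n".join(serialized).encode("utf-8")).hexdigest()
    return digest[:12]


def record_score(cases: list[dict], score: float) -> dict:
    """Store the score together with the version of the set it ran against."""
    return {"eval_set_version": eval_set_version(cases), "score": score}


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two scores are apples-to-apples only if they used the same set version."""
    return run_a["eval_set_version"] == run_b["eval_set_version"]
```

Any added, edited, or dropped case changes the hash, so a trend line can flag the exact points where the ruler changed instead of silently plotting across them.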

10. Log everything, measure nothing — traces with no scores attached

Capturing every trace and never scoring any of them feels like observability and isn't. Principle 11 — trace everything, you can't debug what you can't replay — is necessary, but a warehouse of raw traces with no metrics or scores attached is a haystack you only search after an incident, by hand, once you already know something's wrong. The point of observability is to surface the degrade before a customer does, and raw logs with nothing computed over them surface nothing — they're forensic material, not a monitor.

The cost is a full trace archive and a silent degrade that runs for weeks because nothing was scoring the stream: you have every record of the failure and learned about it from a customer anyway, then spend days grepping logs you could have been alerting on. The fix is to score the stream, attaching metrics and judgments to traces as they arrive so that a quality drop trips an alert. The question: are your traces scored, with metrics and judgments attached so a quality drop trips an alert, or are you logging everything and measuring nothing, holding a haystack you'll only search once someone tells you it's on fire?
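Once traces carry a score, the monitor can be very simple. A minimal sketch, assuming each trace has already been given a quality score between 0 and 1 by some evaluator (not shown); the class name, window size, and threshold are hypothetical. It alerts when the mean over the most recent window drops below a floor.

```python
from collections import deque


class RollingScoreAlert:
    """Track the last `window` trace scores and flag a drop in their mean."""

    def __init__(self, window: int, min_mean: float) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        self.scores: deque[float] = deque(maxlen=window)
        self.min_mean = min_mean

    def add(self, score: float) -> bool:
        """Record one trace score; return True if the alert should fire."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        # Stay quiet until the window is full so one early bad trace cannot fire it.
        return window_full and mean < self.min_mean
```

A production setup would more likely express this as an alert rule in the tracing or metrics platform, but the logic is the same: something has to compute over the stream, or the stream tells you nothing.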

The throughline across all ten: a measurement you can trust is held out from what the system learned, scored on more than one axis, judged by an anchored judge, compared against a baseline, sized so noise can't swing it, rerun as a regression gate on every change, validated against live traffic, versioned so it stays comparable, and attached to the traces it's supposed to watch — or it is a number that looks like proof and lies. Every gotcha above is a way an evaluation tells you the system is fine when it isn't, which is worse than not measuring, because you act on it.

Closing

These ten are the evaluation failures Cleon has seen most often, across both Agentforce-native and external builds. The discipline that prevents them is principle 3 taken seriously — if you can't evaluate it, you can't ship it — plus the part people skip: a bad eval is not a small version of a good one, it's a liability, because it converts honest uncertainty into false confidence and you ship on the strength of it. The eval set is the regression net the agents, grounding, and prompting work all hang from; a hole in the net isn't local to evaluation, it's a hole under everything upstream. Get the measurement honest and most of these never fire; skip it and you'll trust a green dashboard right up until a customer tells you it was lying.

If an evaluation gotcha bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.

Agent gotchas — the agent failures the eval set is the regression net for
Grounding gotchas — retrieval failures an offline eval has to cover before they ship
Prompting gotchas — the per-change eval that scores a prompt edit before it lands
Debugging prompts — isolate the variable, then re-score
AI Engineering principles — evals (3), the model is the easy part (4), cost is a feature (6), non-determinism needs a gate (8), trace everything (11)
Marketing Cloud AI Style Guide — which AI surface for which job
Evaluation Style Guide — the bar these gotchas distill into a checklist
Debugging evals — when one of these already bit, how to confirm which
Companion pages in this subcategory — Agentforce testing and observability, tracing and monitoring, debugging evals, and the Evaluation Style Guide — go deeper on each surface.
