Debugging evals: when the number lies, and how to confirm it

An eval is supposed to be the thing you trust instead of a feeling: a number you can score, compare, and defend (see What is evaluation). So a misleading eval is a special kind of expensive. It's worse than no eval at all, because no eval makes you cautious and a green check makes you confident, and confident-and-wrong is how a regression ships. When the eval and reality disagree, the instinct is to argue with reality: production is "just edge cases," the reviewers are "too picky," the upgrade "should be fine." That instinct is the bug. The eval is a measurement, and a measurement that disagrees with the world is a measurement to fix, not a world to dismiss.

This page is symptom-driven, the same shape as debugging prompts: find the failure that matches what you're seeing, confirm what's actually wrong before you touch anything, and fix the eval so the lie can't repeat. It assumes you have the two assets the whole subcategory is built on — an eval set and a trace of what production actually did — because the confirmation step in every flow below is "go look at real data," and without the set and the traces you have nothing to look at. That dependency is the throughline, so it's worth stating up front: debugging an eval is mostly a test of whether you had the measurement in place before you needed it. None of this is toolkit-specific — an Agentforce agent graded in Testing Center, a Claude call scored against an Anthropic eval, or an off-platform graph evaluated in LangSmith all lie in the same three shapes.

"It passes offline but production is worse"

How to recognize it. The offline eval is green — the set passes, the score is where you want it — and the live system is visibly worse: more complaints, more escalations, more answers you wouldn't have shipped. The lab says fine and the field says broken, and the gap between them is the whole problem.

What's actually wrong. Offline is not online, and the gap has three usual causes. First, distribution shift: your eval set no longer matches the traffic. LangSmith draws this line exactly — offline evaluation runs against "datasets with reference outputs," online evaluation runs over "real production traces without reference outputs" — and if the production distribution drifted away from the set, the set is grading a world that no longer exists. Second, a stale or unrepresentative set: it was built from clean, happy-path cases and never stocked with the edge cases production actually sends — the "irrelevant or nonexistent input," "overly long input," and "ambiguous" cases Anthropic's eval-design guidance says to factor in and that demos never see. Third, leakage flattering the score: the eval set overlaps the prompt's own examples, or a case the model was effectively tuned on, so the set tests recall of something already seen instead of performance on something new — and the score is high for the wrong reason.

How to confirm it. Stop reasoning about it and go look at production. Sample real traces — a stratified handful across the traffic you actually serve, not the three that prompted the complaint — and grade them by hand with the same rubric the offline eval uses. Two outcomes, and they point opposite ways. If the sampled production traces score worse than the offline set, the gap is real and it's distribution or staleness — your set isn't testing what production sends. If the sampled traces score about the same as the offline set but users are still unhappy, the gap is in your criteria, not your set — you're measuring the wrong thing, and that's a what-is-evaluation problem (your success criteria don't capture what "good" means to a user), not a debugging-the-set one.
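
A minimal sketch of that confirmation step, assuming traces are plain dictionaries with a field you can stratify on. The `intent` key, the function names, and the 0.05 tolerance are illustrative assumptions, not part of any toolkit. The hand grading stays with a human; the code only picks the sample and reads the result.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_sample(traces, key, per_stratum, seed=0):
    """Pick up to `per_stratum` traces from each stratum (e.g. intent or channel)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace[key]].append(trace)
    sample = []
    for name in sorted(buckets):
        bucket = buckets[name]
        sample.extend(rng.sample(bucket, min(per_stratum, len(bucket))))
    return sample


def diagnose_gap(offline_scores, production_scores, tolerance=0.05):
    """Compare hand-graded production scores with the offline set's scores.

    Both lists are assumed to use the same rubric and the same scale.
    """
    gap = mean(offline_scores) - mean(production_scores)
    if gap > tolerance:
        # Production grades worse than the lab: distribution shift or a stale set.
        return "set_problem"
    # Scores match. If users are still unhappy, the success criteria are off.
    return "criteria_problem"
```

Sampling per stratum rather than uniformly keeps rare traffic types in the sample, which is the point of not grading only the three traces that prompted the complaint.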

The fix. Feed the real failures back into the set. The production traces that scored badly are the most valuable cases you will ever add, because they encode the ways your system actually breaks in the wild. Label them with their known-good outcome and they become permanent regression cases (see Eval datasets and metrics). This is the loop that closes the offline-online gap: production surfaces a failure, the failure becomes an offline case, and the next change that would re-introduce it now trips a red case in the lab instead of a complaint in the field. Do it continuously, not once: distribution keeps drifting, so a set that mirrors reality this quarter is stale by the next unless the traces keep flowing back in. And if the confirmation step pointed at leakage, the fix is a clean held-out split: cases the prompt has never seen, kept separate from anything the system was tuned on, so the score measures generalization instead of memory.
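
One way to sketch both halves of that fix, assuming eval cases and traces are dictionaries with an `input` field and that failed traces have already been hand-labeled with an `expected` outcome. The field names and helpers are hypothetical. The leakage check only catches exact matches after whitespace and case normalization; near-duplicates would need fuzzy matching, which is not shown.

```python
import hashlib


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_leaked_cases(eval_cases, seen_texts):
    """Eval cases whose input also appears in prompt examples or tuning data."""
    seen = {fingerprint(text) for text in seen_texts}
    return [case for case in eval_cases if fingerprint(case["input"]) in seen]


def add_regression_cases(eval_set, failed_traces):
    """Return a new set with labeled production failures appended, skipping duplicates."""
    known = {fingerprint(case["input"]) for case in eval_set}
    added = []
    for trace in failed_traces:
        fp = fingerprint(trace["input"])
        if fp in known:
            continue
        known.add(fp)
        added.append({
            "input": trace["input"],
            "expected": trace["expected"],  # the known-good outcome a human assigned
            "source": "production",
        })
    return eval_set + added
```

Returning a new list instead of mutating the set in place makes it easier to save each revision as its own version.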

"The LLM-judge disagrees with humans"

How to recognize it. Your LLM-as-judge scores the output high, and the humans who read the same output don't agree — reviewers flag answers the judge passed, or pass answers the judge failed. The grader and the people it's supposed to stand in for have diverged, which means the judge's ten thousand scores are measuring something other than quality.

What's actually wrong. A judge that disagrees with humans is uncalibrated, and there are three usual reasons. First, a vague rubric: "rate the quality" gives the judge nothing concrete to check, so it drifts, and a drifting rubric produces a drifting score. Anthropic's bar is a rubric specific enough to be mechanical ("the answer must mention 'Acme Inc.' in the first sentence; if not, grade it incorrect"). Second, a judge that is uncalibrated in the literal sense: nobody ever checked it against human labels, so "the judge says 4 out of 5" was never confirmed to mean what a human means by 4 out of 5. It's a number you decided to believe. Third, a judge bias that has nothing to do with the rubric: position bias (favoring whichever output came first in a comparison), verbosity bias (reading longer as better), or self-preference bias (favoring output in its own model's style). These are the failure modes covered in LLM-as-judge, and each makes the score look plausible while being wrong.

How to confirm it. Re-grade a human-labeled sample and measure the agreement directly. Take a set of cases humans have scored, run the judge on those same cases, and compare verdict-to-verdict — where do they diverge, and is the divergence patterned? The pattern names the cause: if the judge consistently rewards the longer answer, it's verbosity bias; if it flips when you swap the order of two compared outputs, it's position bias; if it favors the answer written in its own family's prose, it's self-preference; if the disagreement is scattered with no pattern, the rubric is too vague to grade consistently and the judge is guessing. The human-labeled sample is the instrument here — without it you can only suspect the judge is wrong, never confirm where.
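
The verdict-to-verdict comparison can be sketched as below, assuming human and judge labels are two equal-length lists in the same case order. The function name and return shape are illustrative. Cohen's kappa is included alongside raw agreement because raw agreement looks high whenever one label dominates; kappa subtracts the agreement you would expect by chance.

```python
from collections import Counter


def agreement_report(human, judge):
    """Raw agreement, Cohen's kappa, and the indices where the verdicts differ."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Chance agreement: probability both raters pick the same label independently.
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    # If both raters always use one identical label, kappa is undefined; report 1.0.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    disagreements = [i for i, (h, j) in enumerate(zip(human, judge)) if h != j]
    return {"agreement": observed, "kappa": kappa, "disagreements": disagreements}
```

The `disagreements` indices are the cases worth reading by hand: that is where a pattern such as "the judge passed every long answer" becomes visible.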

The fix. Fix the rubric, not the humans. The reflex when a judge disagrees with people is to "recalibrate" by nudging the rubric toward the score you wanted — that's fitting the grader to a preconception, and it corrupts the eval. The discipline runs the other way: the humans are the source of truth, so you change the judge until it tracks them. Tighten the rubric until it's mechanical enough that a careful human and the judge would reach the same verdict; pin the scale so "4" and "5" have a stated, checkable difference instead of a vibe; grade with a different model than the one that produced the output (Anthropic's standing note — "best practice to use a different model to evaluate than the model used to generate the evaluated output" — which removes self-preference at the root); and for position bias, swap the order and only trust verdicts that survive the swap. Then re-check against the human sample. A judge is trustworthy only after it agrees with humans on cases you've actually verified — and "trustworthy" is not a one-time stamp, so re-calibrate when the rubric changes or the judged task drifts.
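
A sketch of the swap check for pairwise comparisons. It assumes a hypothetical `judge(shown_first, shown_second)` callable that returns `"A"` when it prefers the output shown first and `"B"` when it prefers the one shown second; a real judge would wrap a model call, which is left out here.

```python
def swap_consistent_verdict(judge, first, second):
    """Return 'first' or 'second' if the verdict survives the swap, else None."""
    forward = judge(first, second)   # `first` is shown in position A
    backward = judge(second, first)  # `first` is now shown in position B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    # The preferred output changed with the ordering: treat the verdict as untrusted.
    return None
```

Cases that return `None` are better routed to a human or dropped than averaged in, since their verdict depends on presentation order rather than on the outputs.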

"A model (or prompt) upgrade tanked quality"

How to recognize it. Something changed (a model version bumped, a prompt got edited, a retriever got swapped) and quality dropped, but nothing obvious flagged it. The change looked safe, shipped quietly, and the degradation showed up downstream as worse output that nobody connected back to the upgrade. This is the silent regression, and what makes it dangerous is precisely that nothing turned red when it happened.

What's actually wrong. There was no regression gate. A change reached production without being scored against a frozen eval set first, so a quality drop that a single eval run would have caught instead shipped on faith and surfaced as a complaint. LangSmith names this category directly — "regression testing" and "backtesting" are offline-evaluation jobs, the comparison of a new version against a known baseline on a fixed set — and the whole point of an offline eval is that it runs before the change is live. A regression that reaches production is, by definition, a regression that skipped the gate.

How to confirm it. Run the frozen eval set against both versions and read the diff. Hold the set fixed and score the before and the after. LangSmith captures each run as an "experiment" and "supports comparing multiple experiments side-by-side," which is exactly the before-versus-after read you need, and the cases that dropped tell you where the regression lives. Then bisect the cause: if a model version and a prompt edit shipped together, you can't tell which one hurt until you separate them, so run the old model against the new prompt and the new model against the old prompt. Whichever pairing recovers the score isolates the culprit: either the prompt change was fitted to the old model and didn't transfer, or the model upgrade reads the existing prompt differently. The frozen set is what makes the diff meaningful: change the set and the system at the same time and you can't attribute the drop to either, which is why the set has to be versioned and held constant across the comparison (see Eval datasets and metrics).
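
A rough sketch of the diff and the bisect, assuming per-case scores are stored as dictionaries keyed by case id and that you have already run the frozen set on each model and prompt pairing. The function names, the `("old", "new")` key convention, and the 0.02 tolerance are illustrative assumptions, not a LangSmith API.

```python
def regressed_cases(before, after, min_drop=0.0):
    """Case ids whose score dropped by more than `min_drop`, worst drop first."""
    shared = before.keys() & after.keys()  # only compare cases present in both runs
    drops = {case_id: before[case_id] - after[case_id] for case_id in shared}
    hits = [case_id for case_id, drop in drops.items() if drop > min_drop]
    return sorted(hits, key=lambda case_id: (-drops[case_id], case_id))


def bisect_change(mean_scores, tolerance=0.02):
    """Attribute a drop to the model, the prompt, or their interaction.

    `mean_scores` maps (model_version, prompt_version) to the mean score on the
    frozen set, with each version given as "old" or "new".
    """
    baseline = mean_scores[("old", "old")]
    culprits = []
    if baseline - mean_scores[("new", "old")] > tolerance:
        culprits.append("model")   # new model hurts even with the old prompt
    if baseline - mean_scores[("old", "new")] > tolerance:
        culprits.append("prompt")  # new prompt hurts even on the old model
    # Neither change hurts alone, so the drop only appears when they are combined.
    return culprits or ["interaction"]
```

Both functions assume the same frozen set produced every score; if the set changed between runs, the comparison says nothing.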

The fix. Make the eval set a pre-merge gate so this can't reach production again. The one-time fix is to re-tune against the set on the new version until the score recovers; the permanent fix is structural — the eval set runs automatically on every model bump, prompt edit, and retriever swap before it merges, and a drop below baseline blocks the change. That converts "the upgrade quietly made it worse, and a user found out for us" into "a red case blocked the merge, and we found out before anyone shipped." This is the regression net the whole subcategory points at: the agents you build, the retrieval you tune, the prompts you harden all stay correct only because a change that would break them trips a gate first. An eval set that doesn't gate anything is documentation; an eval set wired into the merge is the thing that stops the silent regression from ever being silent.
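
The gate itself can be very small. The sketch below assumes the CI job has already run the frozen set on the candidate and has the stored baseline mean at hand; `gate`, `exit_code`, and the `max_drop` allowance are hypothetical names, and how your CI system consumes the exit code depends on your setup.

```python
from statistics import mean


def gate(candidate_scores, baseline_mean, max_drop=0.0):
    """Return (passed, candidate_mean) for a candidate scored on the frozen set."""
    candidate_mean = mean(candidate_scores)
    passed = candidate_mean >= baseline_mean - max_drop
    return passed, candidate_mean


def exit_code(candidate_scores, baseline_mean, max_drop=0.0):
    """Print a one-line verdict and return 0 (allow merge) or 1 (block merge)."""
    passed, candidate_mean = gate(candidate_scores, baseline_mean, max_drop)
    status = "PASS" if passed else "BLOCK"
    print(f"{status}: candidate {candidate_mean:.3f} vs baseline {baseline_mean:.3f}")
    return 0 if passed else 1
```

In a CI script you would typically finish with `sys.exit(exit_code(scores, baseline))`, so a non-zero status fails the job and blocks the change. A small `max_drop` is one way to tolerate judge noise, at the cost of letting small regressions accumulate.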

The throughline

Three ways an eval lies — offline-but-production-worse, judge-disagrees-with-humans, upgrade-tanked-quality — and one fact underneath all three: every one is cheaper to debug when the measurement was already in place. The offline-online gap closes only if you have production traces to sample and a set to feed them back into. The judge disagreement resolves only if you have a human-labeled sample to calibrate against. The silent regression gets caught only if a frozen eval set was already gating the merge. Debugging an eval is, almost entirely, a test of whether you built the measurement before you needed it — the team that did is sampling traces and reading a diff by lunch; the team that didn't is reconstructing what happened from memory and a complaint. That's why this subcategory front-loads the eval set and the traces: not as bureaucracy, but because they are the instruments you debug with, and you cannot install them after the failure you needed them for. Evaluate before you ship, trace everything, and a misleading eval becomes a thing you confirm and fix in an afternoon instead of a mystery you argue about for a week.

If an eval misled your team in a shape that isn't here, write to hello@wearecleon.com — we add it, with credit.

Related

What is evaluation — the loop and the offline/online split these failures break
Eval datasets and metrics — the set you feed failures back into, version, and freeze for the gate
LLM-as-judge — the rubric, calibration, and biases behind the judge-disagrees flow
Evaluation gotchas — the failure modes this page operationalizes, each with the question to ask first
Agentforce testing and observability — where you run cases and read production for an agent on-platform
Tracing and monitoring — the production traces the confirmation step in every flow depends on
Debugging prompts — the same symptom-driven method one layer down, for the prompt itself
Debugging grounding — when the bad answer is a retrieval miss, not an eval one
AI Engineering principles — evaluate before you ship (4), trace everything (11), a demo is not a product (1)
