Evaluation & Observability

Knowing it works — and stays working: evals, test sets, LLM-as-judge, tracing, regression, and the monitoring that catches a silent degrade before a customer does.

Foundation · 2

Production note
Evaluation gotchas: how a measurement lies to you
An eval is supposed to be the one thing in an AI system you can trust — the number that tells you the prompt got better, the agent didn't regress, the new model is safe to ship. But an eval can lie: it can measure memorization instead of capability, optimize a number while losing the goal, score with a biased judge, or pass offline and fail live. Ten gotchas that make a measurement look like proof when it isn't, each with the question to answer first and the cost of trusting the wrong number.
Decision framework
Evaluation Style Guide: the bar a change clears before it ships
The opinionated rules Cleon applies to every evaluation: the first decision (what to measure and where), offline versus online, how to grade (deterministic, LLM-as-judge, or human), and the 'eval every change' gate that every other Style Guide in this catalog invokes. The discipline document that turns the evaluation gotchas into a checklist and principle 3 into practice: if you can't evaluate it, you can't ship it, and a measurement you can't defend is worse than none, because you act on it. This page gives 'eval every change' its home, and it composes Agentforce Testing Center, Anthropic eval tooling, and LangSmith according to where the system runs rather than picking a camp.
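As a rough illustration of what an 'eval every change' gate can look like, here is a minimal sketch in Python. It assumes each run produces one numeric score per example on the same eval set. The function name `passes_gate` and the `max_drop` tolerance are hypothetical, chosen for this example, and are not taken from the Style Guide itself.

```python
from statistics import mean
from typing import Sequence


def passes_gate(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,  # hypothetical tolerance; pick one you can defend
) -> bool:
    """Return True when the candidate's mean score is no more than
    max_drop below the baseline's mean score."""
    if not baseline_scores or not candidate_scores:
        raise ValueError("both runs need at least one scored example")
    return mean(candidate_scores) >= mean(baseline_scores) - max_drop


# Illustrative scores only: 1.0 = example passed, 0.0 = example failed.
baseline = [1.0, 1.0, 0.0, 1.0]
candidate = [1.0, 0.0, 0.0, 1.0]

if not passes_gate(baseline, candidate):
    print("gate failed: candidate regressed beyond the allowed drop")
```

A comparison of means is the simplest possible gate. On a small eval set a single flipped example moves the mean a lot, so a real gate would likely also look at per-example regressions or a confidence interval.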

Reference · 5

How-to · 1

How-to
Debugging evals: when the number lies, and how to confirm it
The eval said green and production is worse. Or the judge scores high and your reviewers disagree. Or a model upgrade you couldn't see tanked quality. A misleading eval is worse than no eval: it's a green check you trusted. The symptom-driven playbook for three ways an eval lies: offline passes but production is worse (distribution shift, a stale set, leakage flattering the score), the LLM-judge disagrees with humans (a vague rubric, an uncalibrated judge, a position/verbosity/self-preference bias), and a model-or-prompt upgrade silently regressed quality with no gate to catch it. Each with the symptom, what's actually wrong, how to confirm it, and the fix. The throughline: every one of these is cheaper to debug when you already have the eval set and the traces; debugging an eval is mostly a question of 'did you have the measurement before you needed it?'
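One way to confirm a position bias in a pairwise LLM judge is to show each pair in both orders and count how often the preferred answer changes. The sketch below assumes a judge exposed as a plain callable that returns "first" or "second". That interface, the name `position_flip_rate`, and the two stand-in judges are hypothetical and exist only to make the example runnable without a model call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of (prompt, a, b) pairs where the judge's preferred answer
    changes when the presentation order of a and b is swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for prompt, a, b in pairs:
        forward = judge(prompt, a, b)   # a is shown first
        backward = judge(prompt, b, a)  # b is shown first
        winner_forward = a if forward == "first" else b
        winner_backward = b if backward == "first" else a
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


# Stand-in judges for illustration; a real judge would call a model.
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased


def longer_wins(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "x", "yy"),
]
print(position_flip_rate(always_first, pairs))
print(position_flip_rate(longer_wins, pairs))
```

A flip rate near zero rules out position bias only. The `longer_wins` stand-in is perfectly order-consistent and still verbosity-biased, which is why each bias needs its own check against human labels.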

Reference · 5

What is evaluation? Measuring whether the system works, instead of hoping

Eval datasets and metrics: the test set is the product spec

LLM-as-judge: grading output that has no single right answer

Agentforce testing and observability: evaluating the agent where it lives

Tracing and monitoring: catching the degrade an eval set can't see
