Evaluation Style Guide: the bar a change clears before it ships

This is the page where Cleon stops describing what evaluation is and says what we do before a change ships. The reference pages lay out the parts — what evaluation is and the loop it runs, the datasets and metrics that make the test set a product spec, and the LLM-as-judge that grades what a rule can't reach. The gotchas lay out the ten ways a measurement lies. This Style Guide is the discipline that decides what to measure, where, and how — and the gate a change has to clear before its output touches a customer or a record.

The rules are short on purpose. When a rule needs an explanation, the explanation lives in the page it links to. This is the operational form of the AI Engineering principles — each rule below is one of those principles with its sleeves rolled up, and we cite the number so you can trace the rule back to the reasoning. It is the measurement-side companion to the agent Style Guide, the grounding Style Guide, and the prompting Style Guide: each of those ends a sentence with "eval every change," and this is the page that sentence points to. The regression net all three hang from is built here.

The first decision: what do you measure, and where?

Before you pick a metric or a tool, place the system on the surface it actually runs on. Evaluation is not one product — like grounding and prompting before it, the surfaces compose, picked by where the system runs, never rival camps you choose between (principle 7). The throughline every Style Guide in this catalog invokes — eval every change — holds identically across all three; only the tooling differs. Match the situation to the surface:

The system is…	What you reach for	Why
An Agentforce agent inside the Salesforce security model	Agentforce Testing Center (pre-release) + Agentforce Observability (in production)	You evaluate it where it lives. Testing Center runs cases against the agent before release; Observability exports the production traces — turns, LLM calls, actions, and metric scores — in OpenTelemetry (OTLP) into Data 360 or any collector. The eval inherits the platform's security model and stays next to the data the agent acts on.
A Claude or LangGraph system running off-platform	LangSmith (datasets + offline + online eval over traces) — with Claude as the judge inside it	The off-platform agent's eval loop and its production monitoring live in one place: datasets of examples with reference outputs, an offline run before you ship, and online scoring over live traces. The judge model is Claude; the harness around it is LangSmith.
The model layer, regardless of harness	Anthropic's eval tooling + the Console Evaluation tool	When you're grading the raw model behavior — a prompt change, a model swap — you evaluate against Anthropic's define-criteria-then-grade loop, with the Console's Evaluate tab comparing two prompt versions side by side. This grades the model independent of where it's later embedded.

You don't pick one surface and pledge loyalty. An agent grounded on Data 360 and orchestrated off-platform might be tested in Testing Center, graded at the model layer with a Claude judge, and traced in LangSmith — each surface doing the part it's built for. The skill is composing them to the system you actually built. The depth of each surface is a sibling page: Agentforce testing and observability for the platform, tracing and monitoring for the trace the online side depends on.

Offline and online: before you ship, and after

Evaluation happens in two places, and they answer different questions. Both belong in a mature system: they are not alternatives you choose between, they compose (principle 11: you can't evaluate what you can't replay). Skip the first and you ship blind; skip the second and a customer is the one who tells you about the degradation.

	Offline	Online
Runs	Before you ship	After you ship
Against	A frozen eval set with known-good answers	Live production traces, no reference answer
The question	Is this change safe to ship?	Is the shipped system still working right now?
Grader	Compares output to ground truth	Reference-free — judges the live response on its own properties
Catches	A regression, before any user sees it	The silent drift a frozen set can't contain

The honest division of labor: offline proves a change is safe to ship; online proves it stayed good once it shipped. A green offline eval is necessary but not sufficient: production runs on a distribution you didn't curate, and a system that aces the offline suite can still degrade live. See what is evaluation for the loop behind both, and tracing and monitoring for the trace the online side runs on.
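The difference between the two graders fits in a few lines. This is a minimal sketch, not any platform's API: `grade_offline`, `grade_online`, and the three properties checked online (non-empty, a length cap, no unfilled template markers) are hypothetical stand-ins for whatever your system needs.

```python
def grade_offline(output: str, reference: str) -> bool:
    """Offline: compare the output to a known-good answer from the frozen set."""
    return output.strip().lower() == reference.strip().lower()


def grade_online(output: str, max_chars: int = 2000) -> dict[str, bool]:
    """Online: no reference exists, so score properties of the live response itself."""
    text = output.strip()
    return {
        "non_empty": bool(text),
        "within_length": len(text) <= max_chars,
        # Hypothetical property: an unfilled "{{...}}" placeholder reached the user.
        "no_unfilled_template": "{{" not in text and "}}" not in text,
    }


if __name__ == "__main__":
    print(grade_offline("  Billing ", "billing"))
    print(grade_online("Your refund was issued on {{date}}."))
```

The offline grader cannot run in production because it needs a reference answer; the online grader cannot prove correctness because it only checks properties. That is why the two halves compose rather than substitute for each other.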

How to grade: deterministic, judge, or human

Every case needs a way to turn an output into a score. There is no single grader for a whole set — you pick the method that fits each case's shape, and one eval set routinely uses several. The rule is one line: use the cheapest grader that can express the criterion. Deterministic where the answer is exact, a model where it needs judgment you can spec in a rubric, a human where it needs judgment you can't.

Method	Good for	Cost
Deterministic metric	Clear-cut answers a rule can express — a category, a field, valid JSON, a number in range, a required phrase present	Cheapest, fastest, perfectly repeatable. Its limit is nuance: it checks a match, not whether prose is good.
LLM-as-judge	Open-ended output a rule can't reach — tone, coherence, faithfulness to a source — graded against a rubric you write	Cheap and scalable, but the judge is itself a model that must be calibrated before you trust it. A judge you didn't check is a number you decided to believe.
Human	The source of truth the other two are validated against — building the golden set, calibrating a judge, adjudicating genuinely ambiguous cases	Highest quality, highest cost. Spend it where it earns its place; never the method you scale to thousands of runs.

Anthropic's own guidance ranks the methods by exactly this trade: code-based grading is fastest and most reliable but "lacks nuance for more complex judgements"; human grading is most flexible and highest-quality but "slow and expensive"; LLM-based grading is "fast and flexible, scalable and suitable for complex judgement." The depth of building each one well lives in eval datasets and metrics (a method per case) and LLM-as-judge (the model-graded case, and the calibration that makes it trustworthy). Pick the cheapest grader that can actually answer the question this case asks, and keep human judgment underneath, because the other two are only as trustworthy as the ground truth a person established.
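One way to make "cheapest grader that can express the criterion" concrete is a per-case router. The sketch below is an illustration under assumptions: the case shape (`criterion`, `expected`, `rubric`), the grader names, and the `"low,high"` range format are all hypothetical, and the judge and human paths are left as a queueing decision rather than implemented.

```python
import json
from typing import Callable, Optional


def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()


def is_valid_json(output: str, expected: str = "") -> bool:
    try:
        json.loads(output)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def number_in_range(output: str, expected: str) -> bool:
    """`expected` is "low,high"; passes when the output parses to a number inside it."""
    low, high = (float(part) for part in expected.split(","))
    try:
        return low <= float(output) <= high
    except ValueError:
        return False


# Criteria a rule can express. Anything not listed here needs judgment.
DETERMINISTIC: dict[str, Callable[[str, str], bool]] = {
    "exact_match": exact_match,
    "valid_json": is_valid_json,
    "number_in_range": number_in_range,
}


def route(case: dict) -> str:
    """Pick the cheapest grading method that can express this case's criterion."""
    if case["criterion"] in DETERMINISTIC:
        return "deterministic"
    if case.get("rubric"):
        return "llm_judge"  # judgment you can spec in a rubric
    return "human"          # judgment you can't


def grade(case: dict, output: str) -> tuple[str, Optional[bool]]:
    """Score deterministic cases now; return None for cases queued to a judge or a human."""
    method = route(case)
    if method == "deterministic":
        return method, DETERMINISTIC[case["criterion"]](output, case.get("expected", ""))
    return method, None
```

Routing per case rather than per set is the point: one eval set can mix all three methods, and a model is only paid to grade what a rule cannot.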

The "eval every change" gate

This is the page that gives the discipline its home. Every other Style Guide in this catalog ends a rule with "eval every change" and points here — this is what they point to. Before a prompt, model, agent, retriever, or tool change ships, every box below is true, or the change isn't ready. Each one closes a gotcha that turned a green dashboard into a system that was quietly worse.

A regression set exists, and it's held out. There is a set of real cases the change is scored against — built from the failures that actually bit you, not invented at a desk — and not one of those cases appears in the prompt's examples, the tuning data, or anything the system retrieves at run time. A set that leaked into what the system learned measures memory, not capability. (Principle 3 · gotchas 1, 6.)
It ran on this change. The regression set reran automatically on this edit — the prompt, the model version, the tool, the retriever — before the change shipped, not after a customer found the break. If a change to any part of the system doesn't trip the set, the set isn't a gate. (Principle 3 · gotcha 6.)
The baseline is beaten or held. You have a recorded score for the version you're replacing, and this change scores at least as well on real cases, not just on the one input you tried. "Better" with no measured "before" is a feeling. The set also has to be big enough that a few cases going either way can't swing the verdict. (Principle 3 · gotchas 4, 5.)
The judge is calibrated. If an LLM-as-judge produced any of these scores, it was checked against human labels on a sample first, anchored to an explicit rubric, asked to reason before scoring, and graded with a different model than the one under test. An uncalibrated judge grades on length and order, and you'd never see it from inside the score. (Principle 4 · gotcha 3.)
The eval set is versioned. The set is pinned like code, and every score records the version it ran against, so last month's number and this month's are actually comparable. A ruler whose markings move can't detect change — the one thing it exists to do. (Principle 11 · gotcha 9.)
Online monitoring is on. The shipped system's traces are scored, not just collected — metrics and judgments attached so a quality drop trips an alert instead of waiting for a support spike. Offline cleared the change; online confirms it stayed good against traffic the set never contained. (Principle 11 · gotchas 7, 10.)

If any box is unchecked, the change isn't ready — and an evaluation failure looks exactly like a passing one until you act on the number. See debugging evals for how to find which part of the measurement lied when the score and reality disagree.
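Several of the boxes above can be checked mechanically. The sketch below covers three of them (held out, baseline held, versioned) under loud assumptions: `ScoreRecord`, `fingerprint`, `leaked_ids`, and `gate` are hypothetical names, the leakage test is a crude verbatim-substring check that will miss paraphrased leaks, and the `min_cases` default of 50 is an arbitrary placeholder, not a recommended size.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreRecord:
    set_version: str  # fingerprint of the eval set this score ran against
    n_cases: int
    pass_rate: float  # fraction of cases passed, 0.0 to 1.0


def fingerprint(cases: list[dict]) -> str:
    """Content hash of the eval set: any edit to any case yields a new version."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def leaked_ids(cases: list[dict], system_texts: list[str]) -> list[str]:
    """Ids of eval cases whose input appears verbatim in anything the system sees."""
    return [c["id"] for c in cases if any(c["input"] in text for text in system_texts)]


def gate(baseline: ScoreRecord, candidate: ScoreRecord,
         leaked: list[str], min_cases: int = 50) -> list[str]:
    """Return the reasons the change is not ready; an empty list clears the gate."""
    reasons = []
    if leaked:
        reasons.append(f"not held out: {len(leaked)} case(s) leaked")
    if candidate.set_version != baseline.set_version:
        reasons.append("scores ran against different eval set versions")
    if candidate.n_cases < min_cases:
        reasons.append(f"set too small: {candidate.n_cases} < {min_cases}")
    if candidate.pass_rate < baseline.pass_rate:
        reasons.append("baseline not held")
    return reasons
```

Returning the list of reasons instead of a bare boolean keeps the failure legible: a blocked change says which box is unchecked. Judge calibration and online monitoring are not covered by this sketch and still need their own checks.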

Patterns to prefer

Measure, don't sample — turn "it looked good in three tries" into a score over a real set with a rubric; a sample of three you chose yourself is an impression, not a test. (Principle 3 · gotcha 8.)
Cheapest grader that fits: deterministic where the answer is exact, a judge where it needs spec-able judgment, a human where the judgment can't be specified; don't pay a model to check what == can. (Principle 4 · gotcha 3.)
Eval set from real failures — the first ten cases that bit you in testing are worth more than a hundred synthetic ones; stock the set deliberately with the edges, because the cases you leave out break first. (Principle 3 · gotcha 6.)
Measure on more than one axis — pair task fidelity with a separate safety or latency check so no single number can be gamed in isolation; a metric optimized alone climbs by abandoning the goal. (Principle 3 · gotcha 2.)
Calibrate the judge before you trust it — check a model judge against human labels on a sample, anchor it to a rubric, grade with a different model; only a calibrated judge is evaluation rather than a number you've decided to believe. (Principle 4 · gotcha 3.)
Score the traces, don't just keep them — attach metrics to the production stream so a degrade trips an alert; raw logs with nothing computed over them are forensic material, not a monitor. (Principle 11 · gotcha 10.)
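Checking a judge against human labels comes down to measuring agreement on a sample. A minimal sketch, assuming categorical labels such as "pass" and "fail": it reports raw agreement and Cohen's kappa, which discounts the matches two raters would produce by chance. The function names are illustrative, and the bar a judge has to clear is a decision for your team, not something this code settles.

```python
from collections import Counter


def agreement(judge: list[str], human: list[str]) -> float:
    """Fraction of sampled cases where the judge's label matches the human's."""
    if len(judge) != len(human) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement corrected for what the two raters would match on by chance."""
    observed = agreement(judge, human)
    n = len(human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[label] * human_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
    judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
    print(f"agreement={agreement(judge, human):.2f}  kappa={cohens_kappa(judge, human):.2f}")
```

Raw agreement alone flatters a judge when one label dominates the sample, which is why the chance-corrected figure is worth reporting next to it.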

Patterns to refuse

Shipping a change unscored — "it worked in three tries" is a vibe, and the regression ships silently next to the fix; change one thing, score it against the set, keep what wins. (Gotcha 8 · principle 3.)
Claiming "better" with no baseline — an improvement nobody measured is a release you can't defend, and three sprints later nobody can say whether the system is better or worse than where it started. (Gotcha 4 · principle 3.)
Trusting an uncalibrated judge — "rate this 1 to 5" with no rubric grades on verbosity and order, and the bias is baked into every decision the eval drove. (Gotcha 3 · principle 4.)
Optimizing one number to death — make a single metric the only target and the system climbs it by the cheapest route, including the routes that abandon the goal the number stood for. (Gotcha 2 · principle 3.)
A ruler that moves under the measurement — editing the eval set without a version makes last month's score incomparable to this one, so a "regression" might just be a harder set. (Gotcha 9 · principle 11.)
Logging everything, measuring nothing — a warehouse of unscored traces is a haystack you search by hand after an incident, not a monitor that warns you before one. (Gotcha 10 · principle 11.)
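The gap between a warehouse of traces and a monitor is one computation over the stream. A minimal sketch, assuming each production trace has already been given a score between 0 and 1 by some grader: `QualityMonitor` is a hypothetical class, and the window, threshold, and minimum-sample defaults are placeholders to tune, not recommendations.

```python
from collections import deque


class QualityMonitor:
    """Rolling mean over scored traces; signals when it drops below a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.85, min_samples: int = 20):
        self.scores = deque(maxlen=window)  # oldest scores fall out automatically
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, score: float) -> bool:
        """Add one trace's score; return True when the rolling mean trips the alert."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few traces to tell drift from noise
        return sum(self.scores) / len(self.scores) < self.threshold


if __name__ == "__main__":
    monitor = QualityMonitor(window=5, threshold=0.8, min_samples=3)
    for score in [1.0, 1.0, 1.0, 0.0]:
        if monitor.record(score):
            print("quality alert: rolling mean below threshold")
```

In a real deployment the `True` return would page someone or open an incident; the point of the sketch is only that the score is computed as traces arrive, not after a support spike.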

Closing

None of these rules is hard to apply up front; all of them are expensive to discover live, because an evaluation failure looks exactly like a passing one until someone acts on the number. The throughline is the one that runs through every page in this subcategory: a measurement you can trust is held out from what the system learned, scored on more than one axis, judged by a calibrated judge, compared against a baseline, sized so noise can't swing it, versioned so it stays comparable, and attached to the traces it watches — or it is a number that looks like proof and lies, which is worse than no number, because you ship on it. The eval set is the regression net the agents, grounding, and prompting work all hang from; a hole in that net isn't local to evaluation, it's a hole under everything upstream. Principle 3, taken seriously: if you can't evaluate it, you can't ship it.

If you spot a rule missing — or one of these rules being violated in our public work — write to hello@wearecleon.com. We add it, or we fix it and we say so.

Evaluation gotchas — the ten ways a measurement lies that this Style Guide is designed to prevent
What is evaluation — the eval loop, offline versus online, and the vocabulary behind the first decision
Eval datasets and metrics — the test set as product spec, and a grading method per case
LLM-as-judge — grading open-ended output, and the calibration behind "trust a judge"
Agentforce testing and observability — the platform surface from the first-decision table
Tracing and monitoring — the trace the online half of the gate runs on
Debugging evals — finding which part of the measurement lied when score and reality disagree
Agent Style Guide — the agent-side anchor whose changes this gate scores
Grounding Style Guide — the retrieval-side anchor whose changes this gate scores
Prompting Style Guide — the prompt-side anchor whose "eval every change" points here
AI Engineering principles — the meta-rules these specifics operationalize
Marketing Cloud AI Style Guide — the which-surface decision, one layer out

Reference: