Evaluation Style Guide: the bar a change clears before it ships
The opinionated rules Cleon applies to every evaluation — the first decision (what to measure and where), offline versus online, how to grade (deterministic, LLM-as-judge, or human), and the 'eval every change' gate every other Style Guide in this catalog invokes. The discipline document that turns the evaluation gotchas into a checklist and principle 3 into practice: if you can't evaluate it, you can't ship it — and a measurement you can't defend is worse than none, because you act on it. The page that gives 'eval every change' its home, and composes Agentforce Testing Center, Anthropic eval tooling, and LangSmith by where the system runs rather than picks a camp.
This is the page where Cleon stops describing what evaluation is and says what we do before a change ships. The reference pages lay out the parts — what evaluation is and the loop it runs, the datasets and metrics that make the test set a product spec, and the LLM-as-judge that grades what a rule can't reach. The gotchas lay out the ten ways a measurement lies. This Style Guide is the discipline that decides what to measure, where, and how — and the gate a change has to clear before its output touches a customer or a record.
The rules are short on purpose. When a rule needs an explanation, the explanation lives in the page it links to. This is the operational form of the AI Engineering principles — each rule below is one of those principles with its sleeves rolled up, and we cite the number so you can trace the rule back to the reasoning. It is the measurement-side companion to the agent Style Guide, the grounding Style Guide, and the prompting Style Guide: each of those ends a sentence with "eval every change," and this is the page that sentence points to. The regression net all three hang from is built here.
The first decision: what do you measure, and where?
Before you pick a metric or a tool, place the system on the surface it actually runs on. Evaluation is not one product. As with grounding and prompting before it, the surfaces compose: you pick them by where the system runs, and they are never rival camps you choose between (principle 7). The throughline every Style Guide in this catalog invokes, eval every change, holds identically across all three; only the tooling differs. Match the situation to the surface:
| The system is… | What you reach for | Why |
|---|---|---|
| An Agentforce agent inside the Salesforce security model | Agentforce Testing Center (pre-release) + Agentforce Observability (in production) | You evaluate it where it lives. Testing Center runs cases against the agent before release; Observability exports the production traces — turns, LLM calls, actions, and metric scores — in OpenTelemetry (OTLP) into Data 360 or any collector. The eval inherits the platform's security model and stays next to the data the agent acts on. |
| A Claude or LangGraph system running off-platform | LangSmith (datasets + offline + online eval over traces) — with Claude as the judge inside it | The off-platform agent's eval loop and its production monitoring live in one place: datasets of examples with reference outputs, an offline run before you ship, and online scoring over live traces. The judge model is Claude; the harness around it is LangSmith. |
| The model layer, regardless of harness | Anthropic's eval tooling + the Console Evaluation tool | When you're grading the raw model behavior — a prompt change, a model swap — you evaluate against Anthropic's define-criteria-then-grade loop, with the Console's Evaluate tab comparing two prompt versions side by side. This grades the model independent of where it's later embedded. |
You don't pick one surface and pledge loyalty. An agent grounded on Data 360 and orchestrated off-platform might be tested in Testing Center, graded at the model layer with a Claude judge, and traced in LangSmith — each surface doing the part it's built for. The skill is composing them to the system you actually built. The depth of each surface is a sibling page: Agentforce testing and observability for the platform, tracing and monitoring for the trace the online side depends on.
Offline and online: before you ship, and after
Evaluation happens in two places, and they answer different questions. Both belong in a mature system: they compose rather than compete (principle 11: you can't evaluate what you can't replay). Skip the first and you ship blind; skip the second and a customer is the one who tells you the system degraded.
| | Offline | Online |
|---|---|---|
| Runs | Before you ship | After you ship |
| Against | A frozen eval set with known-good answers | Live production traces, no reference answer |
| The question | Is this change safe to ship? | Is the shipped system still working right now? |
| Grader | Compares output to ground truth | Reference-free — judges the live response on its own properties |
| Catches | A regression, before any user sees it | The silent drift a frozen set can't contain |
The honest division of labor: offline proves a change is safe to ship; online proves it stayed good once it shipped. A green offline eval is necessary and not sufficient — production runs on a distribution you didn't curate, and a system that aces the offline suite can still degrade live. See what is evaluation for the loop behind both, and tracing and monitoring for the trace the online side runs on.
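The difference between the two graders in the table can be sketched in a few lines. This is a minimal illustration, not any tool's API: the `Case` type, the `exact_match` and `within_length_budget` graders, and the 200-character budget are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Case:
    """One scored example. `reference` exists offline and is absent online."""
    input: str
    output: str
    reference: Optional[str] = None


def exact_match(case: Case) -> float:
    """Offline, reference-based: compare the output to the known-good answer."""
    if case.reference is None:
        raise ValueError("offline grading needs a reference answer")
    return 1.0 if case.output.strip() == case.reference.strip() else 0.0


def within_length_budget(case: Case, max_chars: int = 200) -> float:
    """Online, reference-free: judge the response on its own properties.

    The property checked here (non-empty and under a length budget) is a
    stand-in; a real online check would be whatever property you can defend.
    """
    return 1.0 if 0 < len(case.output) <= max_chars else 0.0


def mean_score(cases: List[Case], grader: Callable[[Case], float]) -> float:
    """Average a grader over a set of cases."""
    if not cases:
        raise ValueError("cannot score an empty set")
    return sum(grader(c) for c in cases) / len(cases)


# Offline: a frozen set with references, run before the change ships.
frozen_set = [
    Case("2+2", "4", reference="4"),
    Case("capital of France", "Lyon", reference="Paris"),
]
offline_score = mean_score(frozen_set, exact_match)

# Online: live traces with no reference, scored after the change ships.
live_traces = [
    Case("hi", "Hello, how can I help?"),
    Case("hi", ""),
]
online_score = mean_score(live_traces, within_length_budget)
```

The same `mean_score` loop serves both halves; what changes is whether the grader is allowed to look at a reference.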
How to grade: deterministic, judge, or human
Every case needs a way to turn an output into a score. There is no single grader for a whole set — you pick the method that fits each case's shape, and one eval set routinely uses several. The rule is one line: use the cheapest grader that can express the criterion. Deterministic where the answer is exact, a model where it needs judgment you can spec in a rubric, a human where it needs judgment you can't.
| Method | Good for | Cost |
|---|---|---|
| Deterministic metric | Clear-cut answers a rule can express — a category, a field, valid JSON, a number in range, a required phrase present | Cheapest, fastest, perfectly repeatable. Its limit is nuance: it checks a match, not whether prose is good. |
| LLM-as-judge | Open-ended output a rule can't reach — tone, coherence, faithfulness to a source — graded against a rubric you write | Cheap and scalable, but the judge is itself a model that must be calibrated before you trust it. A judge you didn't check is a number you decided to believe. |
| Human | The source of truth the other two are validated against: building the golden set, calibrating a judge, adjudicating genuinely ambiguous cases | Highest quality, highest cost. Spend it where it earns its place; it is never the method you scale to thousands of runs. |
Anthropic's own guidance ranks the methods by exactly this trade: code-based is fastest and most reliable but "lacks nuance for more complex judgements"; human grading is most flexible and highest-quality but "slow and expensive"; LLM-based grading is "fast and flexible, scalable and suitable for complex judgement." The depth of building each one well lives in eval datasets and metrics (a method per case) and LLM-as-judge (the model-graded case, and the calibration that makes it trustworthy). Pick the cheapest grader that can actually answer the question this case asks, and keep human judgement underneath, because the other two are only as trustworthy as the ground truth a person established.
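To make the deterministic row concrete, here is a minimal sketch of three rule-based graders (valid JSON, a number in range, a required phrase) and the "cheapest grader that can express the criterion" rule as a function. Every name below is hypothetical and chosen for illustration; none of it comes from a specific eval library.

```python
import json


def is_valid_json(output: str) -> bool:
    """Deterministic check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def number_in_range(output: str, low: float, high: float) -> bool:
    """Deterministic check: is the output a number within [low, high]?"""
    try:
        return low <= float(output.strip()) <= high
    except ValueError:
        return False


def contains_phrase(output: str, phrase: str) -> bool:
    """Deterministic check: is a required phrase present (case-insensitive)?"""
    return phrase.lower() in output.lower()


def pick_method(criterion_is_exact: bool, rubric_can_express_it: bool) -> str:
    """Route a case to the cheapest method that can express its criterion."""
    if criterion_is_exact:
        return "deterministic"
    if rubric_can_express_it:
        return "llm_judge"
    return "human"
```

A case such as "the reply must be valid JSON" routes to `deterministic`; "the reply is faithful to the source" routes to `llm_judge` once a rubric exists; a case nobody can yet write a rubric for stays with a human.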
The "eval every change" gate
This is the page that gives the discipline its home. Every other Style Guide in this catalog ends a rule with "eval every change" and points here — this is what they point to. Before a prompt, model, agent, retriever, or tool change ships, every box below is true, or the change isn't ready. Each one closes a gotcha that turned a green dashboard into a system that was quietly worse.
- A regression set exists, and it's held out. There is a set of real cases the change is scored against — built from the failures that actually bit you, not invented at a desk — and not one of those cases appears in the prompt's examples, the tuning data, or anything the system retrieves at run time. A set that leaked into what the system learned measures memory, not capability. (Principle 3 · gotchas 1, 6.)
- It ran on this change. The regression set reran automatically on this edit — the prompt, the model version, the tool, the retriever — before the change shipped, not after a customer found the break. If a change to any part of the system doesn't trip the set, the set isn't a gate. (Principle 3 · gotcha 6.)
- The baseline is beaten or held. You have a recorded score for the version you're replacing, and this change scores at least as well on real cases, rather than just looking better on the one input you tried. "Better" with no measured "before" is a feeling. The set is also big enough that a few cases going either way can't swing the verdict. (Principle 3 · gotchas 4, 5.)
- The judge is calibrated. If an LLM-as-judge produced any of these scores, it was checked against human labels on a sample first, anchored to an explicit rubric, asked to reason before scoring, and graded with a different model than the one under test. An uncalibrated judge grades on length and order, and you'd never see it from inside the score. (Principle 4 · gotcha 3.)
- The eval set is versioned. The set is pinned like code, and every score records the version it ran against, so last month's number and this month's are actually comparable. A ruler whose markings move can't detect change — the one thing it exists to do. (Principle 11 · gotcha 9.)
- Online monitoring is on. The shipped system's traces are scored, not just collected — metrics and judgments attached so a quality drop trips an alert instead of waiting for a support spike. Offline cleared the change; online confirms it stayed good against traffic the set never contained. (Principle 11 · gotchas 7, 10.)
If any box is unchecked, the change isn't ready — and an evaluation failure looks exactly like a passing one until you act on the number. See debugging evals for how to find which part of the measurement lied when the score and reality disagree.
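The checklist above can be encoded as a small pre-ship check. This is a sketch under stated assumptions: the `GateReport` fields and the `unchecked_boxes` function are hypothetical names, each field is assumed to be filled in by your own pipeline, and the sizing requirement on the regression set is left out because it depends on the set.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GateReport:
    """One record per change; every field maps to one box in the gate."""
    regression_set_held_out: bool
    ran_on_this_change: bool
    baseline_score: float
    candidate_score: float
    judge_calibrated: bool      # also True when no LLM judge produced a score
    eval_set_version: str       # empty string means the set is not pinned
    online_monitoring_on: bool


def unchecked_boxes(report: GateReport) -> List[str]:
    """Return the boxes that are not yet true; an empty list means ready."""
    failures = []
    if not report.regression_set_held_out:
        failures.append("regression set is not held out")
    if not report.ran_on_this_change:
        failures.append("regression set did not run on this change")
    if report.candidate_score < report.baseline_score:
        failures.append("baseline is not beaten or held")
    if not report.judge_calibrated:
        failures.append("judge is not calibrated")
    if not report.eval_set_version:
        failures.append("eval set is not versioned")
    if not report.online_monitoring_on:
        failures.append("online monitoring is off")
    return failures


def ready_to_ship(report: GateReport) -> bool:
    """The change is ready only when every box is checked."""
    return not unchecked_boxes(report)
```

Returning the list of unchecked boxes, rather than a bare boolean, keeps the failure legible: the person blocked by the gate sees which box to close.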
Patterns to prefer
- Measure, don't sample — turn "it looked good in three tries" into a score over a real set with a rubric; a sample of three you chose yourself is an impression, not a test. (Principle 3 · gotcha 8.)
- Cheapest grader that fits: deterministic where the answer is exact, a judge where it needs spec-able judgment, a human where it doesn't; don't pay a model to check what `==` can. (Principle 4 · gotcha 3.)
- Eval set from real failures: the first ten cases that bit you in testing are worth more than a hundred synthetic ones; stock the set deliberately with the edges, because the cases you leave out break first. (Principle 3 · gotcha 6.)
- Measure on more than one axis — pair task fidelity with a separate safety or latency check so no single number can be gamed in isolation; a metric optimized alone climbs by abandoning the goal. (Principle 3 · gotcha 2.)
- Calibrate the judge before you trust it — check a model judge against human labels on a sample, anchor it to a rubric, grade with a different model; only a calibrated judge is evaluation rather than a number you've decided to believe. (Principle 4 · gotcha 3.)
- Score the traces, don't just keep them — attach metrics to the production stream so a degrade trips an alert; raw logs with nothing computed over them are forensic material, not a monitor. (Principle 11 · gotcha 10.)
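"Check a model judge against human labels on a sample" can be made measurable. The sketch below computes raw agreement and Cohen's kappa between human labels and judge labels on the same cases. Cohen's kappa is one standard agreement statistic we are choosing for illustration; this guide does not prescribe it, and any acceptance threshold you set on top of it is your own call.

```python
from collections import Counter
from typing import List, Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of cases where the judge gave the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label at random.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human_labels: List[str] = ["pass", "pass", "fail", "fail"]
judge_labels: List[str] = ["pass", "pass", "fail", "pass"]

raw = agreement_rate(human_labels, judge_labels)    # 0.75
kappa = cohens_kappa(human_labels, judge_labels)    # 0.5
```

Raw agreement alone flatters a judge when one label dominates the sample; the chance-corrected figure is the harder one to game, which is why the two are reported together here.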
Patterns to refuse
- Shipping a change unscored — "it worked in three tries" is a vibe, and the regression ships silently next to the fix; change one thing, score it against the set, keep what wins. (Gotcha 8 · principle 3.)
- Claiming "better" with no baseline — an improvement nobody measured is a release you can't defend, and three sprints later nobody can say whether the system is better or worse than where it started. (Gotcha 4 · principle 3.)
- Trusting an uncalibrated judge — "rate this 1 to 5" with no rubric grades on verbosity and order, and the bias is baked into every decision the eval drove. (Gotcha 3 · principle 4.)
- Optimizing one number to death — make a single metric the only target and the system climbs it by the cheapest route, including the routes that abandon the goal the number stood for. (Gotcha 2 · principle 3.)
- A ruler that moves under the measurement — editing the eval set without a version makes last month's score incomparable to this one, so a "regression" might just be a harder set. (Gotcha 9 · principle 11.)
- Logging everything, measuring nothing — a warehouse of unscored traces is a haystack you search by hand after an incident, not a monitor that warns you before one. (Gotcha 10 · principle 11.)
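One way to keep the ruler from moving is to derive the eval set's version from its content and stamp it on every score. The sketch below is an assumption about how that could look, not a prescribed mechanism: `eval_set_version`, `record_score`, and `comparable` are hypothetical names, and cases are assumed to be plain JSON-serializable dicts.

```python
import hashlib
import json
from typing import Any, Dict, List


def eval_set_version(cases: List[Dict[str, Any]]) -> str:
    """Short content hash of the set; any edit to any case changes it.

    Case order is part of the hash, so reordering also counts as a change.
    """
    canonical = json.dumps(
        cases, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_score(score: float, cases: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Attach the eval set version to the score it was measured against."""
    return {"score": score, "eval_set_version": eval_set_version(cases)}


def comparable(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """Two scores are comparable only if they ran against the same set."""
    return a["eval_set_version"] == b["eval_set_version"]


last_month_cases = [{"input": "2+2", "reference": "4"}]
this_month_cases = [{"input": "2+2", "reference": "four"}]  # edited case

last_month = record_score(0.91, last_month_cases)
this_month = record_score(0.84, this_month_cases)
```

Here `comparable(last_month, this_month)` is false, so the drop from 0.91 to 0.84 cannot be read as a regression until both versions are scored against the same set.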
Closing
None of these rules is hard to apply up front; all of them are expensive to discover live, because an evaluation failure looks exactly like a passing one until someone acts on the number. The throughline is the one that runs through every page in this subcategory: a measurement you can trust is held out from what the system learned, scored on more than one axis, judged by a calibrated judge, compared against a baseline, sized so noise can't swing it, versioned so it stays comparable, and attached to the traces it watches — or it is a number that looks like proof and lies, which is worse than no number, because you ship on it. The eval set is the regression net the agents, grounding, and prompting work all hang from; a hole in that net isn't local to evaluation, it's a hole under everything upstream. Principle 3, taken seriously: if you can't evaluate it, you can't ship it.
If you spot a rule missing — or one of these rules being violated in our public work — write to hello@wearecleon.com. We add it, or we fix it and we say so.
Related
- Evaluation gotchas — the ten ways a measurement lies that this Style Guide is designed to prevent
- What is evaluation — the eval loop, offline versus online, and the vocabulary behind the first decision
- Eval datasets and metrics — the test set as product spec, and a grading method per case
- LLM-as-judge — grading open-ended output, and the calibration behind "trust a judge"
- Agentforce testing and observability — the platform surface from the first-decision table
- Tracing and monitoring — the trace the online half of the gate runs on
- Debugging evals — finding which part of the measurement lied when score and reality disagree
- Agent Style Guide — the agent-side anchor whose changes this gate scores
- Grounding Style Guide — the retrieval-side anchor whose changes this gate scores
- Prompting Style Guide — the prompt-side anchor whose "eval every change" points here
- AI Engineering principles — the meta-rules these specifics operationalize
- Marketing Cloud AI Style Guide — the which-surface decision, one layer out