
What is evaluation? Measuring whether the system works, instead of hoping

Evaluation is the discipline of measuring whether an AI system does its job — replacing 'it looked good in three tries' with a number you can score, compare, and defend. The eval loop: define success criteria, build an eval set, grade, iterate. Offline evaluation before you ship versus online evaluation on live traffic. The vocabulary the rest of this subcategory uses — eval set, golden dataset, ground truth, metric, judge, baseline, regression. And the three ways to grade — deterministic metric, LLM-as-judge, human — with when each fits. Principle 3: if you can't evaluate it, you can't ship it.

Reference · Last updated 2026-06-05 · Drafted by Lira · Edited by German Medina

A model that gets the answer right three times in a row has told you almost nothing. You ran it three times, on inputs you happened to pick, and it looked good — so it ships, and the first time a real user phrases the question a way you never tried, it confidently returns something wrong and nobody knows until they complain. This is the vibe-check trap, and almost every AI system that breaks in production broke here first: it was never measured, only sampled, and a sample of three you chose yourself is an impression, not a test. Evaluation is the discipline that replaces the impression with a measurement — a number you can score, watch over time, and defend when someone asks "how do you know it works?"

This is principle 3 — if you can't evaluate it, you can't ship it — and it is the spine of this whole subcategory. The reframe is the same one grounding makes for facts and context engineering makes for the window: stop trusting how the output feels and start measuring what it does. Without an eval, every prompt change is a coin flip you can't score and every regression ships silently. This page lays out the loop that fixes that, draws the line between testing before you ship and watching after, names the vocabulary the rest of the subcategory leans on, and frames the three ways you can grade an answer.

The eval loop

Evaluation is not a one-time gate you pass before launch; it is a loop you run on every change. It has four steps, and the discipline is running all four rather than stopping at the first that feels done:

  1. Define success criteria — decide what "working" means before you measure, in terms specific and measurable enough to score. "Good answers" is not a criterion; "classifies the ticket into the right category" or "cites a real source for every claim" is. Anthropic's guidance is the bar here: criteria should be specific, measurable, achievable, and relevant — vague goals produce vague evals that tell you nothing.
  2. Build an eval set — assemble the inputs you will test against, each paired with what a right answer looks like. This is the asset the whole loop turns on, and the most valuable version of it is built from real failures as they happen, not invented at a desk (principle 3 — the first ten cases that bit you in testing are worth more than a hundred synthetic ones).
  3. Grade — run the system over the eval set and score each output against its criterion. How you grade is the decision the back half of this page frames: a deterministic check, another model as judge, or a human.
  4. Iterate: change one thing (the prompt, the retrieval, or the model), re-run the eval, and compare the score to where it was. The number moved or it didn't; that is the signal a vibe check could never give you.

The loop is the point. A single grading run tells you where you are today; the loop tells you whether a change made things better or worse, which is the only question that matters once a system is live and you are improving it under pressure.
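
The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the ticket-classification task, the two `classify_v*` functions standing in for the system under test, and the names `EVAL_SET`, `exact_match`, and `run_eval` are all hypothetical.

```python
from typing import Callable

# Step 1 (hypothetical criterion): the ticket is classified into the right category.
# Step 2: the eval set, each input paired with what a right answer looks like.
EVAL_SET = [
    ("I was charged twice this month", "billing"),
    ("The app crashes when I log in", "bug"),
    ("How do I export my data?", "how_to"),
    ("Please refund my last invoice", "billing"),
]


def exact_match(output: str, expected: str) -> bool:
    """Step 3 grader: a deterministic check against the expected answer."""
    return output.strip().lower() == expected


def run_eval(system: Callable[[str], str]) -> float:
    """Run the system over the eval set and return the fraction of cases passed."""
    passed = sum(exact_match(system(text), expected) for text, expected in EVAL_SET)
    return passed / len(EVAL_SET)


def classify_v1(text: str) -> str:
    """Stand-in for the current system: a toy keyword classifier."""
    t = text.lower()
    if "charged" in t:
        return "billing"
    if "crash" in t:
        return "bug"
    return "how_to"


def classify_v2(text: str) -> str:
    """Step 4: one change relative to v1 (refund requests also count as billing)."""
    t = text.lower()
    if "charged" in t or "refund" in t:
        return "billing"
    if "crash" in t:
        return "bug"
    return "how_to"


baseline_score = run_eval(classify_v1)
candidate_score = run_eval(classify_v2)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
```

The two scores are comparable only because both versions ran over the same eval set with the same grader. Changing either one between runs would make the comparison meaningless.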

Offline and online: before you ship, and after

Evaluation happens in two places, and they answer different questions. Both belong in a serious system: they are not alternatives to choose between, they compose.

Offline evaluation runs before you ship, against a fixed eval set with known right answers. It is the pre-deployment test: you change the prompt, run the eval set, and see whether the score went up or down before any user is exposed. Because the eval set is curated and its answers are known, offline eval is where regression testing lives — proving a new version is at least as good as the last one. This is the loop above, run in a lab.
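
One way to make "at least as good as the last one" concrete is to compare results case by case rather than only in aggregate, since an unchanged total can hide a case that flipped from pass to fail. The sketch below assumes per-case pass/fail results are already available from two offline runs; the function names, case IDs, and the ship rule are hypothetical.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed in one run."""
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed under the baseline but fail under the candidate.

    A case missing from the candidate run is treated as a failure.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical per-case results from two runs over the same golden dataset.
baseline_results = {"case-01": True, "case-02": True, "case-03": False, "case-04": True}
candidate_results = {"case-01": True, "case-02": False, "case-03": True, "case-04": True}

regressions = find_regressions(baseline_results, candidate_results)
same_score = pass_rate(baseline_results) == pass_rate(candidate_results)

# Illustrative gate: block the release if any previously passing case now fails.
safe_to_ship = not regressions
```

Here both runs score 0.75, yet `case-02` regressed. Whether a single regression should block a release is a policy choice this sketch does not settle.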

Online evaluation runs after you ship, on live production traffic, where there is no pre-written right answer to compare against. Here you measure what you can without ground truth: signals like whether the user accepted the answer, escalated to a human, or rephrased and tried again, plus quality checks a judge can apply to a live response in flight. Online eval is how you catch the silent degrade — the slow drift where a system that passed every offline test starts getting worse in the wild because the traffic shifted or the model underneath moved. It depends on the trace of what actually happened in production, which is principle 11 — you can't debug, or evaluate, what you can't replay — and its depth is a sibling page on tracing and monitoring.
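
The ground-truth-free signals named above can be rolled up into rates per window of traffic and compared across windows to surface drift. This is a minimal sketch: the `TraceEvent` fields, the `degraded` rule, and the 0.10 tolerance are assumptions made for illustration, not the schema of any particular tracing tool.

```python
from dataclasses import dataclass


@dataclass
class TraceEvent:
    """One production interaction; the field names are illustrative assumptions."""
    accepted: bool   # the user accepted the answer
    escalated: bool  # the user escalated to a human
    rephrased: bool  # the user rephrased and tried again


def signal_rates(events: list[TraceEvent]) -> dict[str, float]:
    """Summarise the signals as rates over one window of traffic."""
    n = len(events)
    if n == 0:
        return {"accepted": 0.0, "escalated": 0.0, "rephrased": 0.0}
    return {
        "accepted": sum(e.accepted for e in events) / n,
        "escalated": sum(e.escalated for e in events) / n,
        "rephrased": sum(e.rephrased for e in events) / n,
    }


def degraded(previous: dict[str, float], current: dict[str, float],
             tolerance: float = 0.10) -> bool:
    """Flag a window whose acceptance fell, or escalation rose, by more than the tolerance."""
    return (
        previous["accepted"] - current["accepted"] > tolerance
        or current["escalated"] - previous["escalated"] > tolerance
    )


# Hypothetical traffic: acceptance drops from 80% to 60% between two windows.
last_week = [TraceEvent(True, False, False)] * 8 + [TraceEvent(False, True, False)] * 2
this_week = [TraceEvent(True, False, False)] * 6 + [TraceEvent(False, True, True)] * 4

alert = degraded(signal_rates(last_week), signal_rates(this_week))
```

These signals are proxies rather than correctness scores: a user can accept a wrong answer. They are therefore best read as a trend to investigate, not a verdict.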

The honest division of labor: offline eval proves a change is safe to ship; online eval proves it stayed good once it shipped. Skip the first and you ship blind; skip the second and you find out about the degrade from a customer.

The vocabulary

This subcategory leans on a small, precise set of terms. Here they are once, plainly; you will meet each in depth as the subcategory goes on.

| Term | What it means |
|---|---|
| Eval set | The collection of test inputs you run the system against, each paired with the expected result. The asset the whole loop turns on. |
| Golden dataset | A curated, trusted eval set whose answers are known to be correct: the held-out cases you measure against, kept stable so scores are comparable over time. |
| Ground truth | The known-correct answer for a given input, the "right answer" a graded output is compared to. The thing a golden dataset is made of. |
| Metric | A quantified measure of performance, such as accuracy, a pass rate, a 1-to-5 score, or response time. The number a grade produces. |
| Judge | What does the grading: a deterministic check, another model (LLM-as-judge), or a human, each scoring an output against its criterion. |
| Baseline | The score you are comparing against, usually the current version of the system. A change is an improvement only relative to a baseline. |
| Regression | A change that makes the score worse: a case that used to pass and now fails. The thing offline eval exists to catch before it ships. |

Each of these gets a page or a section of its own as the subcategory goes on. Here they are just the words, so the rest reads cleanly.
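
For readers who think in code, the terms map onto a handful of small objects. The sketch below is only an illustration of how they relate; the `EvalCase` type, the toy cases, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One entry in an eval set: an input paired with its ground truth."""
    input: str
    ground_truth: str  # the known-correct answer


# Golden dataset: a curated eval set, kept stable so scores stay comparable.
GOLDEN = [
    EvalCase("2 + 2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("3 * 3", "9"),
]


def judge(output: str, case: EvalCase) -> bool:
    """Judge: here a deterministic check of the output against ground truth."""
    return output == case.ground_truth


def accuracy(outputs: list[str]) -> float:
    """Metric: the fraction of golden cases the judge passes."""
    return sum(judge(out, case) for out, case in zip(outputs, GOLDEN)) / len(GOLDEN)


baseline = accuracy(["4", "Paris", "9"])   # score of the current version
candidate = accuracy(["4", "Paris", "6"])  # score after a hypothetical change
is_regression = candidate < baseline       # regression: the change made the score worse
```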

How you grade: three ways, and when each fits

The hardest decision in evaluation is not whether to grade but how: what plays the role of judge. There are three options, they trade off the same way every time, and a real eval set usually mixes them depending on the question each case asks. This page frames the choice; the depth of building each judge well lives in sibling pages. The Evaluation Style Guide sets the bar a judge clears before you trust its scores, and the LLM-as-judge page goes deep on the model-graded case.

  • Deterministic metric — code checks the output against a rule: exact match against ground truth, a key phrase present, valid JSON, a number in range. Fastest, cheapest, perfectly repeatable, and the right choice whenever the answer is clear-cut — a category, a field, a yes or no. Its limit is nuance: it cannot judge whether prose is good, only whether it matches.
  • LLM-as-judge — another model grades the output against a rubric you write, returning a score or a verdict. This is how you grade the things a rule can't reach — coherence, tone, whether an answer is faithful to its sources — at a speed and cost a human can't match. The catch is that the judge is itself a non-deterministic model that must be evaluated before you trust it; a rubric that is vague produces a judge that is unreliable, and a judge you haven't checked is just a second opinion you're hoping is right.
  • Human — a person reads the output and scores it. The most flexible and the highest-quality signal, and the source of truth the other two are ultimately validated against — but slow and expensive, so you spend it where it earns its place: building the golden dataset, calibrating an LLM judge, and adjudicating the cases that are genuinely ambiguous. Not the method you scale to thousands of runs.
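
The first two judges can be sketched as plain functions. This is a hedged illustration: the rubric wording, the `VERDICT:` reply format, and the `call_model` parameter are assumptions, and `fake_model` is a stand-in so the example runs without calling any real model API.

```python
import json
import re
from typing import Callable, Optional


def is_valid_json(output: str) -> bool:
    """Deterministic check: the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def number_in_range(output: str, low: float, high: float) -> bool:
    """Deterministic check: the output is a number within [low, high]."""
    try:
        return low <= float(output) <= high
    except ValueError:
        return False


# Hypothetical rubric; a vague rubric would make the judge unreliable.
RUBRIC = (
    "You are grading an answer for faithfulness to a source.\n"
    "PASS if every claim in the answer is supported by the source; otherwise FAIL.\n"
    "End your reply with exactly one line: VERDICT: PASS or VERDICT: FAIL.\n"
)


def llm_judge(answer: str, source: str, call_model: Callable[[str], str]) -> Optional[bool]:
    """LLM-as-judge: ask a model to grade against the rubric and parse its verdict.

    Returns None when the reply has no parseable verdict, so a malformed
    reply is never silently counted as a pass.
    """
    prompt = f"{RUBRIC}\nSource:\n{source}\n\nAnswer:\n{answer}\n"
    reply = call_model(prompt)
    match = re.search(r"VERDICT:\s*(PASS|FAIL)\s*$", reply.strip())
    if match is None:
        return None
    return match.group(1) == "PASS"


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (an assumption of this sketch)."""
    return "The answer restates the source.\nVERDICT: PASS"


verdict = llm_judge("Paris is the capital.", "Paris is the capital of France.", fake_model)
```

The deterministic checks return the same result on every run. The judge's verdict depends on whatever `call_model` returns, which is why a model-graded score needs checking before it is trusted.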

The pattern across all three: pick the cheapest judge that can actually answer the question this case asks. Deterministic where the answer is exact, a model where it needs judgment you can spec in a rubric, a human where it needs judgment you can't — and human-validated underneath, because the other two are only as trustworthy as the ground truth a person established.
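
"Human-validated underneath" can be made measurable by scoring an LLM judge against human labels on a shared sample before relying on it. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The sample labels are invented, and the 0.6 threshold is an arbitrary illustration rather than a recommended bar.

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of cases where the judge's verdict matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Cohen's kappa for two raters with binary labels."""
    p_observed = agreement(human, judge)
    n = len(human)
    human_yes = sum(human) / n
    judge_yes = sum(judge) / n
    # Agreement expected by chance, given each rater's own pass rate.
    p_expected = human_yes * judge_yes + (1 - human_yes) * (1 - judge_yes)
    if p_expected == 1.0:
        # Both raters gave the same single label everywhere; kappa is undefined,
        # and this sketch reports 1.0 because they agree on every case.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical calibration sample: human labels versus LLM judge verdicts.
human_labels = [True, True, True, False, False, True, False, True]
judge_labels = [True, True, False, False, False, True, True, True]

raw = agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)

# Illustrative threshold only; the right bar depends on the stakes of the task.
trust_judge = kappa >= 0.6
```

In this sample the judge matches the human on 6 of 8 cases, but kappa is about 0.47. Much of that agreement would be expected by chance, because both raters pass most cases.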

The spine: three surfaces, composed by where the system runs

Evaluation is not one tool. Like grounding and prompting before it, the surfaces compose — they are complementary instruments an engineer picks by where the system runs, never rival products to choose between. The throughline every Style Guide in this catalog already invokes — eval every change — holds identically across all three; only the tooling differs.

  • Agentforce Testing Center and observability — when the agent runs on Agentforce inside the Salesforce security model, you evaluate it where it lives: a testing surface for running cases against the agent before release, and observability for what it does in production, with traces that can flow out through OpenTelemetry into Data 360. The eval inherits the platform's security model and stays next to the data the agent acts on.
  • Anthropic eval tooling and LLM-as-judge — at the model layer, when you are calling Claude directly, you evaluate against the define-criteria-then-grade loop Anthropic documents, with LLM-as-judge as the grading method for everything a deterministic check can't reach. This is the layer that grades the raw model behavior, independent of where it is later embedded.
  • LangSmith — when the system runs off-platform, on a custom control loop, LangSmith is the evaluation and tracing surface: datasets of examples, offline eval before you ship, and online eval over live traces. It is where an off-platform agent's eval loop and its production monitoring live in one place.

You do not pick one and pledge loyalty. An agent grounded on Data 360 and orchestrated off-platform might be tested in Testing Center, graded at the model layer with an LLM judge, and traced in LangSmith — each surface doing the part it is built for. The skill is composing them to the system you actually built, which is the same toolkit-composition logic that runs through the AI Engineering principles.

Where to go next

From here, the subcategory builds outward from the loop this page framed. The asset the whole thing turns on — how to build an eval set and choose the metrics that score it — is the eval datasets and metrics page. The model-graded judge, the one that reaches the things a rule can't, gets its own depth in the LLM-as-judge page. The platform surfaces — testing and observability inside Salesforce — are the Agentforce testing and observability page, and the trace that online eval depends on is tracing and monitoring. When an eval itself misleads you — a judge that scores wrong, a set that drifted from reality — that is debugging evals. And the bar an eval clears before you trust it is the Evaluation Style Guide. (Those siblings are landing alongside this page; named here, linked once they ship.)

Evaluation does not stand alone. It is the measurement layer under everything else this catalog covers: it scores whether grounding actually retrieved the right facts, whether a prompt or context change improved the output, and whether an agent does its job reliably enough to ship. The thing being evaluated changes; the loop does not.
