Eval datasets and metrics: the test set is the product spec

An evaluation is two things bolted together: a dataset of cases to run, and a grading method that scores the output on each case. Miss either half and you do not have an eval. A dataset with no grader is a pile of inputs; a grader with no dataset is an opinion with nowhere to point. This page builds both halves, because the test set — the cases plus how you score them — is the closest thing an AI feature has to a product spec. It is the written-down answer to "what does working mean here," and until it exists, "it works" is a feeling, not a claim.

This is where principle 3 stops being a slogan and turns into work. If you can't evaluate it, you can't ship it — and you cannot evaluate it without a set of cases and a way to grade them. Everything downstream in this subcategory — LLM-as-judge, tracing, regression, the monitoring that catches a silent degrade — runs on the dataset this page builds. Get the set right and the rest has something real to stand on. Get it wrong — too small, skewed away from production, graded by vibes — and every number it produces is confident and meaningless.

Building the eval set: mirror the real distribution, then add the edges

The first rule is the one teams skip because it sounds obvious: the eval set mirrors your real-world task distribution. Anthropic's first eval-design principle is "design evals that mirror your real-world task distribution" — if 70% of production traffic is short factual questions and 30% is multi-step requests, an eval set that is 90% multi-step lies to you in both directions. It under-tests the common case and over-weights the rare one, so a model that regresses on the bread-and-butter query can still post a green score. The dataset is a sample of reality; a biased sample produces a biased verdict.

The second rule is the one that catches the failures: deliberately include the edge cases. The same Anthropic guidance is explicit — "don't forget to factor in edge cases" — and names the kinds that matter: irrelevant or nonexistent input, overly long input, harmful or off-topic user input, and ambiguous cases where even humans would struggle to agree. These are exactly the inputs a happy-path demo never sees and production sees on day one. An eval set built only from clean, representative cases tells you the model handles the easy 80% — which you already suspected — and tells you nothing about the 20% that generates the support tickets. The edges are not noise to be cleaned out of the set; they are the point of having one.

How big: volume over hand-polish

The instinct is to hand-craft a few dozen perfect, hand-graded cases. Anthropic's third eval-design principle says the opposite, and it is worth quoting because it is counterintuitive: "More questions with slightly lower signal automated grading is better than fewer questions with high-quality human hand-graded evals." The reasoning is statistical. A handful of cases produces a number with enormous variance — three cases flipping from pass to fail can swing a 20-case eval by fifteen points, so you cannot tell a real regression from sampling noise. Hundreds of automatically graded cases give a number you can actually move and trust, and they run on every change for free. A slightly noisier grader applied to many cases beats a pristine grader applied to few, because volume is what buys you signal. This is also why the second eval-design principle — "automate when possible" — is not a convenience but a prerequisite: a set you can only grade by hand is a set you will grade once and never again, which makes it useless as a regression net. Writing hundreds of cases by hand is its own cost; Anthropic's own guidance is to seed a baseline set and have Claude generate more from it.

Grading methods: a method per case

Every case needs a way to turn an output into a score. There is no single grader for a whole set — you pick the method that fits each case's shape, and one eval set routinely uses several. Anthropic's rule for choosing is to reach for "the fastest, most reliable, most scalable method" the case allows, and to avoid human grading wherever an automated method will do. Here are the methods, what each is good for, and where each one bites:

Method	How it works	Good for	Watch out
Exact / string match	Compare the output to a known answer after normalizing case and whitespace (`output == golden`), or check a key phrase is present (`phrase in output`)	Categorical, clear-cut answers — sentiment labels, yes/no, a specific value that must appear	Brittle on free-form text: a correct answer phrased differently fails. Wrong tool for anything open-ended.
Code-graded	A deterministic function checks a property of the output — valid JSON, compiles, in range, passes a regex	Structured output, format contracts, anything with a checkable rule	Only as good as the rule. Checks the shape, not whether the content is right.
Multiple-choice	Constrain the task so the model picks from a fixed set, then check the choice	Classification and routing where you control the option set; cheapest reliable signal	Requires reshaping the task into options, which not every task allows.
Similarity / semantic	Embed output and reference, score how close they are (e.g. cosine similarity over sentence embeddings); ROUGE-L for summary overlap	Consistency and paraphrase — "are two answers semantically the same" even when wording differs	A score, not a verdict; you set the threshold. Needs an embedding model and reference text.
LLM-graded	A separate model scores the output against a rubric — correct/incorrect, or a 1-to-5 scale	Nuanced, subjective qualities — tone, faithfulness, helpfulness — that no rule captures	Must be validated before you trust it. Use a different model than the one under test, give it a detailed rubric, and make it output a discrete label.

Three details from Anthropic's grading guidance carry the LLM-graded row, because it is the one teams reach for too fast and trust too soon. First, the rubric is the grader — a vague rubric produces a vague, unstable score, so it must be concrete and empirical: "the answer must mention 'Acme Inc.' in the first sentence; if it does not, grade it incorrect." Second, make the judge output a discrete result — correct/incorrect or a fixed 1-to-5 scale, never a free-form paragraph you then have to parse. Third, let it reason, then discard the reasoning — asking the judge to think in <thinking> tags before it emits a <result> measurably improves the score on hard judgments. LLM-as-judge gets its own page in this subcategory; here it is one row in the table, the method you choose when no cheaper one fits the quality you are scoring.

Ground truth: where the right answer comes from, and what it costs

Most of these methods need a known-good answer to compare against — exact match needs the golden value, similarity needs the reference text, reference-based LLM grading needs the expected output. That known-good answer is ground truth (LangSmith calls it the reference output), and the uncomfortable fact is that someone has to produce it. It does not fall out of the sky. For a thousand-case sentiment set, that is a thousand human-labeled sentiments; for a summarization set, a reference summary per article. This labeling is the real cost of an eval set, and it is why the volume-over-hand-polish principle matters twice: you want enough cases for signal, but every case with a reference output is a case someone had to label.

Two things take the edge off the cost. First, not every method needs a reference — code-graded checks ("is this valid JSON," "is it under the length cap") and reference-free LLM grading ("does this response leak PII," judged against a policy, not an expected answer) score the output on its own properties, no golden answer required. Lean on those where the task allows; they are ground-truth-free by construction. Second, where you do need labels, seed a small set by hand and grow it — generate candidate cases, label the candidates rather than authoring each from scratch. What you must not do is let the model under test write its own ground truth; that is grading the exam with the answer key the student wrote, and it certifies nothing.

Versioning the eval set: so two runs are comparable

An eval number means nothing on its own. It only means something compared — to last week, to the prompt before the change, to the baseline you are trying to beat. That comparison is only valid if both runs ran against the same set. The moment you add ten cases, drop three, or fix a mislabeled answer, this week's score is no longer comparable to last week's, and a "regression" might just be a harder set. So you version the eval set. Treat the dataset as a versioned artifact, the way you treat code: a change to the cases is a commit, with a note on what changed and why. LangSmith builds this in — dataset versions are created automatically as examples change, and you can tag a version to mark a milestone — and even without that tooling the discipline is the same: pin the version a result was produced against, and when you change the set, say so, so nobody compares two numbers that were never measuring the same thing. This is principle 11 reaching into the eval set itself — trace everything — because a score you cannot tie to an exact set and version is a score you cannot defend.

One set, three consumers: where it plugs in

The toolkit composes here exactly as principle 7 says it should — compose the toolkit to the job — and the eval set is the shared artifact that makes the composition work. The cases and their ground truth are authored once; what changes is where you run them. The same set feeds three places:

The model layer — Anthropic's eval tooling. The Claude Console has a built-in Evaluation tool: an Evaluate tab in the prompt editor where your cases become test rows (added by hand, generated by Claude, or imported from CSV), and you grade responses on a 5-point scale, compare two prompt versions side by side, and re-run the whole suite against a new prompt version. It is the fastest path from "I changed the prompt" to "here is what that did across every case," and it is where a prompt-level change gets its first read. (It needs prompts written with {{variable}} placeholders so cases can vary the inputs.)
The off-platform harness — a LangSmith dataset. When the thing under test is not a single prompt but a chain, an agent, or a graph, the same cases live as a LangSmith dataset of examples — inputs plus reference outputs — graded by code, LLM-as-judge, human, or pairwise evaluators, with each run captured as an experiment you compare side by side across versions. This is the harness for the whole system, not just the model call.
The regression net for what already shipped. The same set is the safety net under every page already in this catalog. The agents you build in agents, the retrieval you tune in grounding, the prompts you harden in prompting — none of them stay correct by accident. The eval set is what you re-run when you change a model, edit a system prompt, or swap a retriever, so a fix in one place that quietly breaks another shows up as a red case instead of a customer complaint. Note that retrieval quality has its own dedicated eval set, query-to-chunk, separate from answer quality — see retrieval quality; the set this page builds grades the answer, and the two run side by side.

The same cases, the same ground truth, three harnesses. You do not maintain three eval sets; you maintain one and point it where the question is.

The discipline, restated

An eval is a dataset plus a grader, and the test set is the product spec for an AI feature — the written-down definition of working. Build the dataset to mirror the real distribution and stock it deliberately with edge cases, because the cases you leave out are the ones that break first. Make it big and automatically graded rather than small and hand-polished, because volume is what buys you signal you can trust. Grade each case with the cheapest reliable method that fits its shape, and when only an LLM judge will do, give it a concrete rubric, a discrete output, and a different model than the one under test. Source the ground truth deliberately and version the set like code, so two runs are actually comparable. Then point that one set at the Console Evaluation tool, at a LangSmith dataset, and at everything you have already shipped. Do that and principle 3 is satisfied: you can evaluate it, so you can ship it. Skip it and you are shipping on a feeling — which works right up until the input you never tested arrives.

AI Engineering principles — if you can't evaluate it, you can't ship it (3); trace everything (11); compose the toolkit (7)
What is context engineering — the prompting discipline the eval set is the regression net for
Structured output — code-graded validation is the eval-side mirror of validate-then-fall-back
Retrieval quality — the retrieval eval set (query-to-chunk), the analog this page's answer-side set runs beside
What is grounding — the grounded systems an eval set keeps honest
What is an agent — the agents an eval set is the regression net for
Evaluation Style Guide — when to reach for which grader
Debugging evals — when the dataset or the metric is the thing lying
Debugging agents — what you reach for when a case goes red and you need the trace

Reference: