Skip to main content

LLM-as-judge: grading output that has no single right answer

Exact match grades a sentiment label in one line. It cannot grade a support reply, a summary, or a conversational answer — open-ended output where two different wordings are both correct and there is no golden string to compare against. LLM-as-judge is the move there: a second model reads the output against a rubric and returns a score. The mechanic — the rubric is the scoring criteria, you pass it input plus output plus an optional reference, and you ask the judge to reason before it scores (Anthropic — improves judging on complex tasks). The feedback shapes: Boolean, Categorical, Continuous. The biases that make a naive judge lie — position, verbosity, self-preference — and the mitigations, ending on the one that matters most: calibrate the judge against human labels before you trust it. And it runs both ways — offline over an eval set, or online over live production traces.

Reference·Last updated 2026-06-05·Drafted by Lira · Edited by German Medina

Some output you can grade in one line of code. A sentiment label is positive, negative, or neutral; the answer is right or it isn't; output == golden_answer and you're done. That is exact match, and where it fits it is the fastest, most reliable grader there is. But most of what an AI system produces in production is not that shape. A support reply, a summary of a long document, a conversational answer, a drafted email — these are open-ended. Two different wordings can both be correct, there is no single golden string to compare against, and exact match grades every one of them as a failure because none matches the reference character-for-character. When the output has no single right answer, you need a grader that can judge meaning, not match text.

That grader is another model. LLM-as-judge is the technique: a second LLM reads the output against a rubric you wrote and returns a score. This page is the mechanic — what the rubric is, what you pass the judge, the reasoning boost that makes it more reliable, the biases that make a careless judge quietly wrong, and the discipline that has to come before you trust any of it. It is principle 4 made operational — evaluate before you ship — for the large class of output that code-based grading can't touch.

This is a reference for the technique across platforms. The judge model below is Claude (Anthropic), the evaluator harness is LangSmith's, and the in-platform equivalent is Agentforce's Outcome test. They compose by where the system runs; the discipline transfers, the exact knobs are each vendor's.

When to reach for it

Reach for a judge when the thing you're grading is open-ended and a deterministic check can't express "correct." Anthropic's grading guidance ranks the methods by exactly this trade: code-based grading is fastest and most reliable but "lacks nuance for more complex judgements"; human grading is the most flexible and highest quality but "slow and expensive"; LLM-based grading is "fast and flexible, scalable and suitable for complex judgement." A judge is the scalable stand-in for the human grader — you reach for it precisely when the answer needs human-like judgement but you have ten thousand of them to grade and can't pay a person to read each one.

So the rule is not "use a judge for everything." It's: if exact match or a string check can express the criterion — a label, a required phrase, a parseable field — use that, because it's cheaper and it never has an opinion. Save the judge for what those can't reach: tone, coherence, faithfulness to a source, whether a free-text answer actually addresses the question. That's the open-ended output where there is no golden string, and it's where a judge earns its cost.

The feedback shapes

A judge doesn't only return "good" or "bad." LangSmith's evaluator taxonomy gives three feedback types, and choosing the right one is the difference between a score you can act on and a number that means nothing:

Feedback typeWhat the judge returnsUse when
BooleanTrue / falseThe criterion is pass/fail — is the answer faithful to the source, yes or no
CategoricalOne value from a predefined setThe verdict is a labelled bucket — correct / partial / incorrect, or a named failure mode
ContinuousA number inside a stated rangeThe quality is a degree, not a gate — rate coherence from 1 to 5

Pick the coarsest shape that still captures what you care about. Boolean is the easiest to act on (it gates) and the easiest for a judge to be consistent about; a continuous 1-to-5 carries more information but a judge's "4" and "5" drift unless the rubric pins exactly what separates them. Anthropic's own guidance leans the same way — be "empirical or specific," instruct the judge to output only correct or incorrect, or to judge on a fixed scale, because "purely qualitative evaluations are hard to assess quickly and at scale."

The rubric is the scoring criteria

The rubric is not a nice-to-have around the judge — it is the judge's instructions. In LangSmith's words, the feedback configuration "is the scoring criteria that your LLM-as-a-judge evaluator will use. Think of this as the rubric that your evaluator will grade based on." A vague rubric ("rate the quality") produces a vague, drifting score. A specific one produces a score you can trust. Anthropic's example of a good rubric is exactly this concrete: "The answer should always mention 'Acme Inc.' in the first sentence. If it does not, the answer is automatically graded as 'incorrect.'"

Alongside the rubric, you decide what the judge sees. The evaluator is handed some combination of three things, mapped in as variables:

  • Input — the question or task the system was given. Needed when "correct" depends on what was asked.
  • Output — the thing being graded. Always passed; it's the subject.
  • Reference — a gold answer or source document to grade against, when one exists. Pass it for faithfulness or correctness checks; omit it for criteria like conciseness or tone that don't need a comparison.

You pass only what the criterion requires. Grading conciseness needs the output alone. Grading faithfulness needs the output plus the source it's supposed to be faithful to. Grading whether an answer addresses the question needs the input plus the output. Pass the wrong set — a reference the rubric never uses, or no input when correctness depends on it — and the judge grades on the wrong information.

Ask the judge to reason before it scores

The single highest-payoff move in writing a judge: make it think before it grades, not after. Anthropic states it directly — "Encourage reasoning: Ask the LLM to think first before deciding an evaluation score, and then discard the reasoning. This increases evaluation performance, particularly for tasks requiring complex judgement." The pattern in their grader prompt is to have the model reason in <thinking> tags and then emit the verdict in <result> tags, and you keep only the verdict.

The reason this works is the same reason it works for the system being graded: a score emitted cold is a snap judgement, and on anything subtle a snap judgement is noisy. Forcing the reasoning first makes the judge actually walk the rubric — check each criterion, weigh the output against it — before it commits to a number. You throw the reasoning away because you only need the score; you ask for it because the score is better for having been earned. On a simple Boolean over an obvious case it barely matters. On "is this multi-paragraph answer faithful to a long source," it's the difference between a judge that grades and a judge that guesses.

The biases — and the one mitigation that matters

A judge is a model, and a model has biases that have nothing to do with the rubric. Left unmanaged, they make the judge confidently wrong in ways that are easy to miss because the score looks fine. The known ones and their mitigations:

BiasWhat it isMitigation
Position biasWhen comparing two outputs, the judge favors whichever came first (or last) regardless of qualitySwap the order and judge again — or randomize position across the eval set — and only trust verdicts stable across the swap
Verbosity biasThe judge reads "longer" as "better" and rewards length over substanceMake the rubric score substance explicitly; penalize padding; don't let word count stand in for quality
Self-preference biasA judge tends to favor output written in its own model's styleGrade with a different model than the one that produced the output

That last mitigation is Anthropic's own working note, repeated through their eval examples: "Generally best practice to use a different model to evaluate than the model used to generate the evaluated output." A model grading its own family's prose is the conflict of interest baked into the simplest setup.

But knowing the biases isn't enough, because you can't see them from inside the judge — the score always looks plausible. The discipline that catches all of them at once is calibration against human labels: before you trust a judge at scale, have humans grade a sample, run the judge on that same sample, and check the judge agrees with the humans. If it doesn't, you fix the rubric — not the humans — and re-check. Only a judge that tracks human judgement on cases you've actually verified is a judge whose ten thousand unverified scores mean anything. LangSmith builds the feedback loop in: it lets you "collect human corrections on evaluator scores" to "better align the LLM-as-a-judge evaluator to human preferences," folding those corrections back into the judge as examples. A judge you never calibrated is a number generator you've decided to believe.

Offline and online: the same judge, two places

A judge isn't only for the lab. The same evaluator runs in two modes, and a mature system uses both:

  • Offline — the judge grades a fixed eval set: a curated collection of inputs with known-good references, run before you ship a prompt or model change. This is the regression gate — it tells you whether the change you're about to deploy made quality better or worse, on cases you control, before any customer sees it.
  • Online — the judge scores live production traces as they happen. LangSmith's online evaluators give "real-time feedback on your production traces," acting as "a scalable substitute for human-like judgement" on traffic no eval set anticipated. You filter which runs to score and set a sampling rate so you're not grading every single call, and the harness pulls the variables out of each trace and hands them to the judge.

The two answer different questions. Offline asks "is this version good enough to ship," against a frozen set. Online asks "is the shipped system still good right now," against real traffic — which is how you catch the silent degrade that an eval set, frozen by definition, never contains. The judge mechanic is identical; only the source of the input changes.

The composition

Three surfaces, composed by where the system runs — not three products you choose between:

  • Claude as the judge model (Anthropic) — the model doing the grading, with the "reason before scoring" pattern and the standing advice to judge with a different model than the one under test. This is the engine; it's the same engine whether the harness around it is LangSmith or your own script.
  • LangSmith's evaluator harness — the scaffolding that defines the rubric, maps input/output/reference into the judge, runs it offline over a dataset or online over production traces, and collects human corrections to calibrate it. This is where the judge lives when the system is built on a general LLM stack.
  • Agentforce's Outcome test — the judge-like check inside the platform. An Outcome test passes when the actual outcome's natural-language "gist" matches the expected outcome — a semantic comparison, not an exact-string one, which is precisely the open-ended-grading problem this whole page is about, solved natively for an agent built in Agentforce.

You don't pick one. A team running an Agentforce agent grounded on Data 360 uses Outcome tests where the agent lives; a team on a Claude-plus-LangGraph stack uses Claude-as-judge inside LangSmith; a team running both grades each system where it runs. The judge is the same idea everywhere — meaning-based grading of open-ended output — and the platform just decides which harness holds it.

The throughline

LLM-as-judge is the grader for everything exact match can't reach: open-ended output where two wordings are both right and there's no golden string. The rubric is the judge's instructions, so it has to be specific; you pass only the input, output, and reference the criterion needs; and you ask the judge to reason before it scores because an earned verdict beats a snap one on anything subtle. The feedback shape — Boolean, Categorical, Continuous — should be the coarsest one that still captures what you care about. The biases are real and invisible from the inside — position, verbosity, self-preference — and the mitigations help, but the one that actually makes a judge trustworthy is calibrating it against human labels before you believe its scores. Run it offline as a regression gate and online over live traffic, and the same judge that proves a change is safe to ship also catches the day it quietly stops working. A judge is how evaluation scales past the things you can grade with == — but only a calibrated judge is evaluation rather than a number you've decided to trust.

Related

Reference: