Evaluation & Observability
Knowing it works — and stays working: evals, test sets, LLM-as-judge, tracing, regression, and the monitoring that catches a silent degrade before a customer does.
Foundation · 2
Production note
Evaluation gotchas: how a measurement lies to you
An eval is supposed to be the one thing in an AI system you can trust — the number that tells you the prompt got better, the agent didn't regress, the new model is safe to ship. But an eval can lie: it can measure memorization instead of capability, optimize a number while losing the goal, score with a biased judge, or pass offline and fail live. Ten gotchas that make a measurement look like proof when it isn't, each with the question to answer first and the cost of trusting the wrong number.
Decision framework
Evaluation Style Guide: the bar a change clears before it ships
The opinionated rules Cleon applies to every evaluation: the first decision (what to measure and where), offline versus online, how to grade (deterministic, LLM-as-judge, or human), and the 'eval every change' gate that every other Style Guide in this catalog invokes. The discipline document that turns the evaluation gotchas into a checklist and principle 3 into practice: if you can't evaluate it, you can't ship it, and a measurement you can't defend is worse than none, because you act on it. The page that gives 'eval every change' its home, and composes Agentforce Testing Center, Anthropic eval tooling, and LangSmith by where the system runs rather than picking a camp.
Reference · 5
Reference
What is evaluation? Measuring whether the system works, instead of hoping
Evaluation is the discipline of measuring whether an AI system does its job — replacing 'it looked good in three tries' with a number you can score, compare, and defend. The eval loop: define success criteria, build an eval set, grade, iterate. Offline evaluation before you ship versus online evaluation on live traffic. The vocabulary the rest of this subcategory uses — eval set, golden dataset, ground truth, metric, judge, baseline, regression. And the three ways to grade — deterministic metric, LLM-as-judge, human — with when each fits. Principle 3: if you can't evaluate it, you can't ship it.
Reference
Eval datasets and metrics: the test set is the product spec
An eval is two halves: a dataset of cases and a way to grade the output on each one. This page builds both. The dataset mirrors the real task distribution and deliberately includes the edge cases, because the cases you leave out are the ones that break in production — and Anthropic's guidance is blunt about size: more questions with slightly lower-signal automated grading beats a handful of hand-graded ones. The grading half is a method per case — exact match, code-graded, multiple-choice, similarity, or LLM-graded — each with what it's good for and where it bites. Ground truth is where the right answer comes from and what it costs; versioning the set is what keeps two runs comparable. The same set then feeds the Console Evaluation tool, a LangSmith dataset, and the regression net for everything already shipped.
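As a rough illustration of the two halves described above, here is a minimal sketch of a versioned eval set graded by exact match. Every name in it (`EvalCase`, `exact_match`, `run_eval`, `toy_classifier`, `EVAL_SET_V1`) is hypothetical and chosen for this example; the toy classifier stands in for a real model call, and nothing here reflects the API of the Console Evaluation tool or LangSmith.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One row of a hypothetical eval set: an input and its ground-truth label."""
    input_text: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Deterministic grader: case- and whitespace-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(cases, system):
    """Return the fraction of cases where system(input) matches ground truth."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(exact_match(system(c.input_text), c.expected) for c in cases)
    return passed / len(cases)


def toy_classifier(text: str) -> str:
    """Stand-in for the model call (assumption): flags any 'not' as negative."""
    return "negative" if "not" in text.lower() else "positive"


# The version suffix is part of the name so two runs can state which set they used.
EVAL_SET_V1 = [
    EvalCase("I love this product", "positive"),
    EvalCase("This is not what I ordered", "negative"),
    EvalCase("Not bad at all", "positive"),  # edge case: negation used as praise
    EvalCase("Works as described", "positive"),
]

accuracy = run_eval(EVAL_SET_V1, toy_classifier)
print(f"accuracy on EVAL_SET_V1: {accuracy:.2f}")
```

The deliberately included edge case is the one the toy classifier fails, which is the point of putting it in the set: without it, the score would read as perfect.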
Reference
LLM-as-judge: grading output that has no single right answer
Exact match grades a sentiment label in one line. It cannot grade a support reply, a summary, or a conversational answer: open-ended output where two different wordings are both correct and there is no golden string to compare against. LLM-as-judge is the move there: a second model reads the output against a rubric and returns a score. The mechanic: the rubric holds the scoring criteria, you pass the judge the input, the output, and an optional reference, and you ask it to reason before it scores (per Anthropic's guidance, this improves judging on complex tasks). The feedback shapes: Boolean, Categorical, Continuous. The biases that make a naive judge lie (position, verbosity, self-preference) and the mitigations, ending on the one that matters most: calibrate the judge against human labels before you trust it. And it runs both ways: offline over an eval set, or online over live production traces.
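One way to picture the calibration step is to compare the judge's Boolean verdicts with human labels on the same outputs. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance; kappa is a common choice for this and an assumption here, not something the page prescribes. The function names and the sample labels are hypothetical.

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def cohen_kappa(judge_labels, human_labels):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human_labels)
    p_o = agreement_rate(judge_labels, human_labels)
    labels = set(judge_labels) | set(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    p_e = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical pass/fail verdicts on the same eight outputs.
human = [True, True, False, False, True, False, True, True]
judge = [True, True, False, True, True, False, True, False]

print(f"agreement: {agreement_rate(judge, human):.2f}")
print(f"kappa:     {cohen_kappa(judge, human):.2f}")
```

Raw agreement can look healthy when one label dominates, so a chance-corrected figure is a useful second number before deciding whether the judge is trustworthy enough to run unattended.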
Reference
Agentforce testing and observability: evaluating the agent where it lives
The platform-native half of the eval spine: how you test and observe an agent that runs on Agentforce inside the Salesforce security model. Before deploy — Testing Center, the low-code UI for running cases against the agent; the pro-code Agentforce DX path that generates a YAML test spec via the `agent generate test-spec` CLI; and the Testing API for programmatic batch runs. The three things a test case checks — the expected topic, the expected actions, and the expected outcome as a natural-language match. After deploy — Agentforce Observability: session traces exported in OpenTelemetry (OTLP) format, stored in Data 360, with quality scores and flags for low-performing topics. The in-platform instrument; the model layer and LangSmith are the off-platform half (principle 7).
Reference
Tracing and monitoring: catching the degrade an eval set can't see
An offline eval is frozen by definition — it grades the cases you thought of, before you ship. Production sends traffic no eval set anticipated, and that is where systems quietly rot: a model upgrade, a distribution shift, an upstream change moves the output and every offline test still passes. This page is the production half. Tracing: a trace and spans per request, logging inputs, outputs, latency, cost and tokens, tool calls, retrieved context, the metric score, and user feedback — each shown as a real table with why it matters. Online evaluation: run a judge or metric over live traces for real-time feedback, filter which runs to score, set a sampling rate so you're not grading every call. Catching the silent degrade: alert on a metric drop, not on a crash. Composed across two surfaces by where the system runs — LangSmith online evaluators off-platform, Agentforce session tracing exported in OpenTelemetry into Data 360 in-platform.
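The sampling and alerting ideas above can be sketched in a few lines. This is an illustration under stated assumptions, not the LangSmith or Agentforce mechanism: `should_score`, `metric_dropped`, the 10% sampling rate, and the 0.05 drop threshold are all hypothetical choices made for the example.

```python
import random
from statistics import mean


def should_score(rng: random.Random, sampling_rate: float) -> bool:
    """Decide whether to run the online evaluator on this trace."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    return rng.random() < sampling_rate


def metric_dropped(baseline_scores, recent_scores, max_drop=0.05):
    """Alert condition: the recent mean fell more than max_drop below baseline."""
    return mean(baseline_scores) - mean(recent_scores) > max_drop


rng = random.Random(0)  # seeded only so the example is repeatable
sampled = sum(should_score(rng, 0.10) for _ in range(1000))
print(f"scored {sampled} of 1000 traces")

baseline = [0.90, 0.92, 0.88, 0.90]  # hypothetical scores from last week
healthy = [0.89, 0.90, 0.91, 0.90]   # no crash, no drop
degraded = [0.80, 0.82, 0.78, 0.80]  # no crash, but quality slipped

print("healthy window alerts: ", metric_dropped(baseline, healthy))
print("degraded window alerts:", metric_dropped(baseline, degraded))
```

Nothing in the degraded window raises an exception, which is why the alert keys on the score rather than on errors.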