Tracing and monitoring: catching the degrade an eval set can't see

An offline eval set is frozen the moment you build it. It grades the cases you thought of, against answers you wrote down, before any of it ships, and that is exactly what makes it a regression gate you can trust (see Eval datasets and metrics). But the frozen part is also its blind spot: production sends traffic no eval set anticipated, phrased in ways you never tried, and the system that passed every offline case can start getting worse in the wild without a single test going red. The model underneath gets upgraded, the input distribution drifts as a new audience arrives, an upstream service changes a field, and the output degrades while your offline suite, by definition, keeps grading the same green cases. This is the silent degrade, and it is the failure offline evaluation structurally cannot catch.

This page is the production half of the subcategory. Where What is evaluation drew the line between offline and online, and LLM-as-judge showed the same judge running in both modes, this page builds the online side: how you trace what actually happened in production, how you evaluate live traffic, and how you catch the degrade before a customer does. It is principle 11 made operational (trace everything, because you can't debug, or evaluate, what you can't replay), and it is what turns a system that was correct at launch into one that stays correct on a Tuesday three months later.

This is a reference for the discipline across surfaces. The online-evaluator harness described below is LangSmith's; the in-platform equivalent is Agentforce's session tracing, recorded in Data 360 and exported through OpenTelemetry. The discipline transfers; the exact knobs are each vendor's.

Tracing: one trace, spans inside it, per request

Before you can evaluate or debug a production run, you have to have captured it. A trace is the record of a single request end to end; the spans inside it are the individual steps — the retrieval call, the model call, each tool invocation — each with its own timing and payload nested under the trace. One user request produces one trace; everything that happened to serve it lives in the spans beneath. Without that record, a complaint about a bad answer last Tuesday is unanswerable: you have the output and nothing about how the system arrived at it.

The discipline is to log enough that you can fully reconstruct the run from the trace alone — replay it in your head without re-running anything. Concretely, that is the following, captured per request:

What to log	Why it matters
Input	The request as the system received it — the prompt, the user message, the parameters. The starting point of any replay; without it you cannot tell whether a bad answer came from a bad question or a bad system.
Output	The final response returned to the user. The thing being graded, and the thing a complaint is about.
Latency	Wall-clock time for the request, and per span. Cost and latency are features (principle 6); a correct answer that arrives too slowly has not solved the problem, and a per-span breakdown shows which step is the slow one.
Cost and tokens	Input, output, and cached tokens, and the resulting cost. Lets you watch the bill move with traffic and catch a prompt change that quietly doubled token use before the invoice does.
Tool and action calls	Every tool the run invoked, the arguments it passed, and what came back. Each tool is a blast radius (principle 5); the trace is the audit trail of what the agent actually did, not just what it said.
Retrieved context	The chunks or records grounding pulled in for this request. When an answer is wrong but fluent, the failure is usually retrieval, not the model — and you can only see that if the trace shows what the model was actually given to work with.
Metric / eval score	The score an online evaluator assigned this run, written back onto the trace. This is what turns a pile of logs into something you can chart, alert on, and filter by.
User feedback	The signal from the person on the other end — a thumbs up/down, an accepted-or-rejected answer, an escalation to a human, a rephrase-and-retry. Ground truth that production gives you for free, and the best filter for which traces are worth a closer look.

The throughline of the table is principle 11: each row is something you will wish you had logged the first time a run goes wrong and you try to reconstruct it. The retrieved-context row and the tool-call row are the two teams skip most often and regret most — they are precisely what you need to tell a retrieval failure from a reasoning failure, and a tool that misfired from a model that hallucinated calling it. Log them at write time; you cannot add them to a trace after the request is gone.
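The table above can be read as a record schema. The following is a minimal sketch of that record, assuming Python dataclasses; the class names, field names, and the `kind` labels are illustrative and are not any vendor's trace format.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One step inside a request: a retrieval call, a model call, a tool call."""
    name: str
    kind: str                      # illustrative labels: "retrieval", "llm", "tool"
    latency_ms: float
    payload: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    """One user request end to end, carrying every row of the logging table."""
    trace_id: str
    input: str                     # the request as the system received it
    output: str                    # the final response returned to the user
    latency_ms: float              # wall-clock time for the whole request
    input_tokens: int = 0
    output_tokens: int = 0
    cached_tokens: int = 0
    cost_usd: float = 0.0
    spans: List[Span] = field(default_factory=list)
    retrieved_context: List[str] = field(default_factory=list)
    eval_score: Optional[float] = None    # written back by an online evaluator
    user_feedback: Optional[str] = None   # e.g. "thumbs_down", "escalated"

    def slowest_span(self) -> Optional[Span]:
        """The per-span breakdown: which step cost the most time."""
        return max(self.spans, key=lambda s: s.latency_ms, default=None)

    def tool_calls(self) -> List[Span]:
        """The audit trail of what the agent actually did."""
        return [s for s in self.spans if s.kind == "tool"]


# Hypothetical example request, captured at write time.
trace = Trace(
    trace_id="t-001",
    input="Where is my order?",
    output="It ships Friday.",
    latency_ms=1840.0,
    input_tokens=412,
    output_tokens=38,
    cost_usd=0.0031,
    spans=[
        Span("retrieve_orders", "retrieval", 220.0, {"query": "order status"}),
        Span("lookup_shipment", "tool", 310.0,
             {"args": {"order_id": "A17"}, "result": "ships Friday"}),
        Span("generate_answer", "llm", 1290.0),
    ],
    retrieved_context=["Order A17: ships Friday"],
    user_feedback="thumbs_up",
)
```

With a record like this, `trace.slowest_span()` points at the slow step and `trace.tool_calls()` lists the arguments and results of every tool invocation, which is the reconstruction the table is meant to make possible.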

Online evaluation: grading live traffic, not a frozen set

Tracing tells you what happened. Online evaluation scores it. The same judge or metric from the offline loop — an LLM-as-judge, a deterministic check — runs over live production traces instead of a curated eval set, giving you, in LangSmith's words, "real-time feedback on your production traces." The mechanic from llm-as-judge is identical; only the source of the input changes — instead of a case you authored, the harness pulls the variables out of each live trace and hands them to the judge as the request flows through.
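As a rough illustration of that mechanic, the sketch below maps the fields of one logged trace onto an evaluator's arguments and writes the score back onto the trace. Everything here is hypothetical: the trace is a plain dict, `context_overlap` is a toy deterministic check standing in for a real judge, and none of it is LangSmith's API.

```python
from typing import Callable, Dict, List


def context_overlap(inputs: str, output: str, context: List[str]) -> float:
    """Toy deterministic metric: share of output words found in the retrieved context."""
    context_words = set(" ".join(context).lower().split())
    output_words = output.lower().split()
    if not output_words:
        return 0.0
    hits = sum(word in context_words for word in output_words)
    return hits / len(output_words)


def score_trace(trace: Dict, evaluator: Callable[[str, str, List[str]], float]) -> Dict:
    """Pull the evaluator's variables out of a trace and return a copy with the score attached."""
    score = evaluator(
        trace["input"],
        trace["output"],
        trace.get("retrieved_context", []),
    )
    return {**trace, "eval_score": score}


# Hypothetical live trace: the answer says "weeks" where the context says "days".
live_trace = {
    "trace_id": "t-002",
    "input": "how fast do orders ship",
    "output": "orders ship within two weeks",
    "retrieved_context": ["orders ship within two days"],
}
scored = score_trace(live_trace, context_overlap)
```

The evaluator signature is the same one an offline harness would call with authored cases; only the caller changes. A word-overlap check is far too crude to trust in production and is used here only because it is deterministic and easy to follow.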

Two controls make this affordable and useful, and you set both:

Filter which runs to score. You do not evaluate every trace blindly. A filter decides which runs trigger evaluation — LangSmith lets you filter on user feedback that flagged a response as unsatisfactory, on a specific tool invocation, or on custom metadata like a customer tier, and notes that "filters on evaluators work the same way as when you're filtering traces in a project." Scoring the runs a user already thumbed-down is far higher-signal than scoring a random slice; the filter is how you point the judge at the traffic that matters.
Set a sampling rate. Even within the filtered set, you rarely need to grade every call. The sampling rate sets what fraction of matching traces actually get evaluated — set it to 0.1 and only 10 percent of matching traces are scored, which LangSmith documents as the lever for cost management. A judge call costs money and adds latency; the sampling rate is how you buy a statistically useful signal without paying to grade all of production. (One operational note: LangSmith auto-upgrades a trace to extended retention when an online evaluator runs on it, so the evaluated traces — the ones you most want to keep — are preserved.)
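The two controls compose into a single decision per run: does it match the filter, and does it fall inside the sampled fraction. The sketch below shows one way to express that decision, under stated assumptions. LangSmith applies both controls itself once an online evaluator is configured, so this is not its API; the run fields (`feedback`, `metadata`, `customer_tier`) and the hash-based sampling are illustrative choices.

```python
import hashlib
from typing import Dict


def matches_filter(run: Dict) -> bool:
    """Hypothetical filter: runs a user thumbed down, or runs from the enterprise tier."""
    thumbed_down = run.get("feedback") == "thumbs_down"
    enterprise = run.get("metadata", {}).get("customer_tier") == "enterprise"
    return thumbed_down or enterprise


def is_sampled(trace_id: str, sampling_rate: float) -> bool:
    """Deterministic sampling: hash the trace id into a 64-bit bucket and
    keep it if the bucket falls below the rate's share of the range."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sampling_rate * 2**64)


def should_evaluate(run: Dict, sampling_rate: float = 0.1) -> bool:
    """Score a run only if it matches the filter and lands in the sampled slice."""
    return matches_filter(run) and is_sampled(run["trace_id"], sampling_rate)
```

Hashing the trace id rather than drawing a random number means the same trace always gets the same decision, which keeps re-processing reproducible. That is a design choice of this sketch, not a claim about how any vendor samples.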

The point of online eval is not to grade everything; it is to keep a continuous, sampled read on quality as real traffic flows, so the number that proves the system worked at launch keeps getting computed on the traffic the system is serving now.

Catching the silent degrade: alert on the drop, not the crash

Here is the failure this whole half of the subcategory exists for. A crash pages you — the request 500s, the queue backs up, someone notices in minutes. A silent degrade does not: nothing throws, every request returns a plausible answer, latency looks fine, and the system is quietly getting worse. The faithfulness score slips from 0.94 to 0.78 over a week because a model upgrade shifted behavior. The acceptance rate drops because a new customer segment asks questions the grounding never covered. An upstream service starts returning a field in a new shape and retrieval quietly degrades. None of these is an error; all of them are the system failing at its actual job while the infrastructure dashboard stays green.

The offline eval set cannot catch this, because it is frozen — it keeps grading the same cases at the same scores, by construction, no matter what production traffic does. The thing that catches it is the online metric, watched over time: you chart the eval score and the production signals (acceptance, escalation, latency, cost) trace by trace, and you alert on a metric drop, not on an exception. When the faithfulness judge's rolling average falls below a threshold, that fires — the same way a latency spike or an error rate would — and you investigate before the trickle of complaints becomes a flood.
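A threshold on a rolling average is the simplest version of that alert. The sketch below assumes a fixed-size window over the most recent online eval scores; the class name, window size, and threshold are all illustrative, and a production setup would normally use the alerting built into the observability platform rather than hand-rolled code.

```python
from collections import deque
from typing import Optional


class MetricDropAlert:
    """Fires when the rolling mean of an online eval score falls below a threshold."""

    def __init__(self, window: int, threshold: float) -> None:
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, score: float) -> Optional[float]:
        """Record one score. Return the rolling mean if it breaches the
        threshold, or None if it does not (or the window is not yet full)."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return None
        mean = sum(self.scores) / len(self.scores)
        return mean if mean < self.threshold else None


# Hypothetical faithfulness scores: steady near 0.94, then a drop after a model upgrade.
stream = [0.94, 0.95, 0.93, 0.94, 0.94, 0.80, 0.78, 0.77, 0.78, 0.79]
alert = MetricDropAlert(window=5, threshold=0.85)
fired_at = [i for i, score in enumerate(stream) if alert.observe(score) is not None]
```

In this example no request ever errors, yet the alert fires from the eighth score onward, once enough low scores have entered the window to pull the mean under 0.85. The window size trades noise against delay: a wider window ignores single bad runs but takes longer to notice a real drop.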

This is the operational meaning of principle 1 — a demo is not a product. The demo proved the system could do the thing once; monitoring is what tells you it is still doing the thing, on traffic nobody scripted, on a model that moved underneath you. A system on a moving model degrades silently by default; the trace plus the online metric plus the alert is the apparatus that makes the degrade loud.

The composition: same discipline, two surfaces

Tracing and online evaluation are one discipline that runs in two places, picked — as always in this catalog — by where the system runs, never as rival products to choose between:

LangSmith — off-platform. When the system runs on a custom control loop or a LangGraph stack, LangSmith is where the traces land and the online evaluators run. You define the judge, attach a filter and a sampling rate, and it scores live production traces for real-time feedback, writing the score back onto each trace where you can chart it and alert on it. This is the online half of the same harness that ran the offline eval in eval datasets and metrics — one tool, the dataset side and the production side.
Agentforce — in-platform. When the agent runs on Agentforce inside the Salesforce security model, the trace is already being captured: session tracing records turns, messages, LLM calls, actions, metric scores, and feedback — the same columns as the table above — into Data 360. Salesforce's Export Session Tracing Data API serves that as OpenTelemetry (OTLP) — it extracts "unified trace, metric and log data" covering "every step of an Agentforce agent interaction" and returns it in OTel ResourceSpans format, so you can feed it straight into an OTLP-native observability platform (Splunk, Datadog, New Relic) or any OTLP collector with no conversion step. The quality scores ride along in the same export, so the monitoring story is the same as off-platform — score over time, alert on the drop — only the trace originates inside the platform and flows out through a standard rather than living in a separate harness.
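For a sense of what consuming an OTel ResourceSpans export involves, the sketch below walks a payload in the standard OTLP JSON encoding and groups spans by trace. It assumes the export has already been decoded into that JSON shape (`resourceSpans`, then `scopeSpans`, then `spans`); the span names and the `eval.faithfulness` attribute key are hypothetical and are not taken from Salesforce's documentation.

```python
from collections import defaultdict
from typing import Any, Dict, List


def attr_value(any_value: Dict[str, Any]) -> Any:
    """Unwrap an OTLP AnyValue such as {"stringValue": "x"} or {"doubleValue": 0.9}."""
    for key in ("stringValue", "doubleValue", "intValue", "boolValue"):
        if key in any_value:
            value = any_value[key]
            # OTLP JSON encodes 64-bit integers as decimal strings.
            return int(value) if key == "intValue" else value
    return None


def spans_by_trace(payload: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Flatten an OTLP ResourceSpans payload into {trace_id: [span summaries]}."""
    grouped = defaultdict(list)
    for resource_spans in payload.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            for span in scope_spans.get("spans", []):
                start_ns = int(span["startTimeUnixNano"])
                end_ns = int(span["endTimeUnixNano"])
                grouped[span["traceId"]].append({
                    "name": span["name"],
                    "duration_ms": (end_ns - start_ns) / 1e6,
                    "attributes": {
                        a["key"]: attr_value(a["value"])
                        for a in span.get("attributes", [])
                    },
                })
    return dict(grouped)


# Hypothetical payload with one span carrying a quality score as an attribute.
payload = {
    "resourceSpans": [{
        "scopeSpans": [{
            "spans": [{
                "traceId": "abc123",
                "spanId": "0001",
                "name": "llm_call",
                "startTimeUnixNano": "1700000000000000000",
                "endTimeUnixNano": "1700000000250000000",
                "attributes": [
                    {"key": "eval.faithfulness", "value": {"doubleValue": 0.91}},
                ],
            }],
        }],
    }],
}
traces = spans_by_trace(payload)
```

In practice an OTLP collector or the observability platform does this parsing for you; the point of the sketch is only that the score travels as span data, so it can be charted and alerted on like any other metric.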

You do not pick one and pledge loyalty. An agent built in Agentforce and grounded on Data 360 is traced and scored inside the platform, its OTel export feeding your central observability stack; a Claude-plus-LangGraph system off-platform is traced and online-evaluated in LangSmith. A team running both monitors each system where it runs and watches the same two things in both: the metric over time, and the alert when it drops. The discipline — trace every request, evaluate a sampled slice of live traffic, alert on the degrade — is identical; the surface only decides where the trace originates and which harness holds it.

The throughline

Offline evaluation is frozen by design, which is what makes it a trustworthy regression gate and also what blinds it to the silent degrade: the model upgrade, the distribution shift, the upstream change that moves the output while every offline case stays green. Production observability is the other half. Trace every request (input, output, latency, cost and tokens, tool calls, retrieved context, the eval score, user feedback) so any run can be reconstructed from the trace alone. Run a judge or metric over live traces for real-time feedback, filtered to the runs that matter and sampled so you are not grading every call. Then chart the score over time and alert on a metric drop, not on a crash, because the degrade that costs you most is the one that never throws an error. Do it in LangSmith off-platform, or in-platform through Agentforce's session tracing in Data 360 and its OTel export: same discipline, two surfaces. Offline eval proves the system was good enough to ship; tracing and monitoring is how you know it stayed good once it did.

LLM-as-judge — the same judge, run online over live traces; its offline-and-online section is the bridge into this page
What is evaluation — the offline/online split this page builds the online half of
Eval datasets and metrics — the offline harness whose online counterpart runs here, on the same LangSmith surface
Evaluation gotchas — the ways monitoring itself misleads you, and how to read a metric move honestly
Agentforce testing and observability — the in-platform surface this page exports from, in depth
Debugging evals — what you do once a metric drops and you open the trace to find out why
Evaluation Style Guide — the bar an online evaluator clears before you trust its scores enough to alert on them
Debugging agents — reading the trace of a single bad run, the per-request view this page aggregates
Tools and actions — every tool call the trace records is a blast radius the audit trail exists to cover
What is grounding — the retrieved context the trace logs, so you can tell a retrieval failure from a reasoning one
AI Engineering principles — trace everything (11), a demo is not a product (1), cost and latency are features (6)
