Agentforce testing and observability: evaluating the agent where it lives

Reach for Agentforce testing and observability when the agent you are evaluating runs on Agentforce, inside the Salesforce security model, over Data 360. This is the platform-native build of the eval loop that the What is evaluation page lays out (define success criteria, build an eval set, grade, iterate), except that the grading surface and the production trace are part of the platform you already run the agent on. The reason to pick it is fit, not loyalty: when the agent lives on Agentforce, you test it and watch it where it lives, the eval inherits the security model by construction, and the trace lands next to the data the agent acted on.

This page covers how that evaluation is assembled, in its before-deploy and after-deploy halves, and what you own versus what the platform owns. You write the test cases and read the traces; the platform runs the cases against the agent, scores them, and emits the production trace in a standard format. It is one instrument, not the whole kit: the same agent is often tested in Testing Center, graded at the model layer with an LLM-as-judge, and traced off-platform in LangSmith, each surface doing the part it is built for. The vocabulary here (eval set, ground truth, regression, online versus offline) comes from What is evaluation; this page assumes it and shows where each piece lands on Agentforce.

Before deploy: Testing Center

Testing Center is the surface for testing an Agentforce agent before it ships: the offline eval described in What is evaluation, run against the agent in your development org rather than in a lab. You give it a set of test cases, each pairing an utterance the user might send with a description of what a correct response looks like, and it runs the agent over them and scores the result. It is regression testing for an agent: change a topic, an action, or an instruction, re-run the cases, and see whether the score held before any user is exposed.
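
The re-run-and-compare step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Testing Center's own scoring: CaseResult, find_regressions, and pass_rate are hypothetical names, and the case ids are made up. It shows why per-case comparison is worth doing alongside the aggregate score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str   # hypothetical identifier for one test case
    passed: bool   # whether the case passed in this run


def pass_rate(results):
    """Fraction of cases that passed; 0.0 for an empty run."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def find_regressions(baseline, candidate):
    """Ids of cases that passed in the baseline run but fail in the candidate run."""
    before = {r.case_id: r.passed for r in baseline}
    return sorted(
        r.case_id
        for r in candidate
        if before.get(r.case_id, False) and not r.passed
    )


# Made-up results from before and after an instruction change.
baseline = [
    CaseResult("refund-status", True),
    CaseResult("cancel-order", True),
    CaseResult("update-address", False),
]
candidate = [
    CaseResult("refund-status", True),
    CaseResult("cancel-order", False),
    CaseResult("update-address", True),
]

print(pass_rate(baseline), pass_rate(candidate))
print(find_regressions(baseline, candidate))
```

In this made-up data the pass rate is identical across the two runs, yet one case that used to pass now fails. An aggregate score that "held" can hide a regression, so the per-case diff is the safer thing to gate on.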

There are three ways into it, and they map to three audiences — the same low-code-to-pro-code ladder the rest of the platform offers.

The Testing Center UI is the low-code path: build and run test cases in Setup, no code, the surface an admin or agent-builder uses to test the agent they configured. It can also generate test cases for you from the agent's own subagents, actions, and knowledge sources, so the eval set does not start from a blank page.
Agentforce DX is the pro-code path, and Salesforce frames it as the pro-code equivalent of the Testing Center UI. You generate a test spec (a local YAML file describing the test cases) with the agent generate test-spec CLI command, customize it in your editor, then run agent test create to register it in the dev org and agent test run to execute it. The spec lives in version control next to the agent's other metadata, which is the point: the eval set is reviewed, diffed, and shipped like code, not maintained by hand in a UI.
The Testing API is the programmatic path — a REST API for building tests that automate the evaluation and assess many requests in a short time. It is the surface you reach for when the eval set is large enough that batch runs matter, or when testing has to run from CI rather than a person clicking through Setup. (Salesforce also documents running tests through the Connect API as a related path.)

The three are the same eval, configured for three contexts: the UI when a builder owns the agent, DX when the eval set belongs in version control, the API when it has to run in a pipeline. Reach for the narrowest one that fits, the same way you would pick a no-code retriever before a custom one on the grounding side.
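
Once the eval set lives in version control, it can be linted like any other file before it is registered. The sketch below is one way that might look, assuming the test cases have already been loaded into plain dictionaries. The field names (utterance, expectedTopic, expectedActions, expectedOutcome) are assumptions for illustration and should be checked against the spec your CLI version generates; lint_cases is a hypothetical helper, not part of Agentforce DX.

```python
# Assumed field names; verify them against a generated test spec.
EXPECTATION_FIELDS = ("expectedTopic", "expectedActions", "expectedOutcome")


def lint_cases(cases):
    """Return a list of human-readable problems; an empty list means the set is clean."""
    problems = []
    seen = set()
    for i, case in enumerate(cases):
        utterance = (case.get("utterance") or "").strip()
        if not utterance:
            problems.append(f"case {i}: missing utterance")
            continue
        key = utterance.lower()
        if key in seen:
            # Duplicate utterances inflate the score without adding coverage.
            problems.append(f"case {i}: duplicate utterance {utterance!r}")
        seen.add(key)
        if not any(case.get(name) for name in EXPECTATION_FIELDS):
            # A case with nothing to check can never fail, so it proves nothing.
            problems.append(f"case {i}: no expectation to check")
    return problems


# Made-up cases; topic and action names are illustrative only.
cases = [
    {
        "utterance": "Where is my order?",
        "expectedTopic": "Order_Status",
        "expectedActions": ["Get_Order"],
        "expectedOutcome": "Tells the customer the current shipping status.",
    },
    {"utterance": "where is my order?", "expectedTopic": "Order_Status"},
    {"utterance": "Cancel my subscription"},
    {"utterance": "   "},
]

for problem in lint_cases(cases):
    print(problem)
```

A check like this fits naturally as a pre-commit or CI step ahead of the registration and run commands, so a malformed case is caught in review rather than in the org.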

The three things a test case checks

A test case does not just ask "was the answer good?" — it checks three specific things, each isolating a different stage of what the agent did. This is what makes an agent test diagnostic rather than a vibe check: when a case fails, the type that failed tells you where it broke.

Test type	What it checks	Passes when
Topic	Did the agent route to the expected topic — the right area of its instructions for this utterance?	The agent selects the topic (subagent) the case expects.
Action	Did the agent invoke the expected action(s) — the right tools, in the right place?	The agent calls the action(s) the case names, by their API name.
Outcome	Does the actual outcome match the expected one — a natural-language description of the result?	The actual response matches the expected outcome's gist, evaluated semantically — so it passes even when the wording differs.

The split matters because the three fail independently and for different reasons. A Topic miss means routing went wrong: the agent picked the wrong area of its instructions before it did anything, which is the failure that Debugging agents traces first. An Action miss means it routed correctly but invoked the wrong tool, or skipped one it should have called; that is a problem in the layer Tools and actions covers. An Outcome miss means it routed and acted correctly but the response was still wrong. Scoring the outcome on its natural-language gist rather than an exact string is what lets the test survive harmless rewording, the same reason LLM-as-judge exists for prose a deterministic check can't grade. When you do need an exact check, such as a specific string or number that must appear verbatim, the test spec carries a separate optional custom evaluation field for it, alongside the semantic outcome.

That separation is the whole value of an agent-shaped eval. "The answer was wrong" is a complaint; "the Topic was right, the Action was right, the Outcome missed" is a diagnosis you can act on.
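
The three-way diagnosis can be sketched as three independent checks over one run. Everything here is hypothetical and for illustration only: AgentTestCase, AgentRun, check, and first_failure are invented names, and stub_judge is a keyword placeholder standing in for semantic grading, which a keyword match does not reproduce.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentTestCase:
    utterance: str
    expected_topic: str
    expected_actions: list
    expected_outcome: str


@dataclass
class AgentRun:
    topic: str      # topic the agent routed to
    actions: list   # API names of the actions it invoked
    response: str   # final text returned to the user


def check(case: AgentTestCase, run: AgentRun,
          outcome_matches: Callable[[str, str], bool]) -> dict:
    """Evaluate the three checks independently so each failure is visible."""
    return {
        "topic": run.topic == case.expected_topic,
        "action": all(a in run.actions for a in case.expected_actions),
        "outcome": outcome_matches(case.expected_outcome, run.response),
    }


def first_failure(results: dict) -> str:
    """Earliest stage that failed, in pipeline order, or 'pass'."""
    for stage in ("topic", "action", "outcome"):
        if not results[stage]:
            return stage
    return "pass"


def stub_judge(expected: str, actual: str) -> bool:
    # Placeholder only: real outcome grading is semantic, not keyword overlap.
    text = actual.lower()
    return "refund" in text and "issued" in text


case = AgentTestCase(
    utterance="I want my money back for order 1042",
    expected_topic="Refunds",
    expected_actions=["Lookup_Order", "Issue_Refund"],
    expected_outcome="Tells the customer the refund was issued.",
)
skipped_action = AgentRun("Refunds", ["Lookup_Order"], "I found your order.")
good_run = AgentRun("Refunds", ["Lookup_Order", "Issue_Refund"],
                    "Your refund has been issued.")

print(first_failure(check(case, skipped_action, stub_judge)))  # action
print(first_failure(check(case, good_run, stub_judge)))        # pass
```

Reporting the earliest failed stage is a design choice: a skipped action usually causes the outcome to miss as well, so the upstream failure is the one worth fixing first.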

After deploy: Agentforce Observability

Testing Center proves the agent is safe to ship. It cannot tell you whether the agent stayed good once real traffic hit it, which is the silent-degrade problem that What is evaluation names as the job of online evaluation. That is the job of Agentforce Observability: the after-deploy half, watching what the agent actually does in production rather than what it did against a curated set.

The foundation is the session trace — the replay of a real production session, exported through the Agentforce Session Trace OTel API in OpenTelemetry (OTLP) format, the open standard observability platforms and OTLP collectors already speak. A trace captures the full anatomy of a session: turns, messages, LLM calls, actions, metric scores, and feedback — every step from the user's utterance through the model calls and tool invocations to the response, with the scores and feedback attached. This is principle 11 made concrete on the platform — you can't debug, or evaluate, what you can't replay — and the trace is exactly that replay, in a format you can pull into the observability tooling you already run.
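
Because the export is OTLP, a trace can be read with nothing more than the standard OTLP/JSON shape (resourceSpans, then scopeSpans, then spans, each span pointing at its parent). The sketch below rebuilds the session hierarchy from such a payload. The payload is hand-written, the span ids are shortened, and the span names (session, turn, llm_call, action) are assumptions for illustration, not the names Agentforce necessarily emits.

```python
import json
from collections import defaultdict

# Hand-written payload in OTLP/JSON shape; names and ids are illustrative.
SAMPLE = json.loads("""
{"resourceSpans": [{"scopeSpans": [{"spans": [
  {"spanId": "a1", "name": "session"},
  {"spanId": "b1", "parentSpanId": "a1", "name": "turn"},
  {"spanId": "c1", "parentSpanId": "b1", "name": "llm_call"},
  {"spanId": "c2", "parentSpanId": "b1", "name": "action"}
]}]}]}
""")


def iter_spans(otlp):
    """Yield every span in an OTLP/JSON payload, regardless of resource or scope."""
    for resource_spans in otlp.get("resourceSpans", []):
        for scope_spans in resource_spans.get("scopeSpans", []):
            yield from scope_spans.get("spans", [])


def outline(otlp):
    """Return the span tree as indented lines, roots first, in payload order."""
    children = defaultdict(list)
    for span in iter_spans(otlp):
        # A root span has no parentSpanId (or an empty one).
        children[span.get("parentSpanId", "")].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + span["name"])
            walk(span["spanId"], depth + 1)

    walk("", 0)
    return lines


print("\n".join(outline(SAMPLE)))
```

For real traces, siblings would more sensibly be ordered by each span's start time than by payload order; the sketch leaves that out to keep the tree-building step visible.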

The traces are stored in Data 360, and the API queries them with Data 360 SQL. That placement is the platform's quiet advantage: the production record of what the agent did lands in the same unified store the agent grounds on through its retrievers, so the trace of an answer sits next to the data that answer was built from. On top of the trace, Observability surfaces quality scores and flags for low-performing topics — the online-eval signal that points you at which part of the agent is drifting, not just that something is. The surfaces live in Setup, under Einstein — audit, analytics, and monitoring — the platform's home for what the agent is doing now versus what the test cases said it should.
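
The idea behind a low-performing-topic flag can be shown with a small aggregation. This is a sketch of the concept, not how Observability computes its flags: the rows stand in for a query result, the column names (topic, quality_score) are hypothetical, and the threshold and minimum session count are arbitrary assumptions.

```python
from collections import defaultdict
from statistics import mean


def low_performing_topics(rows, threshold=0.7, min_sessions=2):
    """Map each topic whose mean score is below threshold to that mean.

    Topics with fewer than min_sessions scored sessions are skipped, since a
    single bad session is weak evidence of drift.
    """
    scores = defaultdict(list)
    for row in rows:
        scores[row["topic"]].append(row["quality_score"])

    flagged = {}
    for topic, values in scores.items():
        average = mean(values)
        if len(values) >= min_sessions and average < threshold:
            flagged[topic] = round(average, 2)
    return flagged


# Made-up rows standing in for scored production sessions.
rows = [
    {"topic": "Order_Status", "quality_score": 0.9},
    {"topic": "Order_Status", "quality_score": 0.8},
    {"topic": "Refunds", "quality_score": 0.4},
    {"topic": "Refunds", "quality_score": 0.6},
    {"topic": "Billing", "quality_score": 0.2},
]

print(low_performing_topics(rows))  # {'Refunds': 0.5}
```

Grouping by topic before thresholding is what turns "quality dropped" into "quality dropped in Refunds", which is the part of the signal you can act on.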

The composition point: Testing Center here, the model layer and LangSmith there

Agentforce testing and observability is one instrument, and the honest framing is composition, not choice (principle 7). The platform-native path is the right one when the agent runs on Agentforce inside the security model — you test it where builders own it, the eval inherits the security model, and the production trace lands in Data 360 next to the data the agent acts on. The model layer is where you grade the raw model behavior independent of where it is embedded — the define-criteria-then-grade loop with an LLM-as-judge for everything a deterministic check can't reach. LangSmith is the surface when the system runs off-platform on a custom control loop — datasets, offline eval, and online eval over live traces, in one place off the platform.

A real system frequently uses more than one. An agent grounded on Data 360 and orchestrated partly off-platform might be tested in Testing Center, graded at the model layer with an LLM judge, and traced in LangSmith — each surface doing the part it is built for, the same way an Agentforce retriever and an external RAG pipeline compose behind one agent. The skill is matching the eval surface to where the agent runs, not defending the instrument.

And the precondition is the one that Eval datasets and metrics sets everywhere: Testing Center runs whatever cases you give it, so the test is only as honest as the eval set under it. A handful of utterances you invented at a desk is the vibe check wearing a UI; the cases worth running are the ones drawn from real failures: the topics that got mis-routed, the actions that got skipped, the outcomes a user complained about. The platform runs the test and emits the trace; you still own whether the cases are worth running. (See Evaluation gotchas for the failure modes a platform-native eval still inherits, Tracing and monitoring for an in-depth reading of the trace this page exports, Debugging evals for when the eval itself misleads you, and the Evaluation Style Guide for the bar an eval clears before you trust it.)

The throughline

Agentforce testing and observability is the platform-native half of the eval spine. Before deploy: Testing Center runs cases against the agent — through the low-code UI, the pro-code Agentforce DX path with a YAML test spec generated by agent generate test-spec, or the Testing API for batch runs — and each case checks three things: the expected topic, the expected actions, and the expected outcome as a natural-language match. After deploy: Observability exports the session trace in OpenTelemetry (OTLP) format — turns, messages, LLM calls, actions, metric scores, feedback — stores it in Data 360, and flags low-performing topics. You own the cases and the reading of the traces; the platform runs the cases, scores them, and emits the trace inside the security model the agent already runs in. That division is the appeal and the limit: when the agent lives on Agentforce, it is the most direct path to a governed eval next to the data; when the system runs off-platform, that is the signal to grade at the model layer and trace in LangSmith. The skill is knowing where the agent runs, not defending the surface.

Related

What is evaluation — the eval loop this page builds in Agentforce terms, and the offline/online split it lands on the platform
Eval datasets and metrics — the eval set Testing Center runs, and why it is only as honest as the cases under it
LLM-as-judge — the semantic grading the Outcome check uses, and the model-layer surface this page composes with
Tracing and monitoring — the session trace this page exports, read in depth as the basis of online evaluation
Debugging evals — when the eval itself misleads you: a judge that scores wrong, a set that drifted
Evaluation gotchas — the failure modes a platform-native eval still inherits
Evaluation Style Guide — the bar an eval clears before you trust its scores
Agentforce retrievers — the grounding the agent under test draws on, and where its production traces land in Data 360
Debugging agents — tracing a run when it goes wrong, the same replay observability depends on
Tools and actions — the actions the Action test type checks the agent invoked
AI Engineering principles — if you can't evaluate it you can't ship it (3), trace everything (11), compose the toolkit (7)
