Deploying to production: the safe path from a passing eval to live traffic
The eval is green — now how do you actually ship it without learning the hard way that green offline is not green in production? Six steps that take a prompt, model, or agent change from a passing test to live traffic with a way back: build and test in an isolated environment (Agentforce DX moves agent metadata between scratch orgs, sandboxes, and prod; off-platform, a staging environment), pass the eval gate before merge, version the change so you know exactly what shipped, roll out gradually behind a canary instead of flipping 100 percent at once, keep a one-step rollback ready, and monitor on live traffic after — because the silent degrade a frozen set can't see is caught by online eval and tracing. The throughline: deployment is not the finish line. It's where evaluation and observability start doing their real work.
The eval passed. The instinct is to ship, and shipping straight from a green offline run to 100 percent of production is exactly how a change that looked safe becomes the incident nobody saw coming. A passing eval is necessary, not sufficient: it proves the change is at least as good as the last one on the cases you froze, and production is the distribution you didn't freeze (see What is production readiness). Safe deployment is the engineering between "the test is green" and "real traffic is on it." The whole point of that engineering is that every step has a way back, so the worst case is a fast recovery instead of a fix-forward scramble while the system is broken in front of customers.
This is a process how-to: a numbered path from a passing eval to live traffic, each step the thing you do and why it matters. None of it is toolkit-specific. Production and governance is one discipline over two surfaces, composed by where the system runs (principle 7): on Salesforce, Agentforce changes move through Agentforce DX with the platform's release path, sandboxes, and the Einstein Trust Layer governing the language side by construction; off-platform, the Claude API plus your own infrastructure means you build the staging environment, the gate, the versioning, the canary, and the rollback yourself. The steps below are the same on both surfaces; only who builds them differs.
The safe-deploy path
1. Build and test in an isolated environment
Never author or change an AI system directly against production. The first move is an environment where a mistake costs nothing — where you can break the prompt, mis-wire a tool, or ship a bad model config and the only thing watching is you, not a customer. The demo-to-prod gap (production gotcha 9) lives precisely in the changes that "looked fine" against the one input you tried; an isolated environment is where you try the inputs that aren't fine before they reach traffic.
On Salesforce, this is what Agentforce DX is for. It "provides CLI commands and a VS Code extension to create, preview, and test agents outside Agentforce Studio, and to move agent metadata between your DX project and your scratch orgs, sandboxes, and production orgs" — so you develop in a scratch org or sandbox, keep your version control as "the source of truth," and promote the same metadata up to production through a path the platform manages. Off-platform, the equivalent is a staging environment that mirrors production — same model, same retrieval, same tools pointed at non-production data — so what you test is what you ship. The question this step answers: is there anywhere you can get a change wrong without a customer being the one who finds out? If the answer is "only in production," you don't have a deploy process yet.
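For the off-platform case, "staging mirrors production" can be checked mechanically rather than assumed. The sketch below is a minimal illustration, not a prescribed layout: the `EnvConfig` fields and the `staging_drift` helper are hypothetical names, and the only assumption is that model, retrieval, and tools should match while the data source should not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Hypothetical description of one environment (names are illustrative)."""
    model: str
    retriever: str
    tools: tuple
    data_source: str  # the one field that is supposed to differ


def staging_drift(prod: EnvConfig, staging: EnvConfig) -> list:
    """Return a list of ways staging fails to mirror production."""
    problems = []
    # Model, retrieval, and tools must match, so what you test is what you ship.
    for field in ("model", "retriever", "tools"):
        prod_value = getattr(prod, field)
        staging_value = getattr(staging, field)
        if prod_value != staging_value:
            problems.append(f"{field}: prod={prod_value!r} staging={staging_value!r}")
    # Data must NOT match: staging tools point at non-production data.
    if prod.data_source == staging.data_source:
        problems.append("data_source: staging points at production data")
    return problems
```

An empty list means staging is a faithful mirror; anything else is a reason not to trust a result obtained there.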
2. Pass the eval gate
The frozen eval set runs as a pre-deploy, pre-merge check — and the rule is binary: the baseline is beaten or held, or the change does not ship. This is the gate the whole evaluation discipline exists to power; "eval every change" is the sentence every Style Guide in this catalog points at, and this is where it bites for a deploy. A change that drops a case below baseline is blocked at the gate, not discovered in production — that is the difference between a red test and a customer complaint.
The mechanics: hold the set fixed, score the new version against it, and compare to the last known-good baseline. A drop blocks the merge; a hold or a gain clears it. On Salesforce, Agentforce Testing Center runs cases against the agent before release; off-platform, the eval run is wired into the merge so the change can't land without a green score (see the Evaluation Style Guide for what to measure and how, and eval datasets and metrics for the set itself). The question: is "it's ready" backed by a scored run over the hard cases — or by a green offline pass on the happy path and a hope about the rest? An eval that doesn't gate the merge is documentation; an eval wired into the merge is what stops the silent regression.
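The gate mechanics above can be sketched as a per-case comparison. This is a minimal sketch under stated assumptions: scores are floats keyed by a case id, higher is better, and `eval_gate` and its `tolerance` parameter are hypothetical names rather than any toolkit's API.

```python
def eval_gate(baseline: dict, candidate: dict, tolerance: float = 0.0) -> list:
    """Compare a candidate run to the last known-good baseline, case by case.

    Returns the list of blocking findings; an empty list means the gate clears.
    """
    blocking = []
    for case_id, base_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            # The set is held fixed: a case that did not run counts as a failure.
            blocking.append(f"{case_id}: missing from candidate run")
        elif new_score < base_score - tolerance:
            # A drop below baseline blocks the merge.
            blocking.append(f"{case_id}: {base_score:.2f} -> {new_score:.2f}")
    return blocking
```

Wired into a merge check, the calling script would exit non-zero whenever the returned list is non-empty, so a hold or a gain clears the gate and any drop blocks it.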
3. Version the change
Whatever ships — the prompt, the model version, the agent config, the tool definitions — is versioned, so you know exactly what went out and can compare it against what came before. A change you can't name is a change you can't roll back to or away from, and an AI system has more moving parts than a code deploy: a one-word prompt edit, a model-version bump, and a retriever swap can each shift behavior across the whole input distribution, and if they shipped together unversioned you can't tell which one moved the number.
On Salesforce, Agentforce DX treats your version control as the source of truth for the agent metadata — "periodically check the new and updated metadata into your VCS, such as GitHub" — so each agent version is a tracked, diffable artifact. Off-platform, you version the prompt, the model config, and the tool schema in source control the same way you version code, tagged so a given production state maps to a known commit. This is what makes step 2's before-versus-after comparison meaningful and step 5's rollback possible: you can only return to the last known-good version if you know precisely what it was. The question: if someone asked "what exactly is running in production right now," could you point to a version — or would you be reconstructing it from memory?
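One way to make "what exactly is running" answerable off-platform is to derive a single identifier from everything that shipped. The sketch below assumes the prompt is a string and the model config and tool schemas are JSON-serializable; `version_id` is a hypothetical helper, and in practice the id would sit alongside the source-control tag rather than replace it.

```python
import hashlib
import json


def version_id(prompt: str, model_config: dict, tool_schemas: list) -> str:
    """Derive a short, stable id from the prompt, model config, and tool schemas."""
    manifest = {
        "prompt": prompt,
        "model_config": model_config,
        "tool_schemas": tool_schemas,
    }
    # Canonical JSON (sorted keys, fixed separators) so key order never changes the id.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```

Because the id covers all three parts together, a one-word prompt edit or a model-version bump each produces a different id, which is what lets a before-versus-after comparison name the change it is comparing.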
4. Roll out gradually
A change that cleared the gate still goes to a slice of traffic first, not to everyone at once. A canary — a small fraction of real requests routed to the new version while the rest stay on the old — turns the blast radius of a bad change from "every user" into "a few percent of users, briefly." The offline eval is a curated sample; the canary is the first contact with the distribution you didn't curate, and it's where a regression the frozen set couldn't represent shows up while it's still cheap to reverse.
Watch the canary on live signals, not faith: the online eval and tracing scores running over the canary traffic, plus the operational metrics of latency, error rate, and cost per call (see Cost and latency). If the slice holds the baseline and the metrics stay healthy, widen it; if it degrades, you roll back having affected a fraction of production instead of all of it. On Salesforce, you stage the rollout through the platform's release controls; off-platform, you build the traffic split and the per-version metrics yourself. The question: when this change goes live, does it reach a slice you're watching first, or 100 percent of traffic the moment it merges?
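Off-platform, the traffic split and the widen-or-revert decision might look like the sketch below. Everything named here is an assumption for illustration: the `route` and `canary_decision` helpers, the metric keys `eval_score` and `error_rate`, and the threshold defaults, which a real team would set from its own baseline.

```python
import hashlib


def route(request_id: str, canary_percent: float) -> str:
    """Deterministically assign a request to 'canary' or 'stable'.

    Hashing the id keeps the same request (or user) on the same side every time.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"


def canary_decision(stable: dict, canary: dict,
                    max_score_drop: float = 0.02,
                    max_error_rate_rise: float = 0.01) -> str:
    """Return 'widen' if the canary holds the baseline, else 'rollback'."""
    if canary["eval_score"] < stable["eval_score"] - max_score_drop:
        return "rollback"
    if canary["error_rate"] > stable["error_rate"] + max_error_rate_rise:
        return "rollback"
    return "widen"
```

The split is keyed on a hash rather than a random draw so that per-version metrics stay comparable across retries, and widening the rollout is only a matter of raising `canary_percent`.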
5. Keep a rollback ready
Before the rollout widens, there is a one-step way back to the last known-good version — ready, owned, and tested before you need it. This is the direct fix for the "deploy with no rollback" trap (production gotcha 7): if your only recovery from a bad change is to author and ship a fix forward, the degrade runs for the length of that fix while production is broken. A rollback you can pull in one move turns a bad deploy from an incident into a non-event.
What "one step" requires is the versioning from step 3: the previous prompt, model config, and agent metadata are retained and promotable, so reverting is selecting the last good version, not rebuilding it. On Salesforce, the prior agent metadata lives in version control and redeploys through Agentforce DX; off-platform, the previous version stays one switch away — a config flip, not a code change and a deploy. Test the rollback before launch the same way you'd test a kill switch (production gotcha 10) — a rollback you've never exercised is a rollback you're trusting on faith at the worst possible moment. The question: if the change you just shipped turns out bad, can you put the last good version back in one move — or is recovery a fix-forward scramble?
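The "config flip, not a code change and a deploy" idea can be sketched as a registry that retains every version and moves only a pointer. `VersionRegistry` is a hypothetical, in-memory illustration; it assumes the previously active version is the last known-good one, and a real system would persist this state.

```python
from typing import Optional


class VersionRegistry:
    """Keeps every shipped version promotable; activating one is a pointer move."""

    def __init__(self) -> None:
        self._versions = {}  # version id -> retained artifact (prompt, config, ...)
        self._history = []   # activation order; the last entry is live

    def register(self, version_id: str, artifact: dict) -> None:
        self._versions[version_id] = artifact

    def activate(self, version_id: str) -> None:
        if version_id not in self._versions:
            raise KeyError(f"unknown version: {version_id}")
        self._history.append(version_id)

    @property
    def active(self) -> Optional[str]:
        return self._history[-1] if self._history else None

    def rollback(self) -> str:
        """One step back to the previously active version; nothing is rebuilt."""
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()
        return self._history[-1]
```

Exercising `rollback()` in staging before launch is the cheap form of the test the paragraph above asks for: it confirms that the previous artifact is still retained and that the flip works.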
6. Monitor after
Deployment is not the end of evaluation; it's where the online half starts. A frozen eval set is, by construction, blind to the drift it doesn't contain: a shift in what users ask, a model-provider change underneath you, an edge case the set never had. Online eval and tracing on live traffic is what catches the silent degrade the frozen set can't see. The same trace stream also serves as the audit trail and as the data a human reviews when accountable for an outcome.
The mechanics: keep scoring a sample of live traffic against your criteria after the rollout completes, watch the operational metrics, and feed the production failures back into the eval set so the next change that would re-introduce them trips a red case at step 2 instead of a complaint in the field. That loop — production surfaces a failure, the failure becomes a frozen case, the gate catches it next time — is what keeps the eval set honest as the distribution keeps moving. On Salesforce, Agentforce Observability exports the production traces and metric scores; off-platform, you run the online scoring and the trace pipeline. The question: after this change is fully live, is anything watching the traffic the offline eval couldn't represent — or did monitoring stop at the green pre-deploy run?
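The feedback loop described above can be sketched in two small functions. This is an illustration under assumptions, not a pipeline: traces are dicts with hypothetical `trace_id` and `input` keys, `score_fn` stands in for whatever online scorer you run, and the eval set is a plain dict keyed by case id.

```python
import random


def sample_and_score(traces, score_fn, sample_rate, threshold, rng):
    """Score a random sample of live traces; return the ones below threshold."""
    failures = []
    for trace in traces:
        if rng.random() >= sample_rate:
            continue  # not in this sample
        if score_fn(trace) < threshold:
            failures.append(trace)
    return failures


def add_to_eval_set(eval_set: dict, failures: list) -> int:
    """Freeze production failures as new eval cases; return how many were added."""
    added = 0
    for trace in failures:
        case_id = trace["trace_id"]
        if case_id not in eval_set:
            eval_set[case_id] = {"input": trace["input"], "source": "production"}
            added += 1
    return added
```

A caller would pass `random.Random()` (seeded in tests) as `rng`. Cases added this way still need an expected behavior written for them before they can score a future change at the gate.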
A pre- and post-deploy checklist
The six steps as a gate you can run down before and after a change goes out:
| Stage | Check | If it fails |
|---|---|---|
| Before | Built and tested in an isolated environment (scratch org / sandbox / staging), never against prod | Stop — get an environment where a mistake costs nothing first |
| Before | Eval set scored against the change; baseline beaten or held | Blocked — the change does not merge until the score recovers |
| Before | The exact prompt / model / config version is tagged and tracked | Stop — an unversioned change can't be rolled back to or away from |
| Before | Rollback path exists, is owned, and has been tested | Stop — a rollback you've never exercised is faith, not a plan |
| Rollout | Change reaches a canary slice first, watched on online eval + metrics | Stop — never go straight to 100 percent of traffic |
| After | Online eval + tracing running on live traffic; failures feed back into the set | Stop — without it the silent degrade ships unseen |
The checklist is the operational form of the throughline: every row is a step that has a way back, and the column on the right is what to do when that way back is missing.
The throughline
Deployment is not the finish line — it's where evaluation and observability start doing their real work. The green offline run earns a change the right to go to a slice of traffic behind a rollback, not the right to go to everyone; the eval gate proves the change is safe against the cases you have, the canary tests it against the cases you don't, the rollback is the seatbelt for when the canary is wrong, and the online monitor is what notices the slow degrade that no pre-deploy check could. The model was the easy part, and even the eval was only half the job — the other half is the discipline of shipping it so that when it's wrong, and a non-deterministic system eventually will be, the worst case is a fast revert and a new frozen case instead of a customer-facing incident with no way back. Ship behind a gate, roll out gradually, keep the way back ready, and watch what you couldn't test.
If your team ships AI to production using a step that this path is missing, write to hello@wearecleon.com and we will add it, with credit.
Related
- What is production readiness — the six dimensions this deploy path operationalizes
- Production gotchas — the deploy-with-no-rollback trap (7) and the demo-to-prod gap (9) these steps close
- Cost and latency — the operational metrics the canary and the post-deploy monitor watch
- PII and governance — the audit trail the post-deploy trace stream doubles as
- Human-in-the-loop and accountability — who owns the rollback decision and reads the live traces
- Production Style Guide — the bar a change clears before it ships
- Evaluation Style Guide — the "eval every change" gate step 2 invokes
- Eval datasets and metrics — the frozen set the gate scores against and failures feed back into
- Tracing and monitoring — the online eval and trace stream steps 4 and 6 depend on
- Agentforce agents — Agentforce DX, the Trust Layer, and the platform release path in full
- AI Engineering principles — a demo is not a product (1), if you can't evaluate it you can't ship it (3), non-determinism needs a gate (8), trace everything (11)