What is production readiness? The gap between a demo that works and an AI that runs on Monday

A demo answers one question: can the system do the thing at all, once, on an input you chose, in a room where nothing is at stake. Production answers a different one: does it keep doing the thing on the inputs you didn't choose, at the volume you can't control, when a wrong answer costs money or a wrong action can't be taken back. Almost every AI project that dies, dies in the space between those two questions — the model worked in the demo, and then nobody built the engineering that a demo never forces you to build. This is principle 1: a demo is not a product, and the gap is the engineering. Production readiness is the discipline of closing that gap on purpose, before a real user finds the edge you didn't.

The reason a demo lies is that it never meets the four things production meets every day. It never meets the long tail — the thousandth user phrasing the request a way your handful of test cases never did. It never meets the adversarial input — the person trying to make the system say or do something it shouldn't. It never meets the cost at scale — a per-call price that is a rounding error in a demo and the whole business case at a million calls a month (principle 6). And it never meets the irreversible action — the demo reads; production writes, refunds, deletes, and sends, and you cannot un-send an email to a customer. The map below is the set of dimensions that turn "it worked once" into "it runs unattended," and the spine that says where each one gets built.

The dimensions of production readiness

A system is production-ready when each of these is answered on purpose, not left to chance. Each is a discipline with its own page in this subcategory; here is the one-line version of what each means and the specific failure that hits you if you skip it.

Dimension	What it means	The failure if you skip it
Cost	Knowing the per-call price and the total at real volume — token budgets, the model-tier choice, caching the repeated context.	The bill that was trivial in the demo becomes the whole business case at scale, and the system gets killed for economics, not quality.
Latency	The answer arrives fast enough to be useful — model choice, prompt and output length, streaming so the user sees a response forming.	A correct answer that lands too slowly is a wrong answer to the user; they abandon, or the timeout fires upstream.
Safety / guardrails	Bounds on what the system can say and do — input and output checks, scope limits, a refusal path for what's out of bounds.	The adversarial input gets the answer it was fishing for, or the system acts confidently on a request it should have declined.
Governance	PII handled lawfully, the interaction logged, compliance provable — masking, retention rules, the audit trail.	Customer data leaks to a place it shouldn't, and when an auditor or a regulator asks what happened, you have no record to show.
Reliability	The system stays up under real conditions — rate limits respected, a fallback when the model is down, a rollback when a release is bad.	A provider blip or a bad deploy takes the feature down with no path back, and the outage is visible to every user at once.
Accountability	A human owns the outcome — a human-in-the-loop gate where it's customer-facing, and a trace of every run (principles 8, 9, 11).	The system does something wrong unattended, nobody is responsible, and there is no replay to reconstruct how it happened.

These are not a menu to pick from. A serious system answers all six, because the one you skip is the one that takes it down, and the one you skip is almost always the one a demo let you ignore. The rest of this subcategory covers the six in five pages: cost and latency together (they trade off against each other), guardrails and safety, PII and governance, human-in-the-loop and accountability, and deploying to production for the reliability mechanics of getting a release out and back. (Those siblings are landing alongside this page; named here, linked once they ship.)
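The arithmetic behind the Cost row is worth doing before launch rather than after the first invoice. The following is a minimal sketch, not a pricing tool: the prices, token counts, and the names `CallProfile`, `cost_per_call`, and `monthly_cost` are all assumptions made up for illustration, and real numbers should come from your provider's price list and your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    """Token counts for one typical call (illustrative values, not measurements)."""
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0  # portion of input_tokens served from a prompt cache


def cost_per_call(profile, input_price, output_price, cached_price):
    """Dollar cost of one call. All prices are dollars per million tokens."""
    fresh_input = profile.input_tokens - profile.cached_input_tokens
    total = (fresh_input * input_price
             + profile.cached_input_tokens * cached_price
             + profile.output_tokens * output_price)
    return total / 1_000_000


def monthly_cost(profile, calls_per_month, **prices):
    """Project the per-call cost out to a month of traffic."""
    return cost_per_call(profile, **prices) * calls_per_month


# Hypothetical prices, chosen only to make the arithmetic visible.
PRICES = dict(input_price=3.0, output_price=15.0, cached_price=0.3)

no_cache = CallProfile(input_tokens=4000, output_tokens=500)
with_cache = CallProfile(input_tokens=4000, output_tokens=500,
                         cached_input_tokens=3000)

if __name__ == "__main__":
    for label, profile in [("no cache", no_cache), ("with cache", with_cache)]:
        per_call = cost_per_call(profile, **PRICES)
        per_month = monthly_cost(profile, 1_000_000, **PRICES)
        print(f"{label}: ${per_call:.4f} per call, ${per_month:,.0f} per month")
```

Under these assumed numbers a call costs about two cents, which is invisible in a demo, and a million calls a month costs about $19,500 uncached or about $11,400 when three quarters of the input is served from cache. The point of the sketch is the shape of the calculation, not the figures.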

The spine: two surfaces, composed by where the system runs

Production readiness is not one set of tools. Like grounding, prompting, and evaluation before it, the surfaces compose — they are complementary instruments an engineer picks by where the system runs, never rival products to choose between (principle 7). The six dimensions above are constant; what changes is how much of each you get for free versus build yourself, and that is decided by where the system lives.

Agentforce and the Einstein Trust Layer — when the agent runs on Agentforce inside the Salesforce security model, several dimensions arrive as governance by construction. The Einstein Trust Layer sits between the agent and the model and covers a well-established set of controls: secure data retrieval that honors the running user's permissions, data masking so PII is not exposed to the model provider, zero data retention so prompts and responses are not kept or used to train an external model, toxicity scoring on generated output, and an audit trail of the interaction. Deployment rides the platform's release path, and human escalation — handing the conversation to a person — is a first-class move. You still own the dimensions the Trust Layer does not govern: it governs the language side, not what your Actions do, so the blast radius of an Action that can write or delete is still yours to bound (principle 5).
The off-platform stack — when the system runs off-platform, on the Claude API and your own infrastructure, every dimension is yours to build. You set the cost budget and the model tier, you reduce latency with model choice and streaming, you write the guardrails and the refusal paths, you mask PII and keep the audit log, you handle rate limits and build the fallback and rollback, and you wire the human gate and the trace. Nothing is free — but nothing is fixed either, which is exactly why off-platform is the right surface when the work needs a control flow, a model, or a policy the platform doesn't give you.
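Of the pieces the off-platform stack leaves to you, the rate-limit handling and the fallback are the easiest to sketch. The example below is one possible shape under stated assumptions: `RateLimited` and `ModelUnavailable` are hypothetical exception types standing in for whatever your client library raises, and `primary` and `fallback` are any callables that take a prompt and return text (a larger model and a smaller one, or a model and a canned response).

```python
import time


class RateLimited(Exception):
    """Hypothetical error raised when the provider throttles a call."""


class ModelUnavailable(Exception):
    """Hypothetical error raised on a provider outage or timeout."""


def call_with_fallback(primary, fallback, prompt, max_retries=3,
                       base_delay=0.5, sleep=time.sleep):
    """Try `primary`, backing off on rate limits; use `fallback` if it stays down."""
    for attempt in range(max_retries):
        try:
            return primary(prompt)
        except RateLimited:
            # Exponential backoff between attempts: base_delay, 2x, 4x, ...
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)
        except ModelUnavailable:
            # Retrying an outage only burns the latency budget; fall back now.
            break
    return fallback(prompt)
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. A production version would usually add jitter to the delay and honor any retry hint the provider returns; both are left out here to keep the control flow readable.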

You do not pick one and pledge loyalty. A real system frequently runs both: an Agentforce agent owning the governed, on-platform actions on customer data, and an off-platform agent owning a step the platform agent can't, with a clean handoff where accountability gets a seam (principle 9). The skill is composing the surfaces to the system you actually built, which is the same toolkit-composition logic that runs through the AI Engineering principles.
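Whichever surface owns an action, bounding its blast radius and giving accountability a seam can be sketched as a gate that every action passes through. This is an illustration under assumptions, not a platform feature: the action names in `IRREVERSIBLE`, the `ActionGate` class, and the `approve` hook are all hypothetical, and deciding which actions count as irreversible is a policy choice for your system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical action names; the real list is a policy decision.
IRREVERSIBLE = {"send_email", "issue_refund", "delete_record"}


@dataclass
class ActionGate:
    # Hypothetical hook: asks a human and returns True (approve) or False (decline).
    approve: Callable[[str, dict], bool]
    trace: list = field(default_factory=list)

    def run(self, name: str, action: Callable[..., Any], **kwargs) -> Any:
        """Run `action`, asking a human first when `name` is irreversible."""
        needs_human = name in IRREVERSIBLE
        approved = self.approve(name, kwargs) if needs_human else True
        # Record every attempt, including declined ones, so a run can be replayed.
        self.trace.append({"action": name, "args": kwargs,
                           "needs_human": needs_human, "approved": approved})
        if not approved:
            return None
        return action(**kwargs)
```

Reads pass straight through; writes, refunds, deletes, and sends wait for a person. The trace entry is written before the action runs, so a declined or failed action still leaves a record.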

Evaluation is the deployment gate

What decides whether a change is safe to ship is evaluation, and in a production system it is not a one-time check: it is the gate every release passes through. Offline eval proves a new version is at least as good as the last one before any user sees it; online eval watches live traffic for silent degradation after it ships. The production-readiness dimensions are what that gate checks: the eval set is where you prove the guardrail holds against the adversarial input, the cost stayed within budget, the latency is acceptable, and the irreversible action only fires when it should. A release without an eval gate is the vibe-check trap at production scale: you are shipping on the impression that nothing broke, with no measurement that says so (principle 3: if you can't evaluate it, you can't ship it).
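An offline gate of this kind can be as small as a function that compares a candidate's eval results against the last release and against fixed budgets. The sketch below assumes the eval run has already been aggregated into a report; `EvalReport`, its field names, `release_gate`, and the thresholds are illustrative choices, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Aggregate results from one offline eval run (field names are illustrative)."""
    pass_rate: float              # fraction of eval cases passed, 0.0 to 1.0
    adversarial_pass_rate: float  # fraction of adversarial cases correctly handled
    cost_per_call_usd: float
    p95_latency_s: float


def release_gate(candidate, baseline, max_cost_usd, max_p95_s, min_adversarial=1.0):
    """Return (ok, reasons). The release ships only when `reasons` is empty."""
    reasons = []
    if candidate.pass_rate < baseline.pass_rate:
        reasons.append("quality regressed against the last release")
    if candidate.adversarial_pass_rate < min_adversarial:
        reasons.append("a guardrail failed on the adversarial set")
    if candidate.cost_per_call_usd > max_cost_usd:
        reasons.append("cost per call is over budget")
    if candidate.p95_latency_s > max_p95_s:
        reasons.append("p95 latency is over the limit")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict matters in practice: a blocked release should say which dimension blocked it, so the fix starts from a measurement rather than an impression.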

Where to go next

From here, the subcategory builds outward from the map this page laid out. Cost and latency is the economics-and-speed pair: the model-tier decision, token budgets, caching, streaming. Guardrails and safety is the input and output bounds and the refusal path. PII and governance is the lawful handling, retention, and the audit trail. Human-in-the-loop and accountability is the gate and the trace that put a person on the hook for what the system did. And deploying to production is the release mechanics: rate limits, fallback, rollback, the path out and back. The bar a system clears before it ships is the Production Style Guide.

Production is where the rest of this catalog gets cashed out. An agent is only as good as the dimensions that keep it safe to run unattended; grounding, prompting, and evaluation all aim at a system you can actually ship and operate. The model was never the hard part. This is the part that decides whether it runs on Monday.

Related

AI Engineering principles — a demo is not a product (1), cost and latency are features (6), govern every tool's blast radius (5), a human is accountable (9), trace everything (11)
What is an agent — the system production readiness exists to make safe to run unattended
Agentforce agents — the Einstein Trust Layer in full, and the Action discipline the Trust Layer does not cover
What is evaluation — the deployment gate every production release passes through
Tracing and monitoring — the trace accountability depends on and online evaluation reads
What is grounding — a production answer is only as good as the facts under it
What is context engineering — the context budget that drives both cost and latency in production
Marketing Cloud AI Style Guide — the which-surface-for-which-job decision, one layer out
Production Style Guide — the pre-ship gate these dimensions become
Human-in-the-loop and accountability — the accountability dimension, in full
