AI Engineering: principles from production

Most AI projects die in the same place: the gap between a demo that wowed the room and a system that runs on Monday morning, on real data, at real volume, without a human babysitting it. The model was never the hard part. The hard part is the engineering around it — the retrieval, the tools, the evaluation, the guardrails, the monitoring, and the humans who have to trust the output enough to act on it.

These are the principles Cleon applies when we build an AI system — whether the agent runs on Agentforce inside Salesforce, or on LangGraph and the Claude API outside it. Those tools are complementary instruments we compose to fit the job, not rival camps to pick between. Each principle below is anchored in implementation work, not in keynote slides — a synthesis of where the official guidance is right, where the practitioner community has corrected it, and where our own production experience departs from both.

The throughline: an AI system is grounded, evaluated, governed, and shipped — or it is a demo. Everything here is in service of crossing that gap.

The principles

1. A demo is not a product — the gap is the engineering.

A great demo proves the model can do the thing once. Production asks whether it reliably does the thing every time, on inputs nobody scripted, cheaply enough to be worth running. Those are different questions, and the distance between them is where the actual work lives.

Treat the demo as the start of the project, not the end of it. The moment a stakeholder says "ship that," the real estimate begins — data prep, evals, error handling, monitoring, and the rollout to humans who have to change how they work.

2. Ground before you generate.

A clever prompt with no retrieval behind it is a confident guess. What makes an answer trustworthy is grounding it in your data: Agentforce retrievers over the Data 360 profile, or an external RAG pipeline over your documents. The model supplies fluency; grounding supplies the facts.
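
As a rough illustration of the "ground first" order of operations, the sketch below retrieves before it builds a prompt and declines to build one when nothing relevant comes back. The word-overlap retriever, the `min_overlap` threshold, and the function names are all assumptions made for the example; a real system would use an Agentforce retriever or a vector search in their place.

```python
from typing import Optional


def retrieve(question: str, documents: list[str], min_overlap: int = 2) -> list[str]:
    """Toy retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = []
    for doc in documents:
        overlap = len(q_words & set(doc.lower().split()))
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


def build_grounded_prompt(question: str, documents: list[str]) -> Optional[str]:
    """Return a prompt that carries its sources, or None if there is nothing to ground on."""
    passages = retrieve(question, documents)
    if not passages:
        return None  # no grounding found: do not generate, hand off instead
    sources = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only these sources.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base for the example.
DOCS = [
    "Returns are accepted within 30 days of delivery.",
    "Our office is closed on public holidays.",
]

grounded = build_grounded_prompt("Are returns accepted after 30 days?", DOCS)
ungrounded = build_grounded_prompt("Who won the match yesterday?", DOCS)
print(grounded)
print(ungrounded)
```

The design choice worth noting is the `None` branch: when retrieval finds nothing, the caller gets a signal to fall back or escalate rather than a prompt that invites a guess.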

3. If you can't evaluate it, you can't ship it.

"It looked good in three tries" is not a test — it's a vibe. An evaluation set is the difference between improving a system and guessing at it. Without one, every prompt change is a coin flip you can't score, and every regression ships silently.

Build the eval set from real failures as they happen. The first ten cases that bit you in testing are worth more than a hundred synthetic ones, because they encode the ways your problem actually breaks.
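
A minimal sketch of such an eval set, assuming a hypothetical `generate` callable that stands in for the system under test and a deliberately simple scoring rule (a required substring). The case names, prompts, and the stub system are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One recorded failure, turned into a repeatable check."""
    name: str
    prompt: str
    must_contain: str  # hypothetical scoring rule: a required substring


def run_evals(generate: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case and return the pass rate plus the names of failing cases."""
    failures = [
        case.name
        for case in cases
        if case.must_contain.lower() not in generate(case.prompt).lower()
    ]
    passed = len(cases) - len(failures)
    return {
        "pass_rate": passed / len(cases) if cases else 0.0,
        "failures": failures,
    }


# Hypothetical cases harvested from failures seen in testing.
CASES = [
    EvalCase("refund_window", "How long do I have to return an item?", "30 days"),
    EvalCase("no_invented_sku", "Do you stock the X-900?", "not in our catalog"),
]


def stub_system(prompt: str) -> str:
    """Stand-in for the real system; swap in the actual call."""
    if "return" in prompt:
        return "You have 30 days to return an item."
    return "Yes, the X-900 is in stock."  # the kind of regression the set should catch


report = run_evals(stub_system, CASES)
print(report)
```

Run against the stub, the report shows half the cases passing and names `no_invented_sku` as the failure, which is the point: a prompt change now produces a score and a list of what broke instead of an impression.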

4. The model is the easy part; the system is the job.

Swapping in a stronger model is an afternoon. Building the retrieval, the tools, the state management, the guardrails, the fallback paths, and the observability around it is the quarter. Budget accordingly — the model line item is the smallest one.

This is also why "wait for the next model" is rarely the answer. A better model makes a well-engineered system better; it does not rescue one that has no evals, no grounding, and no guardrails.

5. Every tool an agent can call is a blast radius — govern it.

The moment an agent can act — send a message, update a record, move money — the stakes change from "wrong answer" to "wrong action." Give each tool the narrowest scope that does the job, validate every argument before it runs, make it idempotent where you can, and put a kill switch on anything consequential.
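
The four controls named above can be sketched on a single tool. This is an illustration under assumptions, not a reference design: the refund tool, its `ORD-` id format, the 5,000-cent cap, and the in-memory idempotency store are all hypothetical.

```python
from dataclasses import dataclass, field


class ToolBlocked(Exception):
    """Raised when a call is refused before it can act."""


@dataclass
class GovernedRefundTool:
    """Hypothetical 'issue refund' tool with a deliberately narrow scope."""
    max_amount_cents: int = 5_000  # narrowest scope that does the job
    enabled: bool = True           # kill switch
    _seen: dict = field(default_factory=dict)  # idempotency key -> first result

    def call(self, *, order_id: str, amount_cents: int, idempotency_key: str) -> str:
        if not self.enabled:
            raise ToolBlocked("refund tool is switched off")
        # Validate every argument before anything runs.
        if not order_id.startswith("ORD-"):
            raise ToolBlocked(f"malformed order id: {order_id!r}")
        if not 0 < amount_cents <= self.max_amount_cents:
            raise ToolBlocked(f"amount {amount_cents} is outside the allowed range")
        # Idempotency: a retried call returns the first result and does not act twice.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        result = f"refunded {amount_cents} cents on {order_id}"
        self._seen[idempotency_key] = result
        return result


tool = GovernedRefundTool()
first = tool.call(order_id="ORD-1", amount_cents=1_200, idempotency_key="k1")
retry = tool.call(order_id="ORD-1", amount_cents=1_200, idempotency_key="k1")
print(first == retry)
```

In production the idempotency store would need to be durable and shared across workers; the in-memory dict only shows the shape of the check.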

6. Cost and latency are features.

An AI system that produces the right answer too slowly, or at a cost too high to run at the real audience size, has not solved the problem. Token budgets, step caps, caching, and the model-tier decision (a small model for the routine call, a large one only where it earns it) are design choices, not afterthoughts to bolt on when the bill arrives.

Model the cost against the real volume before you build, not after. A per-call cost that is trivial in a demo becomes the whole business case at a million calls a month.
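
The arithmetic is simple enough to run before any build starts. In the sketch below the token counts and per-million-token prices are placeholders chosen to make the numbers visible; they are not any vendor's actual pricing.

```python
def monthly_cost_usd(calls: int, input_tokens: int, output_tokens: int,
                     usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Projected monthly spend for one model call pattern."""
    token_cost = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return calls * token_cost / 1_000_000


# Placeholder call profile: 2,000 tokens in, 500 tokens out, hypothetical prices.
PROFILE = dict(input_tokens=2_000, output_tokens=500,
               usd_per_m_input=3.0, usd_per_m_output=15.0)

for calls in (100, 1_000_000):
    print(f"{calls:>9,} calls/month: ${monthly_cost_usd(calls, **PROFILE):,.2f}")
```

With these placeholder numbers a single call costs 1.35 cents. A hundred demo calls come to $1.35; a million calls a month come to $13,500. Rerunning the same function with a smaller model's prices for the routine calls is a quick way to see what the model-tier decision is worth.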

7. Compose the toolkit to the job.

Agentforce when the work lives in the Salesforce security model and needs governed, auditable actions on customer data. LangGraph and the Claude API when the work is off-platform, needs a custom control loop, or spans models and systems Salesforce does not reach. MCP when you want tools and data to interoperate across them. These are complementary instruments; the skill is choosing and combining them for the job, not pledging loyalty to one.

Choosing which surface fits which job (Agentforce, Einstein, or an external model inside Marketing Cloud) is its own framework; see the Marketing Cloud AI Style Guide.

8. Non-determinism needs a human gate where it's customer-facing.

The same input can produce a different output tomorrow. That is fine for a draft, dangerous for anything a customer sees unreviewed. Put a validation step and a fallback value on every generation, and a human approval gate on anything open-ended that ships to a person.

Never let a model failure render as a blank, an error string, or a hallucination in front of a customer. A boring deterministic fallback beats an exciting wrong answer every time.
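
One way to sketch the validation step and the fallback value is a wrapper around the generation call. The email-subject use case, the 60-character limit, the template-token check, and the fallback wording are assumptions for illustration; the human approval gate is a separate process step and is not shown.

```python
from typing import Callable

FALLBACK_SUBJECT = "An update on your order"  # boring, deterministic, safe


def is_valid_subject(text: str) -> bool:
    """Hypothetical checks: non-empty, short enough, no unrendered template tokens."""
    text = text.strip()
    return 0 < len(text) <= 60 and "{{" not in text


def safe_generate(generate: Callable[[], object]) -> str:
    """Return the generated subject if it passes validation, else the fallback."""
    try:
        candidate = generate()
    except Exception:
        return FALLBACK_SUBJECT  # a model or network failure never reaches the customer
    if isinstance(candidate, str) and is_valid_subject(candidate):
        return candidate.strip()
    return FALLBACK_SUBJECT


def failing_model() -> str:
    """Stand-in for a timeout or API error."""
    raise RuntimeError("model unavailable")


print(safe_generate(lambda: "Your order has shipped"))
print(safe_generate(lambda: ""))
print(safe_generate(lambda: "Hi {{first_name}}, news inside"))
print(safe_generate(failing_model))
```

Only the first call returns generated text; the blank, the unrendered template, and the raised error all resolve to the same fallback line.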

9. A human is accountable for what the AI does — always.

AI does not move accountability to the model. Someone owns each output: they can explain it, defend it, and answer for it when it is wrong. Design the system so that person exists, knows it is them, and has the controls (review, override, kill switch) to act on it.

"The model decided" is not an acceptable answer to "why did this happen," and building as if it were is how you get an incident with no owner.

10. Context is a budget, not a bucket.

Stuffing everything you have into the prompt does not make the model smarter — past a point it makes it worse, as the signal you need drowns in the context you added "just in case." Curate what the model sees: the right retrieved chunks, the relevant history, the instructions that matter for this step.

More context is a cost in tokens, latency, and reasoning quality. Spend it where it earns its place.
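
Treating context as a budget can be as plain as a greedy selection under a token cap. The sketch below assumes each chunk already carries a relevance score from retrieval and uses a crude word count in place of a real tokenizer; both are stand-ins, and the chunks are invented.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_context(chunks: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks that still fit inside the budget."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected


# Hypothetical retrieved chunks as (relevance score, text).
CHUNKS = [
    (0.9, "refund policy is thirty days"),
    (0.2, "company history founded long ago in a garage"),
    (0.7, "refunds go to original payment method"),
]

context = select_context(CHUNKS, budget_tokens=11)
print(context)
```

With a budget of 11 approximate tokens, the two refund chunks fit and the low-relevance company history is left out, even though it was available.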

11. Trace everything — you can't debug what you can't replay.

When an agent loops, calls the wrong tool, or quietly gets worse, the first question is "what actually happened," and you can only answer it if you captured the trace. Log the inputs, the retrieved context, the tool calls and their results, and the final output for every run. Observability is the precondition for fixing silent degradation, and silent degradation is the norm, not the exception, in systems built on a moving model.
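
A trace does not need special tooling to be useful. The sketch below captures the four things listed above as one JSON line per run, using only the standard library; the `RunTrace` class and its field names are assumptions for the example, not a standard schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RunTrace:
    """Everything needed to replay one agent run."""
    user_input: str
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    retrieved_context: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    final_output: Optional[str] = None

    def record_tool_call(self, name: str, arguments: dict, result: dict) -> None:
        """Append one tool call together with the result it returned."""
        self.tool_calls.append({"name": name, "arguments": arguments, "result": result})

    def to_json_line(self) -> str:
        """Serialize the run as a single line, ready to append to a log file."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical run, recorded step by step.
trace = RunTrace(user_input="Where is my order ORD-1?")
trace.retrieved_context.append("ORD-1 shipped on Tuesday.")
trace.record_tool_call("lookup_order", {"order_id": "ORD-1"}, {"status": "shipped"})
trace.final_output = "Your order shipped on Tuesday."

line = trace.to_json_line()
replayed = json.loads(line)
print(replayed["tool_calls"])
```

Because each run is one self-describing line, comparing last week's traces with this week's is a matter of reading two files, which is usually where a silent change in behavior first becomes visible.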

12. Start from the problem, not the model.

The interesting question is never "can the model do this?" It is "is this the cheapest way for a business to do this, once you add data prep, monitoring, and the humans who use it?" Plenty of things a model can do are not worth doing with one. Start from a problem worth solving and a value worth the cost; reach for AI when it is genuinely the best tool, not because it is the exciting one.

Closing

These are not rules to memorize; they are the muscle you build after the same AI projects die the same way at the same gap. They are a synthesis — the official guidance where it holds, the community's hard-won corrections where the docs are optimistic, and Cleon's production experience where both leave gaps.

If you spot a violation of any of them in our work, write to hello@wearecleon.com — we fix it and we say so.