Marketing Cloud AI gotchas: where Einstein, Agentforce, and external models bite

"AI in Marketing Cloud" is not one thing. It's three, and they fail in three different ways. Einstein features ship inside the platform — Engagement Scoring, Send Time Optimization, Content Selection, Copy Insights — and their failure mode is silent: a model trained on the wrong data scores confidently and wrong. Agentforce reaches in from the Salesforce platform side, grounded on Data Cloud, and its failure mode is the data model underneath it. External AI — an LLM you call yourself from a CloudPage or SSJS — fails the way any external API call fails, plus the ways a non-deterministic model fails on top.

Here are ten gotchas across the three surfaces, synthesized from Cleon's Marketing Cloud builds, Salesforce's official guidance, and the corrections the practitioner community learned the hard way. Each is paired with the question to answer before you ship and the cost of getting it wrong. The framing isn't "what's the right answer"; it's "what's the question you have to be ready to defend."

The gotchas

1. Einstein scores are only as good as the engagement history under them — a thin send history scores noise

Einstein Engagement Scoring and Send Time Optimization learn from your tenant's actual open, click, and conversion history. On a Business Unit with months of consistent sending, the scores mean something. On a new BU, a low-volume program, or a contact with three lifetime sends, the model is extrapolating from almost nothing — and it reports a score anyway, with no confidence flag in the UI.

The cost is a campaign optimized on noise that looks optimized on signal. Before you let Einstein drive send time or audience selection, answer: does this Business Unit have enough engagement history, for these contacts, for the score to mean anything? If not, the score is decoration.
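
One way to make that question concrete is a gate that runs before Einstein is allowed to drive a send. The sketch below is illustrative only: the thresholds (`MIN_SENDS_PER_CONTACT`, `MIN_COVERAGE`) and the `ContactHistory` shape are assumptions made up for this example, not Salesforce guidance or Einstein settings. Pick real numbers from your own tenant's data.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them against your own tenant's history.
MIN_SENDS_PER_CONTACT = 10   # fewer lifetime sends than this counts as "thin"
MIN_COVERAGE = 0.8           # share of the audience that needs enough history


@dataclass
class ContactHistory:
    contact_key: str
    lifetime_sends: int


def history_coverage(audience: list[ContactHistory]) -> float:
    """Fraction of the audience with at least MIN_SENDS_PER_CONTACT sends."""
    if not audience:
        return 0.0
    enough = sum(1 for c in audience if c.lifetime_sends >= MIN_SENDS_PER_CONTACT)
    return enough / len(audience)


def can_trust_scores(audience: list[ContactHistory]) -> bool:
    """True only when enough of the audience has a meaningful send history."""
    return history_coverage(audience) >= MIN_COVERAGE
```

If `can_trust_scores` returns False, the safer default is probably a fixed send time and a rule-based audience until the history accumulates.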

2. Send Time Optimization moves the send — make sure the journey can tolerate the spread

STO doesn't send at the time you scheduled; it sends at the time it predicts each contact will engage, which can spread a single send across a 24-hour window. That's the point. But it also means a time-sensitive send (a flash sale ending at midnight, an event reminder for a 9am session) can land after the moment has passed for the contacts STO decided would engage tomorrow morning.

The cost is a reminder that arrives after the event. The question: is this send's value stable across a 24-hour spread, or does it have a hard deadline that STO will happily miss?
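
The deadline question reduces to one comparison: does the latest possible delivery fall after the moment the message stops being useful? A minimal sketch, assuming you know the length of the window STO is allowed to use (`window_hours` here is a parameter of this example, not an STO API field):

```python
from datetime import datetime, timedelta


def sto_can_miss_deadline(send_start: datetime, window_hours: int, deadline: datetime) -> bool:
    """True if the latest possible STO delivery falls after the deadline."""
    latest_delivery = send_start + timedelta(hours=window_hours)
    return latest_delivery > deadline


# A flash sale ending at midnight, with contacts entering the send at 6pm.
start = datetime(2025, 3, 1, 18, 0)
sale_ends = datetime(2025, 3, 2, 0, 0)

print(sto_can_miss_deadline(start, 24, sale_ends))  # True: the window runs past midnight
print(sto_can_miss_deadline(start, 4, sale_ends))   # False: the window closes at 10pm
```

When the check comes back True, either shorten the window or take STO off that send.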

3. Content Selection and Copy Insights are suggestions, not approvals — a human still owns what ships

Einstein Content Selection picks the best-performing asset per contact; Copy Insights predicts subject-line performance. Both are genuinely useful and both are advisory. Treating either as an approval step — auto-shipping the AI's pick with no human read — is how a tone-deaf subject line or an off-brand image reaches a million inboxes with no one's name on it.

The cost is a brand mistake at send scale that nobody chose. The question to answer once, as policy: where in the workflow does a human approve what Einstein suggested, and is that gate enforced or merely encouraged?

4. Agentforce is only as good as the Data Cloud model under it — a fragmented model grounds a confidently wrong agent

An Agentforce agent answering questions about a customer reads the unified profile in Data Cloud. If identity doesn't resolve, relationships aren't modeled, and objects aren't documented, the agent inherits the same fragmented mess a human analyst would — and unlike the analyst, it answers confidently anyway. "Agent-ready" is not a state you reach by turning Agentforce on; it's the state the data model is already in, or isn't.

The cost is an agent that's confidently wrong, which is worse than one that says it doesn't know. The honest question, before pointing Agentforce at marketing data: can a human analyst get a coherent, complete answer about a customer from the unified profile today? If not, neither can the agent. (See the Data Cloud architecture gotchas.)

5. An agent that can act needs the same guardrails as an automation that can send

The moment an agent can do more than answer — trigger a journey, update a record, send a message — it's an automation with a non-deterministic trigger. Every guardrail you'd put on a Send Definition or an Automation applies: who approved the action, what's the blast radius, where's the audit trail, how do you stop it. The novelty of the interface doesn't suspend the discipline.

The cost is an agent taking an action at scale that no human signed off on. The question: for every action the agent can take, what's the approval gate, the rate limit, and the kill switch — the same three you'd demand of any automation that touches a customer?
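
Those three controls can be sketched as a single wrapper that every agent action has to pass through. This is not an Agentforce API; `ActionGuard` and its method names are hypothetical, and a real deployment would enforce the same checks in whatever layer actually executes the action.

```python
import time
from typing import Callable, Optional


class ActionGuard:
    """Wraps an agent action with an approval gate, a rate limit, and a kill switch."""

    def __init__(self, max_per_minute: int, approved_actions: set[str]):
        self.max_per_minute = max_per_minute
        self.approved_actions = approved_actions
        self.killed = False
        self._recent: list[float] = []                      # timestamps of allowed actions
        self.audit_log: list[tuple[float, str, str]] = []   # (time, action, verdict)

    def kill(self) -> None:
        """Kill switch: block every action until a human resets the guard."""
        self.killed = True

    def run(self, name: str, action: Callable[[], None], now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Keep only the actions that ran in the last 60 seconds.
        self._recent = [t for t in self._recent if now - t < 60]

        if self.killed:
            verdict = "blocked: kill switch"
        elif name not in self.approved_actions:
            verdict = "blocked: not approved"
        elif len(self._recent) >= self.max_per_minute:
            verdict = "blocked: rate limit"
        else:
            verdict = "allowed"

        # Every attempt is logged, including the blocked ones.
        self.audit_log.append((now, name, verdict))
        if verdict != "allowed":
            return False
        self._recent.append(now)
        action()
        return True
```

The design choice worth copying is that blocked attempts land in the audit log too; an agent that keeps trying an unapproved action is something a human should hear about.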

6. Calling an external LLM from a CloudPage puts a third party in your send path — and maybe your data in their logs

When you call an external model from SSJS — to generate copy, classify a reply, summarize a record — you've added a network dependency to a surface that may run at send time, and you may be sending customer data to a third party that logs it. Both matter. The latency one breaks pages; the data one is a compliance exposure that a DPA either covers or doesn't.
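
On the data side, one defensive pattern is to build the outbound payload from an explicit allow-list, so a field reaches the provider only because someone decided it should. A minimal sketch; the field names in `ALLOWED_FIELDS` are hypothetical and stand in for whatever your data agreement actually covers.

```python
# Hypothetical allow-list: only fields your data agreement covers belong here.
ALLOWED_FIELDS = {"loyalty_tier", "preferred_language", "last_purchase_category"}


def build_model_payload(record: dict) -> dict:
    """Keep only allow-listed fields; everything else never leaves the platform."""
    return {key: value for key, value in record.items() if key in ALLOWED_FIELDS}


record = {
    "email": "pat@example.com",
    "phone": "555-0100",
    "loyalty_tier": "gold",
    "preferred_language": "en",
}
print(build_model_payload(record))
# {'loyalty_tier': 'gold', 'preferred_language': 'en'}
```

An allow-list fails closed: a new column added to the data extension stays out of the provider's logs until someone adds it on purpose.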

7. An external model call at render time is a single point of failure for the whole page

An HTTP request made from SSJS (Script.Util.HttpRequest, or the HTTP.Get and HTTP.Post helpers) is synchronous and it can time out. If a CloudPage or an email's render-time logic calls an external model and the provider is slow or down, the page hangs or errors for the visitor: the model's latency is now your page's latency, and the model's outage is your outage. Render time is the worst place to discover a provider's p99.

The cost is a broken page tied to someone else's uptime. The question: does this call happen at render time (where every failure is visible to the user) or ahead of time in an Automation (where a failure is yours to catch and retry)? Default to ahead-of-time. (See calling external AI from CloudPages.)
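
The ahead-of-time version might look like the sketch below: a batch job retries the model a few times with backoff and, if every attempt fails, stores a safe default instead of leaving the slot empty. `call_model` is a placeholder for whatever provider client you use, and the attempt count and backoff values are assumptions, not recommendations.

```python
import time
from typing import Callable


def generate_with_fallback(
    call_model: Callable[[str], str],
    prompt: str,
    fallback: str,
    attempts: int = 3,
    backoff_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try the model a few times; return the fallback if every attempt fails."""
    for attempt in range(attempts):
        try:
            return call_model(prompt)
        except Exception:
            # Wait 1s, 2s, 4s, ... between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(backoff_seconds * (2 ** attempt))
    return fallback
```

Run from a scheduled job, this writes either generated copy or the fallback into the row the page reads, so the render path never makes a network call.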

8. A non-deterministic model in a deterministic send is a QA problem you can't sample your way out of

Email QA assumes the same input produces the same output: you preview, it looks right, you send. An LLM breaks that assumption — the same prompt can produce different copy on two calls, and the one bad output is the one a sample of three previews won't catch. Generated copy that's wrong, off-brand, or unsafe doesn't announce itself in a spot check.

The cost is a generated message you never previewed reaching a real contact. The question: is the model's output constrained and validated before it ships — a fixed set of options, a human approval, a content filter — or are you trusting a sample to represent a distribution?
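
A validator that runs on every generated output, rather than on a sample, is the cheapest of those constraints to build. The sketch below checks a generated subject line against a length limit and a banned-phrase list and falls back to approved copy when it fails; the limit and the phrases are hypothetical placeholders for your own brand and compliance rules.

```python
MAX_SUBJECT_LENGTH = 60                        # hypothetical brand limit
BANNED_PHRASES = ("guaranteed", "risk-free")   # hypothetical compliance list


def subject_problems(subject: str) -> list[str]:
    """Return every rule the subject line breaks; an empty list means it passes."""
    problems = []
    text = subject.strip()
    if not text:
        problems.append("empty")
    if len(text) > MAX_SUBJECT_LENGTH:
        problems.append("too long")
    lowered = text.lower()
    for phrase in BANNED_PHRASES:
        if phrase in lowered:
            problems.append(f"banned phrase: {phrase}")
    return problems


def safe_subject(generated: str, fallback: str) -> str:
    """Ship the generated line only if it passes every check."""
    return fallback if subject_problems(generated) else generated.strip()
```

A filter like this catches only the failures you thought to list; it narrows the distribution but does not replace a human approval for anything high-stakes.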

9. The model's cost scales with your send, and there's no built-in throttle

An external model call billed per token, fired once per contact in a million-send journey, is a million calls — and the bill, and the rate limit, scale with the audience. Marketing Cloud won't warn you; it'll happily fire the SSJS a million times. The provider's rate limit will start returning errors partway through, leaving a send half-personalized.

The cost is a runaway bill or a half-failed send, discovered after the fact. The question: are the per-contact cost and the rate limit modeled against the actual audience size, with a throttle and a fallback for when the limit hits? (Generate ahead of time, cache the result, and the per-send cost goes to zero.)
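
The modeling is simple arithmetic, which is a reason to do it before the send rather than after. The sketch below estimates spend and minimum run time for a given audience, then shows the caching idea: one model call per distinct prompt instead of one per contact. All inputs (tokens per call, price, requests per minute) are placeholders you would replace with your provider's actual figures, and `call_model` is a stand-in for a real client.

```python
from typing import Callable


def estimate_run(
    audience_size: int,
    tokens_per_call: int,
    price_per_1k_tokens: float,
    requests_per_minute: int,
) -> tuple[float, float]:
    """Return (total cost, minimum minutes to finish at the provider's rate limit)."""
    total_tokens = audience_size * tokens_per_call
    cost = total_tokens / 1000 * price_per_1k_tokens
    minutes = audience_size / requests_per_minute
    return cost, minutes


def cached_generate(call_model: Callable[[str], str], prompts: list[str]) -> list[str]:
    """Call the model once per distinct prompt and reuse the result for repeats."""
    cache: dict[str, str] = {}
    for prompt in prompts:
        if prompt not in cache:
            cache[prompt] = call_model(prompt)
    return [cache[prompt] for prompt in prompts]
```

With made-up inputs of a million contacts, 500 tokens per call, a price of 0.002 per thousand tokens, and 3,000 requests per minute, `estimate_run` gives a spend of about 1,000 and a run of roughly five and a half hours. If the copy varies by segment rather than by contact, `cached_generate` cuts the call count to the number of segments.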

10. "The AI did it" is not an answer a regulator or a client accepts

Whether it's Einstein selecting an audience, Agentforce taking an action, or an external model writing copy, the accountability doesn't move to the model. A discriminatory audience, a wrong claim in generated copy, an action taken on the wrong customer — the consultancy and the client own it, the same as any other production decision. The model is a tool the team is responsible for, not a party that shares the blame.

The cost is reputational and sometimes legal, and it lands on the human team regardless of which surface produced it. The question, before any AI surface touches a customer: who is accountable for what it produces, and can they explain and defend it? If the answer is "the AI," the answer is missing.

The throughline across all ten: AI in Marketing Cloud doesn't suspend the disciplines the platform already taught. Einstein scores need the same skepticism as any model — garbage history, garbage score. Agentforce needs the same clean data model as any analyst, plus the action guardrails of any automation. External AI needs the same treatment as any external dependency — latency budget, failure handling, a data agreement — plus the QA discipline a non-deterministic output demands. The novelty is in the capability, not in the engineering judgment.

Closing

These ten are the AI-in-Marketing-Cloud failures Cleon has seen bite hardest, or seen coming. The shared theme is the one the platform always teaches: the easy path makes the demo work, and the durable path makes it survive a million sends. Trusting a thin-history Einstein score, auto-shipping a generated subject line, calling a model at render time, firing a per-token call once per contact with no throttle: none is hard in the moment, and each is a post-mortem once it's live.

The discipline that prevents most of them is the same one that prevents the SQL and Config versions: a human who's accountable for what the AI produces, the data model checked before the agent reads it, and the external call treated like the external dependency it is.

If an AI-in-Marketing-Cloud gotcha bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.
