Grounding gotchas: how RAG fails in production

A RAG demo retrieves the one document you tested, ranks it first, and answers the one question you asked. Production retrieves from everything you have — the whole corpus, stale and clean and contradictory all at once — on questions nobody scripted, and the wrong chunk does not announce itself. It comes back ranked first and the model answers from it fluently. A grounding failure does not look like an error; it looks like a confident, well-written, wrong answer, and someone acts on it.

Here are ten gotchas that turn a working retrieval demo into a system that confidently misinforms. They are the same shape as the ones that kill agents: the demo makes the easy path look like the whole path, and production is everything the easy path left out. Each is paired with the cost of getting it wrong and the question to answer before you ship. None of this is toolkit-specific: grounding built on Agentforce retrievers over the Data 360 profile, or an external RAG pipeline over your own documents, inherits every one of these. The retrievers and the external pipeline are complementary instruments you compose to fit the corpus and the security model, not rival camps; the failure modes are shared either way.

The gotchas

1. Bad chunking — the boundary itself can destroy the meaning

Retrieval works on chunks, and how you cut the documents decides what can ever be found. Chunks too big dilute the signal — the one relevant sentence sits buried in three pages of context, and the embedding averages it into noise. Chunks too small lose the context that made the sentence mean anything — a number with no unit, a clause with no subject. And the boundary itself is a third failure: a split through the middle of a table, a procedure, or a definition can leave both halves useless.

The cost of getting it wrong is a corpus that contains the answer but cannot surface it — the fact is in there, sliced so it never ranks. The question to answer before you index: does each chunk carry enough context to stand on its own, and do your boundaries fall on natural seams — sections, paragraphs, complete rows — rather than a fixed character count that cuts through meaning?
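
One way to cut on natural seams rather than a fixed character count is to pack whole paragraphs into a chunk until a size budget is reached. The sketch below assumes paragraphs are separated by blank lines; the function name and the 800-character default are illustrative, not taken from any particular toolkit.

```python
def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars where possible.

    Assumption: a blank line marks a paragraph boundary. A paragraph is
    never split, so a single oversized paragraph becomes its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either it fits, or the chunk is still empty and this one
            # paragraph is oversized: keep it whole rather than cut it.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

This only respects paragraph seams. Tables, numbered procedures, and definitions need a splitter that understands the source format, and a chunk may still need its section title prepended to stand on its own.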

2. Embedding mismatch — the wrong model, or an inconsistent one, misses silently

Semantic search is only as good as the embeddings under it. A general-purpose embedding model on a domain it does not know — dense product codes, clinical terms, internal jargon — maps text it cannot distinguish to nearby vectors, and the search returns near-misses that read plausible. Worse is inconsistency: embedding the documents with one model and the queries with another, or changing the model and not re-embedding the corpus, so query vectors and document vectors no longer live in the same space.

The cost is a search that fails without failing — it returns results, they are just the wrong ones, and nothing flags it because the cosine distance is happily small. The question: is your embedding model a fit for this domain's vocabulary, and is every vector in the index — documents and queries alike — produced by the same model and version, so a query and the chunk that answers it can actually land near each other?
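
The consistency half of that question can be enforced mechanically: record which model built the index, and refuse to search when the query embedder differs. A minimal sketch, assuming the index stores this metadata somewhere you can read it; `EmbeddingSpec` and `assert_same_space` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """What produced a vector. Stored with the index and with the query client."""
    model: str
    version: str
    dimensions: int


class EmbeddingMismatchError(RuntimeError):
    pass


def assert_same_space(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Fail loudly instead of returning plausible-looking wrong results."""
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index built with {index_spec}, query uses {query_spec}; "
            "re-embed the corpus or pin the query model"
        )
```

Calling this once per search turns a silent quality failure into an exception. It says nothing about domain fit, which still has to be measured against real queries.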

3. Stale index — the agent grounds on yesterday's truth

The source of truth changes; the index does not change with it unless something makes it. A price updates, a policy is rewritten, a record is corrected — and the vector index still holds the embedding of the old text until a re-index runs. Until then, retrieval surfaces the stale chunk, the model grounds on it, and the answer is confidently out of date with the system everyone else is reading.

The cost is an answer that is correct against the index and wrong against reality — the hardest kind to catch, because the retrieval looks healthy. The question: what triggers a re-index when a source changes, what is the maximum staleness you can tolerate per source, and how would you even notice the index had drifted from the truth?
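
Noticing drift usually comes down to comparing two timestamps per document against a tolerance per source. The sketch below assumes you can read both the source's last-modified time and the time the embedding was written; the source names and tolerances are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    source: str
    source_updated_at: datetime  # last change in the system of record
    indexed_at: datetime         # when the current embedding was written


# Hypothetical tolerances; the real numbers are a business decision.
MAX_STALENESS = {
    "pricing": timedelta(minutes=15),
    "policy": timedelta(hours=24),
}
DEFAULT_STALENESS = timedelta(hours=6)


def overdue_for_reindex(docs: list[IndexedDoc], now: datetime) -> list[str]:
    """Return ids whose source changed after indexing, longer ago than allowed."""
    overdue = []
    for doc in docs:
        if doc.source_updated_at <= doc.indexed_at:
            continue  # the index already reflects the latest source text
        tolerance = MAX_STALENESS.get(doc.source, DEFAULT_STALENESS)
        if now - doc.source_updated_at > tolerance:
            overdue.append(doc.doc_id)
    return overdue
```

Run on a schedule, a non-empty result is the alert that the index has drifted past what you agreed to tolerate.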

4. Empty or wrong retrieval, answered anyway — the model fills the gap from parameters

Retrieval can come back empty, or come back with chunks that have nothing to do with the question, and a model with no instruction to stop will not stop. It answers from its training instead: fluently, confidently, and with no signal that it just made the whole thing up. This is principle 2 (ground before you generate) failing at the seam, and it is the same failure the agent gotchas call out from the agent side: a clever prompt over no grounding is a confident hallucination, and it sounds exactly like a correct answer.

The cost is the worst output a grounded system can produce — an ungrounded answer wearing the costume of a grounded one, indistinguishable from a real citation until someone checks. The question: what does the system do when retrieval returns nothing or returns low-relevance chunks — does it abstain, ask, or escalate — and have you actually tested the empty-retrieval path, or only the happy one where the right document was always there?
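
The abstain branch can live in code, ahead of the model, rather than in a prompt the model may ignore. A minimal sketch, assuming the retriever returns a similarity score where higher is better; the 0.75 threshold, the `Hit` type, and the `generate` callable are placeholders to tune and replace.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float  # similarity; assumption: higher means more relevant


def answer_or_abstain(
    question: str,
    hits: list[Hit],
    generate: Callable[[str, list[Hit]], str],
    min_score: float = 0.75,
    top_k: int = 5,
) -> dict:
    """Only call the model when at least one chunk clears the relevance bar."""
    relevant = sorted(
        (h for h in hits if h.score >= min_score),
        key=lambda h: h.score,
        reverse=True,
    )[:top_k]
    if not relevant:
        # Empty or low-relevance retrieval: the model is never invoked.
        return {"status": "abstained", "answer": None, "sources": []}
    return {
        "status": "answered",
        "answer": generate(question, relevant),
        "sources": [h.chunk_id for h in relevant],
    }
```

Because the gate sits before generation, the empty-retrieval path is a plain unit test: pass no hits and assert that `generate` was never called.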

5. No retrieval eval — you score the answer but never the retrieval

It is natural to evaluate the final answer and stop there, but the answer is two systems compounded: retrieval that finds chunks, and generation that reasons over them. Score only the end and you cannot tell a generation problem from a retrieval problem — a good answer might have come from the wrong chunks by luck, and a bad one might be a good model starved of the right context. You can't fix what you don't measure (principle 3), and you are not measuring the half that fails most.

The cost is debugging blind: you tune the prompt for a week to fix what was a chunking bug all along, because nothing told you the right chunk never made it into the context. The question: do you measure retrieval on its own — did the chunk that contains the answer actually get retrieved, and at what rank — separately from whether the final answer was good, so you know which half to fix when quality drops?
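
Two standard numbers answer "did it get retrieved, and at what rank": hit rate at k and mean reciprocal rank. The sketch below assumes a small labeled set where each question has exactly one gold chunk id; the data shapes are illustrative.

```python
def hit_rate_at_k(results: dict[str, list[str]], gold: dict[str, str], k: int) -> float:
    """Fraction of questions whose gold chunk appears in the top k results."""
    if not results:
        raise ValueError("no questions to evaluate")
    hits = sum(1 for q, ranked in results.items() if gold[q] in ranked[:k])
    return hits / len(results)


def mean_reciprocal_rank(results: dict[str, list[str]], gold: dict[str, str]) -> float:
    """Average of 1/rank of the gold chunk; a miss contributes 0."""
    if not results:
        raise ValueError("no questions to evaluate")
    total = 0.0
    for q, ranked in results.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(results)
```

Here `results` maps each question to the ranked chunk ids the retriever returned, and `gold` maps it to the chunk that contains the answer. When answer quality drops and these numbers hold steady, the problem is in generation; when they fall, it is in retrieval.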

6. Over-retrieval — stuffing the top fifty chunks degrades reasoning

When retrieval feels unreliable, the tempting fix is to retrieve more — top-fifty instead of top-five, "just in case the answer is in there somewhere." It backfires. More context is not better context: the relevant chunk now competes with forty-nine distractors for the model's attention, reasoning quality drops as the signal drowns, latency climbs, and every call costs more for a worse result. This is principle 10 (context is a budget, not a bucket) on the retrieval side — context rot is a grounding bug as much as an agent one.

The cost is a system that gets worse the more you feed it, for reasons that never show up as an error — just a slow, confusing decline in answer quality that correlates with the size of the context you so carefully grew. The question: what is the smallest number of chunks that reliably contains the answer for your corpus, and are you retrieving to that — tuned against your retrieval eval — rather than padding the context and hoping?
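
"The smallest number of chunks that reliably contains the answer" can be read straight off a labeled retrieval set. A sketch under the same kind of assumption as any retrieval eval: each question has one gold chunk id, and the 95% target is a placeholder, not a recommendation.

```python
from typing import Optional


def smallest_sufficient_k(
    results: dict[str, list[str]],
    gold: dict[str, str],
    target: float = 0.95,
    max_k: int = 50,
) -> Optional[int]:
    """Smallest k whose top-k results contain the gold chunk for at least
    `target` of the questions, or None if no k up to max_k gets there."""
    if not results:
        raise ValueError("no questions to evaluate")
    for k in range(1, max_k + 1):
        hits = sum(1 for q, ranked in results.items() if gold[q] in ranked[:k])
        if hits / len(results) >= target:
            return k
    return None
```

A `None` here is informative: if fifty chunks still miss the target, the fix is in chunking or embeddings, not in a bigger context.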

7. No "I don't know" path — the system can't say it's not in the knowledge base

A knowledge base has edges. Some questions fall outside it, and a grounded system has to be able to say so — "that is not in the knowledge base" — instead of reaching past the edge and inventing an answer that sounds like it came from inside. If "I don't have that" is not a path the system is allowed to take, then every out-of-scope question becomes a hallucination, because the only behavior you left it is to answer.

The cost is a system that is most confident exactly where it knows least — at the edges of the corpus, where it has nothing to ground on and answers anyway. The question: can this system decline — return "not in the knowledge base," route to a human, ask a clarifying question — and is declining a first-class, tested outcome rather than an accident the model falls into only when retrieval happens to be obviously empty?
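
Making declining first-class mostly means giving it a name in the output type and a test that counts it. A minimal sketch, assuming the system under test can be called with a question and returns one of a fixed set of outcomes; `Outcome` and `decline_rate` are hypothetical names.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    ANSWERED = "answered"
    NOT_IN_KB = "not_in_knowledge_base"
    CLARIFY = "asked_clarifying_question"
    ESCALATED = "routed_to_human"


# Every outcome other than ANSWERED counts as a deliberate decline.
DECLINES = {Outcome.NOT_IN_KB, Outcome.CLARIFY, Outcome.ESCALATED}


def decline_rate(
    system: Callable[[str], Outcome],
    out_of_scope: Iterable[str],
) -> float:
    """Share of known out-of-scope questions the system declined to answer."""
    questions = list(out_of_scope)
    if not questions:
        raise ValueError("need at least one out-of-scope question")
    declined = sum(1 for q in questions if system(q) in DECLINES)
    return declined / len(questions)
```

Fed a list of questions you know are outside the corpus, anything below 1.0 is a count of hallucinations waiting to happen.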

8. Permission leakage in retrieval — the index returns chunks the user can't see

A vector index flattens your corpus into one searchable space, and unless access control rides along, it flattens the permissions with it. If the retriever can return any chunk regardless of who is asking, then a user who could never open a document through the front door can pull its contents out through the agent — semantic search as a confused deputy, reading restricted material aloud on their behalf.

The cost is a data-exposure incident wearing the face of a helpful answer — the agent did its job and surfaced exactly the chunk it was asked for, to someone who was never allowed to read it. The question: does retrieval filter by the requesting user's permissions before ranking, and can you prove a restricted chunk is unreachable through the agent for a user who lacks access to its source?
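
Filtering before ranking looks like this in miniature. The sketch assumes each chunk carries the set of groups allowed to read its source document, copied at index time; the `Chunk` type and group names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    allowed_groups: frozenset[str]  # assumption: copied from the source ACL
    score: float                    # similarity to the query, higher is better


def retrieve_for_user(
    candidates: list[Chunk],
    user_groups: frozenset[str],
    top_k: int = 5,
) -> list[Chunk]:
    """Drop everything the user cannot read, then rank and truncate."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]
```

In a real store the same filter belongs inside the index query, so restricted chunks never leave the store at all. The proof the question asks for is a test: a user without the group, a query that matches the restricted chunk best, and an assertion that it does not come back.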

9. Re-index cost and latency ignored — "re-embed everything on every edit" doesn't scale

Embedding is not free and re-indexing is not instant. Embedding a large corpus is a real bill in API or compute, and re-embedding it is a real delay before the new content is searchable — minutes to hours on a corpus of any size. A design that re-embeds everything on every edit works on the demo's hundred documents and falls over on the production million, where it becomes either a cost line nobody approved or a freshness lag nobody planned for.

The cost is a grounding system that is correct and unaffordable, or affordable and stale — the trade-off you did not design becomes the trade-off you are stuck with. The question: what does a full re-index cost and how long does it take at your real corpus size, and can you update incrementally — re-embedding only what changed — rather than rebuilding the world every time one document moves?
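
Re-embedding only what changed needs one extra field per chunk: a hash of the text that was embedded. A minimal sketch, assuming chunk ids are stable across edits and the stored hashes can be read back from the index.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the exact text that was (or will be) embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_chunks: dict[str, str],   # chunk_id -> current text
    indexed_hashes: dict[str, str],   # chunk_id -> hash stored at index time
) -> tuple[list[str], list[str]]:
    """Return (ids to embed, ids to delete) instead of rebuilding everything."""
    to_embed = [
        cid for cid, text in current_chunks.items()
        if indexed_hashes.get(cid) != content_hash(text)
    ]
    to_delete = [cid for cid in indexed_hashes if cid not in current_chunks]
    return to_embed, to_delete
```

The length of `to_embed` is also the cost estimate for the run. Changing the embedding model is the one case this does not cover: that still forces a full re-embed.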

10. Missing or hallucinated citations — the answer can't show its sources, or cites one it didn't use

The point of grounding is a traceable answer: this claim came from that chunk, and you can follow the link and check. Two failures break that trust. The answer shows no sources at all, so it is indistinguishable from an ungrounded guess and impossible to verify. Or — subtler and worse — it cites a source it did not actually use, a plausible-looking reference stapled on after the fact, so the citation itself is a hallucination and following it disproves the claim it was meant to support. Without traceable citations you cannot trust the answer or debug it (principle 11); you cannot replay what you cannot see.

The cost is grounding you cannot audit — an answer that may be right or may be invented, with no way to tell which from the outside and no trace to replay when it is wrong. The question: does every grounded claim carry a citation back to the chunk it actually came from, and have you verified the citations are real — that the cited chunk genuinely supports the claim — rather than trusting a model that will fabricate a reference as readily as a fact?
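
Part of citation checking is cheap and mechanical: was the cited chunk actually in the context, and does the text the answer attributes to it appear there? The sketch below assumes the model is asked to return a supporting quote with each citation. It is a lexical check only and cannot prove the chunk supports the claim.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and case so line wraps do not cause false alarms."""
    return re.sub(r"\s+", " ", text).strip().lower()


def audit_citations(
    citations: list[tuple[str, str]],  # (chunk_id, quoted supporting text)
    retrieved: dict[str, str],         # chunk_id -> text that was in context
) -> list[tuple[str, str]]:
    """Return (chunk_id, problem) for every citation that fails a basic check."""
    problems = []
    for chunk_id, quote in citations:
        if chunk_id not in retrieved:
            problems.append((chunk_id, "cited chunk was never in the context"))
        elif not quote.strip():
            problems.append((chunk_id, "no supporting quote given"))
        elif _normalize(quote) not in _normalize(retrieved[chunk_id]):
            problems.append((chunk_id, "quoted text not found in cited chunk"))
    return problems
```

An empty result means the citations are at least real. Whether the quoted text genuinely supports the claim still needs a human spot check or a separate entailment step.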

The throughline across all ten: grounding that ships is chunked to be findable, embedded consistently, kept fresh, evaluated on retrieval and not just the answer, budgeted in context, permission-aware, affordable to refresh, and traceable to its sources — or it is a demo that retrieved the one document you tested and has not yet met a question you didn't write. Every gotcha above is a place where the demo's single happy retrieval and the production reality of retrieving-from-everything diverge, and the wrong chunk is always waiting in the corpus you have not tested against.

Closing

These ten are the grounding failures Cleon has seen most often, across both Agentforce retrievers and external RAG. The discipline that prevents them is the same one that runs through all of AI engineering: the demo retrieves the one right document and the model reads it, while production has to surface the right chunk from everything you own and ground the answer on a clean model. That takes two things: a retrieval-quality bar and a model worth grounding on. Get the retrieval-quality bar right and most of these never fire; skip it and you ship a system that misinforms with confidence. None of them is hard to prevent up front; all of them are expensive to discover live, because a grounding bug looks exactly like a correct answer until someone checks the source.

If you want the grounding mechanics in full: "What is grounding" is the vocabulary, "Chunking and embeddings" is the retrieval pipeline part by part, and "Retrieval quality" is the bar a retrieval system clears before it ships. The model the agent grounds on matters as much as the retrieval over it; the "Data 360 agent-readiness check" is where a clean model meets a safe agent.

If a grounding gotcha bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.

What is grounding — the definition before the gotchas
Chunking and embeddings — the retrieval pipeline that gotchas 1, 2, and 9 live in
Retrieval quality — the bar a retrieval system clears before it ships
Agent gotchas — the agent-side failures, including grounding gaps from the agent's view
Agentforce agents — grounding through Data 360 inside the security model
Data 360 agent-readiness check — the clean model the retrieval grounds on
Grounding Style Guide — the bar a grounded system clears before it ships
Debugging grounding — tracing a bad answer to its retrieval
AI Engineering principles — ground before you generate (2), evals (3), context is a budget (10), trace everything (11)
