Debugging grounding: tracing a bad answer to its retrieval
A grounded answer came back wrong and you have to fix it. The move that makes that possible is the one most teams skip: pull the chunks retrieval actually returned before you touch the prompt, because you cannot fix what you cannot see. This is the symptom-driven playbook for five ways grounding goes wrong (confidently wrong, recall miss, ranked-low, stale, wrong-user chunk), each with what to inspect, the fix, and the retrieval eval case that stops it coming back.
A grounded answer came back wrong and someone is waiting on the fix. The instinct is to open the prompt and start rewording — and it is the wrong instinct, because the answer is two systems compounded and you are about to tune the half that probably didn't fail. The move that changes everything is the one most teams skip under pressure: when a grounded answer is wrong, inspect what retrieval actually returned before you touch the prompt. You cannot fix what you cannot see (principle 2). Pull the exact chunks the model was handed for the failing run and the bug almost always names itself — the right chunk was missing, or buried, or stale, or never belonged to this user. Without that, you are rewriting a prompt against a story you made up about what the model read.
This page is the operational companion to grounding gotchas and retrieval quality: the gotchas name how grounding fails and the quality page is the bar it clears, and this is the "now fix it" loop you run when a specific answer is wrong in production. It is symptom-driven: find the failure that matches what you're seeing, inspect retrieval at the layer the section points to, apply the fix, and — the step that separates debugging from whack-a-mole — turn the failing query into a retrieval eval case so the regression can't return. None of this is toolkit-specific: grounding on Agentforce retrievers over the Data 360 profile, or an external RAG pipeline over your own documents, fails in the same five shapes, and the retrieved chunks are where you catch all of them.
The loop: inspect retrieval → reproduce → eval-driven fix
Every debug session below runs the same three steps, in this order:
- Inspect retrieval. Pull the exact chunks returned for the failing run — the text, the rank each landed at, the source each came from, and the query as retrieval actually saw it. If you can't, you have an observability gap to fix before anything else; grounding you can't inspect is grounding you debug by superstition (principle 11). The single question that splits every failure below: was the right chunk even in the set, and where?
- Reproduce. Run the same query against retrieval until you can make the failure happen on demand. A retrieval miss you can't reproduce is one you can't confirm you fixed — and a fix you can't reproduce passing is a guess wearing a green check.
- Eval-driven fix. Add the failing query to the retrieval eval set before you change anything, as a pair (query → the chunk that should come back). Watch it fail, apply the fix, watch the right chunk return at a usable rank, and run the whole set to confirm you didn't break a neighbor. The eval case is what makes the fix permanent; without it, the same miss walks back in on the next re-chunk, re-embed, or index change and nobody notices (principle 3).
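The inspection step can be sketched as a small triage helper. This is a minimal sketch, not any toolkit's API: `RetrievedChunk`, `rank_of`, `triage`, and the `usable_rank` cutoff of 3 are hypothetical names and values chosen for illustration. It answers the splitting question for one failing run: was the right chunk in the set, and where?

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str  # hypothetical stable identifier for the chunk
    source: str    # document or record the chunk was cut from
    text: str      # exact text the model was handed


def rank_of(chunks: list[RetrievedChunk], expected_id: str) -> Optional[int]:
    """Return the 1-based rank of the expected chunk, or None if it is absent."""
    for rank, chunk in enumerate(chunks, start=1):
        if chunk.chunk_id == expected_id:
            return rank
    return None


def triage(chunks: list[RetrievedChunk], expected_id: str, usable_rank: int = 3) -> str:
    """Classify a failing run from the retrieved set alone."""
    rank = rank_of(chunks, expected_id)
    if rank is None:
        return "recall miss"
    if rank > usable_rank:
        return "ranked too low"
    return "present at usable rank"
```

If `triage` reports the chunk present at a usable rank and the answer is still wrong, the remaining suspects are staleness, permissions, or the model itself.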
It's confidently wrong
How to recognize it. The answer states something false, fluently, and presents it as certain. This is the most dangerous grounding failure because it looks exactly like a correct answer — the only tell is that the content is wrong, and someone may act on it before anyone notices.
What to inspect. Suspect retrieval before the model. Pull the chunks the model was handed this run and check one thing first: was the right chunk even in the set? If it wasn't, stop — this is a retrieval problem, not a model one, and no prompt edit recovers a fact the model never received (grounding gotchas gotcha 4, principle 2). If the right chunk was in the set but the answer is still wrong, the failure is either position (covered below) or genuinely the model, and now you know which. Read the retrieved chunks before you read the prompt.
The fix. If the right chunk was missing, fix the index, the chunking, or the query, not the prompt. Retrieval came back empty or wrong because of something upstream: a chunk split that destroyed the fact, an embedding model that doesn't resolve this corpus, a query phrased nothing like the source. Fix the layer that lost the chunk (the rest of this page is the per-symptom version of that), and make "I don't have that" a path the system is allowed to take: a grounded system that can't say "it's not in the knowledge base" will invent an answer every time retrieval comes back empty. If a human analyst couldn't answer the question from the same retrieved chunks, the bug is the grounding, not the model, which only inherited the gap.
Make it stick. Add the query to the retrieval eval set with the chunk that should answer it, and assert that chunk now comes back. Score it on the retrieval — did the right chunk return — not on how confident the final answer sounds, because confidence is exactly what this failure fakes. Include at least one query whose right answer is "not in the knowledge base," so a regression that brings the confident hallucination back fails loudly.
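One way to encode such a case, as a sketch under stated assumptions: `EvalCase` and `case_passes` are hypothetical names, and the `abstained` flag assumes your system can report that it declined to answer. The score depends only on what retrieval returned and whether the system abstained, never on how confident the answer sounds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    query: str
    # None marks a query whose right answer is "not in the knowledge base".
    expected_chunk_id: Optional[str]


def case_passes(case: EvalCase, retrieved_ids: list[str], abstained: bool) -> bool:
    """Score on retrieval and abstention, never on answer confidence."""
    if case.expected_chunk_id is None:
        # Nothing in the corpus answers this, so declining is the only pass.
        return abstained
    return case.expected_chunk_id in retrieved_ids
```

The `None` case is the one that fails loudly if a regression brings the confident hallucination back.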
The right chunk never came back (recall miss)
How to recognize it. You inspected the set and the chunk that answers the question simply isn't in it — not buried, not low, absent. The corpus contains the answer; retrieval never surfaced it. This is the failure underneath most confidently-wrong answers, which is why it gets its own section.
What to inspect. Walk it back up the pipeline (see chunking and embeddings). Three suspects, checked in order. First, chunking: find the source document and look at where the boundary fell — a split through the middle of the fact leaves both halves unretrievable, and a chunk too small to carry its own context never lands near the query. Second, embedding fit: a general model on a corpus thick with codes, jargon, or domain terms maps the chunk and the query to vectors that don't sit near each other. Third, the query itself: phrased in language far from how the source is written, or carrying an exact token (a SKU, an error code, a surname) that semantic search quietly drops.
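The first suspect, a boundary that split the fact, can be checked mechanically. The sketch below is illustrative only: `fixed_size_chunks` is a deliberately naive character splitter standing in for whatever chunker you actually run, `fact_is_split` is a hypothetical helper, and exact substring matching is a simplification that assumes you can quote the fact verbatim from the source.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Naive character-window chunker, used here only to demonstrate the check."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def fact_is_split(source_text: str, chunks: list[str], fact: str) -> bool:
    """True when the source contains the fact but no single chunk does."""
    if fact not in source_text:
        raise ValueError("fact not found in source; check the wording first")
    return not any(fact in chunk for chunk in chunks)


source = "Returns are accepted within 30 days of delivery."
print(fact_is_split(source, fixed_size_chunks(source, 24), "within 30 days"))
print(fact_is_split(source, fixed_size_chunks(source, 24, overlap=12), "within 30 days"))
```

In this toy input the 24-character boundary falls inside the word "within", so the first call reports a split; with 12 characters of overlap one chunk carries the whole phrase and the second call does not.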
The fix. Match the fix to the suspect the inspection indicted (retrieval quality is the lever catalogue). If the boundary split the fact, re-chunk on natural seams and add a little overlap; if a chunk lost its context, Contextual Retrieval restores it by prepending that context to the chunk before it is indexed. If the embedding model doesn't fit the domain, that's the choice to revisit; remember that query and documents must share one model, so changing it means re-embedding the corpus. If an exact term is missing, add hybrid search so BM25 catches the literal token semantic search drops. If the query is vague or full of pronouns, add query rewriting before retrieval. Pull the cheapest lever that moves recall first: a larger k or better boundaries often fixes what you were about to build a re-ranker for.
Make it stick. Add the query and its correct chunk to the retrieval eval set and assert recall: the chunk now comes back, anywhere in the returned set. Recall is the first thing to lock, because nothing downstream — no re-ranker, no prompt — recovers a chunk that was never retrieved. The eval is what proves the re-chunk or the hybrid-search add fixed the real miss and didn't quietly drop a chunk that used to land.
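A recall assertion over the whole set might look like the following sketch. `Retriever` is a hypothetical callable (query in, ranked chunk ids out) standing in for your real retrieval call, and `recall_failures` is an illustrative name. Running every case, not just the new one, is what catches a neighbor the fix quietly dropped.

```python
from typing import Callable

# Hypothetical interface: a query goes in, ranked chunk ids come out.
Retriever = Callable[[str], list[str]]


def recall_failures(cases: dict[str, str], retrieve: Retriever) -> list[str]:
    """Return the queries whose expected chunk is absent from the returned set."""
    return [
        query
        for query, expected_id in cases.items()
        if expected_id not in retrieve(query)
    ]
```

An empty list means every labelled chunk came back somewhere in its returned set; anything else names the queries to inspect next.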
The right chunk came back but ranked too low (position)
How to recognize it. You inspected the set and the right chunk is there — at position 9 of 10, or anywhere near the bottom of the window. Recall is fine; the fact was retrieved. But the model under-weighted it, because it competes with the distractors above it for attention and a long context buries what sits low. A right chunk ranked low is a near-miss, not a hit.
What to inspect. Read the rank of the correct chunk against what sits above it. The chunks ranked higher are the distractors stealing the model's attention — usually plausible-but-irrelevant matches that scored well on raw similarity. Confirm the shape: the answer-bearing chunk is present and low, and the top few slots are taken by chunks that don't actually answer the question. That's a precision-and-position problem, not a recall one, and it has a different fix.
The fix. Add re-ranking (retrieval quality). Retrieve a wider candidate set cheaply, then run a re-ranker — a model that scores each candidate against the query for relevance — and reorder so the right chunk lands on top. The first pass casts wide for recall; the re-ranker buys precision and position. The fact was already there; you're just moving it to where the model will actually use it. Metadata filtering helps too — narrowing the pool before ranking means fewer distractors competing for the top slots in the first place.
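The two-pass shape can be sketched as follows. The names are hypothetical, and `term_overlap` is a toy stand-in for a real re-ranking model, included only so the example runs; in practice the `score` callable would wrap whichever re-ranker you adopt.

```python
from typing import Callable

# Hypothetical interface: (query, chunk text) -> relevance score, higher is better.
Scorer = Callable[[str, str], float]


def rerank(query: str, candidates: list[str], score: Scorer, top_n: int) -> list[str]:
    """Reorder a wide candidate set by relevance and keep the best top_n."""
    ranked = sorted(candidates, key=lambda text: score(query, text), reverse=True)
    return ranked[:top_n]


def term_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query terms that appear in the chunk."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0
```

The first pass can stay cheap and wide because `rerank` only has to order what it is given; it cannot recover a chunk the first pass never returned.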
Make it stick. Add the query to the retrieval eval set and assert position, not just recall: the right chunk comes back near the top of the window, not merely somewhere in it. Position regressions are silent — a re-chunk or an embedding change can quietly push the right chunk back down while recall still passes — so the assertion has to check where the chunk landed, or the bug walks back in looking healthy.
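The difference between a recall assertion and a position assertion is small in code and large in effect. A minimal sketch, with `position_ok` and the `max_rank` threshold as hypothetical choices you would tune to your own context window:

```python
def position_ok(retrieved_ids: list[str], expected_id: str, max_rank: int) -> bool:
    """Pass only if the expected chunk sits within the top max_rank results."""
    return expected_id in retrieved_ids[:max_rank]


retrieved = [f"c{i}" for i in range(1, 11)]
recall_passes = "c9" in retrieved                          # present somewhere
position_passes = position_ok(retrieved, "c9", max_rank=3)  # present near the top
```

Here `recall_passes` is true while `position_passes` is false, which is exactly the regression a recall-only eval lets through.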
The answer is stale
How to recognize it. The answer is correct against the index and wrong against reality — it reflects a price, a policy, or a record as it was, not as it is. It's the hardest grounding failure to catch, because retrieval looks perfectly healthy: the right chunk came back, ranked well, and the model grounded on it faithfully. The chunk itself is just out of date with the source everyone else is reading.
What to inspect. Compare the retrieved chunk's text against the live source. If they disagree, the index is lagging — it still holds the embedding of the old text because no re-index has run since the source changed (grounding gotchas gotcha 3). Then inspect the refresh path: what is supposed to trigger a re-index when this source changes, when did it last run, and how stale is this source allowed to get? A vector index is a cache of your corpus — wrong the moment the source changes, right again only after a refresh.
The fix. Fix the freshness contract, not the prompt. Decide how an edit reaches the index, how long the lag is allowed to be per source, and which sources change often enough that a nightly rebuild is already too stale — then wire the trigger that holds it. Where a full re-embed of the corpus on every edit doesn't scale, update incrementally: re-embed only what changed. The goal is that an edit to the source reaches the index inside the staleness you decided you can tolerate, with a way to notice when it doesn't.
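Incremental updating needs a way to tell which sources changed. One common approach, sketched here as an assumption and not as the method any particular platform uses, is to store a content hash per document at index time and compare it against the live text. `content_hash` and `stale_doc_ids` are hypothetical helpers.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_doc_ids(indexed_hashes: dict[str, str], live_docs: dict[str, str]) -> set[str]:
    """Ids whose live text differs from what was indexed, plus ids never indexed."""
    return {
        doc_id
        for doc_id, text in live_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    }
```

Only the returned ids need re-embedding. This sketch does not handle documents deleted from the source, which would need a separate pass over the ids in `indexed_hashes`.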
Make it stick. A retrieval eval case alone can't catch staleness, because the eval's labelled chunk goes stale with the index. Add a freshness check instead: a fixed source you edit on a schedule, then assert retrieval returns the new text within the lag the contract promises. That turns "the index silently drifted from the truth" into a failing test the day the refresh path breaks, instead of a wrong answer a user finds for you.
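The assertion at the heart of that freshness check is a time comparison. A minimal sketch, assuming your harness records when the canary source was edited and when retrieval first returned the new text; `freshness_ok` and its parameter names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def freshness_ok(
    edited_at: datetime,
    retrieved_new_text_at: Optional[datetime],
    max_lag: timedelta,
) -> bool:
    """Pass if retrieval returned the edited text within the allowed lag."""
    if retrieved_new_text_at is None:
        return False  # the new text never showed up in retrieval
    return retrieved_new_text_at - edited_at <= max_lag
```

The `max_lag` value is the number the freshness contract promises for that source, so a different source can carry a different threshold.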
A wrong-user chunk surfaced
How to recognize it. Retrieval returned something the user was never allowed to see — a chunk from a document they can't open through the front door, surfaced through the agent. The answer may even be correct; the failure is that this user got it at all. It looks like a helpful answer and it is a data-exposure incident.
What to inspect. Check where the permission filter runs — and confirm it's at retrieval, not the UI. Pull the chunks returned for this user and ask whether each one belongs to a source they have access to. If a restricted chunk made it into the set, the index flattened your corpus into one searchable space and flattened the permissions with it: the retriever returned a chunk regardless of who was asking, semantic search acting as a confused deputy (grounding gotchas gotcha 8). Inspecting only the UI hides this — the chunk was already exposed the moment retrieval returned it.
The fix. Filter by the requesting user's permissions before ranking, on metadata you indexed for exactly this — so a document the user can't open is also a chunk the retriever won't return to that user. Enforcing visibility only at the interface and trusting the index to be discreet is how grounding becomes an exfiltration path. On the Agentforce side, retrievers running inside the Salesforce security model honor the running user's sharing rules by construction; an external pipeline has to build that filter itself, and it is not optional.
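For an external pipeline, the order of operations is the point: filter on permission metadata first, rank second. The sketch below assumes group-based access stored on each chunk as `allowed_groups`; that field, `Candidate`, and `retrieve_for_user` are hypothetical, and a real system would mirror whatever access model the source documents already use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    chunk_id: str
    score: float                    # similarity score from the first pass
    allowed_groups: frozenset[str]  # hypothetical permission metadata indexed with the chunk


def retrieve_for_user(candidates: list[Candidate], user_groups: set[str], k: int) -> list[Candidate]:
    """Drop chunks the user cannot access, then rank what is left."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]
```

Because the filter runs before the sort and the cut, a restricted chunk never occupies a slot in the returned set, however well it scores.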
Make it stick. Add a permission case to the retrieval eval set: a query that would surface a restricted chunk, run as a user who lacks access, asserting that chunk is unreachable. This is the one eval case where the right outcome is the chunk not coming back. Without it, a re-index that drops the permission metadata, or a filter that regresses, re-opens the exfiltration path silently — and a security regression is exactly the kind that ships unnoticed until someone reads what they shouldn't.
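The permission case reduces to an audit of the returned set against the requesting user's access. A sketch under the same group-based assumption: `leaked_ids` is a hypothetical helper, and `retrieved` maps each returned chunk id to the groups allowed to see it.

```python
def leaked_ids(retrieved: dict[str, set[str]], user_groups: set[str]) -> list[str]:
    """Chunk ids in the returned set that none of the user's groups may access."""
    return [
        chunk_id
        for chunk_id, allowed in retrieved.items()
        if not (allowed & user_groups)
    ]
```

The eval passes only when this list is empty for a user who lacks access, which is the reverse of every other case on this page.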
The throughline
Five failures, one method underneath: pull the chunks retrieval actually returned, find which of the five shapes you're looking at — missing, low, stale, wrong-user, or a genuine model miss — fix the layer the chunks actually indict, and lock the fix behind a retrieval eval case so it can't quietly come back. The reason guessing fails and inspecting wins is that a grounded answer is two systems compounded, and the one most teams reword is almost never the one that broke. Every failure above is a place the demo's single happy retrieval diverged from production's retrieve-from-everything, and the retrieved chunks are where you find the divergence instead of theorizing about it. The same habit runs through a full agent run — see debugging agents, which traces the whole loop the same way.
If a grounding failure bit your team and isn't here, write to hello@wearecleon.com — we add it, with credit.
Related
- Grounding gotchas — the failure modes this page operationalizes, each with the question to answer first
- Retrieval quality — the metrics and levers these fixes pull, measured separately from the answer
- Chunking and embeddings — the upstream inputs a recall miss usually traces back to
- What is grounding — the vocabulary the symptoms are named in
- Agentforce retrievers — inspecting retrieval inside the Salesforce security model
- External RAG — inspecting retrieval in a pipeline over your own documents
- Grounding Style Guide — the bar that keeps these debug sessions from happening twice
- Debugging agents — the same method for a full agent run, not just the retrieval
- AI Engineering principles — ground before you generate (2), evaluate before you ship (3), trace everything (11)