What is grounding? The retrieval pipeline an answer stands on
Grounding is feeding the model real retrieved facts instead of letting it answer from training — and RAG, retrieval-augmented generation, is how it's built. The pipeline end to end: chunk → embed → store → retrieve → augment → generate, each stage in a sentence. The vocabulary the rest of this subcategory uses — chunk, embedding, vector store, semantic search, hybrid search, top-k, re-ranking — and the honest test for when you need retrieval at all: only when the answer lives in data the model wasn't trained on or that changes. Principle 2: ground before you generate.
An agent or a single prompt reasons over whatever you hand it. Grounding is the practice of handing it real facts — retrieved from your data, a database, a document set, an API — instead of letting it answer from what the model absorbed in training. The one-paragraph version of this lives in what is an agent; this page goes deep, because grounding is the foundation almost every other quality problem traces back to. An ungrounded model answers from its parameters and guesses confidently when those run out. A grounded model answers from what you put in front of it this run. That difference is principle 2 — ground before you generate — and it is the difference between an answer that is fluent and an answer that is true.
The standard way to build grounding has a name: RAG, retrieval-augmented generation. The word "augmented" is the whole idea — you augment the prompt with retrieved facts before the model generates. RAG is not a product or a library; it is a pipeline, six stages from raw documents to a grounded answer. This page walks that pipeline once, end to end, lays out the vocabulary the rest of this subcategory leans on, and draws the honest line for when you need retrieval at all. The depth of each stage lives in the sibling pages; here the job is the shape of the whole.
The pipeline, end to end
RAG splits cleanly in two. The first three stages happen ahead of time, when you index your knowledge. The last three happen at query time, every time someone asks a question. Confusing the two is where a lot of muddled RAG designs start.
- Chunk — split your source documents into passages small enough to retrieve and feed precisely. A whole 40-page PDF is too coarse to be a unit of retrieval; a single sentence is too fine to carry context. Chunking is the decision of how to cut, and it is more consequential than it looks — see chunking and embeddings.
- Embed — turn each chunk into an embedding: a vector, a list of numbers, that encodes the chunk's meaning so that passages about the same thing land near each other in vector space. This is what lets you search by meaning instead of by exact words.
- Store — load those vectors into a vector store (a vector index) so you can search them fast. The index is built ahead of time and reused on every query; new or changed documents get in only when you update or rebuild it.
- Retrieve — at query time, embed the user's question the same way, then pull the chunks whose vectors sit closest to it. This is the heart of RAG, and the stage that most often decides whether the answer is right — retrieval quality is its own discipline.
- Augment — inject the retrieved chunks into the prompt, alongside the question and the instructions, so the model has the facts in front of it. This is the "augmented" in RAG: the prompt the model sees is the question plus what retrieval found.
- Generate — the model reads the augmented prompt and writes the answer, grounded in the chunks you handed it rather than in training. Done right, you can trace each claim in the answer back to a chunk that supports it.
The split matters because the two halves fail differently and get fixed in different places. A bad answer that traces to what you indexed — wrong chunking, stale documents, gaps in coverage — is fixed ahead of time, in the first three stages. A bad answer that traces to what got retrieved this run — the wrong chunks came back, or none did — is fixed at query time, in retrieval. Knowing which half you are in is half the debugging.
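The six stages can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy bag-of-words counter standing in for a real embedding model (so it matches shared words, not meaning), the "vector store" is a plain list, and every function name here is hypothetical. The generate stage is left out; the prompt that `augment` returns is what would be sent to a model.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 40) -> list[str]:
    """Naive fixed-size chunking: cut the text every max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """ASSUMPTION: toy stand-in for an embedding model (word counts, not meaning)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Ahead of time: chunk, embed, store.
def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

# Query time: retrieve, augment (generate would follow).
def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    q = embed(question)  # embed the question the same way as the chunks
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def augment(question: str, chunks: list[str]) -> str:
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

docs = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping takes 3 to 5 business days.",
]
index = build_index(docs)                      # index time, done once
question = "How long do refunds take?"
prompt = augment(question, retrieve(index, question, k=1))  # query time, every run
```

Numbering the chunks in the prompt is one simple way to keep each claim traceable to the passage that supports it. Note that `build_index` and `retrieve` never call each other, which mirrors the split: one runs when documents change, the other on every question.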
When you need grounding — and when you don't
Here is the honest test, parallel to the one for agents: does the answer depend on facts the model wasn't trained on, or that change?
If yes, you need grounding. Anything about your data — a customer's order history, this quarter's policy, a document written last week, a record in your CRM — is by definition not in the model's training, and no amount of prompt cleverness conjures it. The model will either say it doesn't know or, worse, invent something plausible. Retrieval is the only honest way to put those facts in reach. The same goes for anything that changes faster than models are retrained: prices, inventory, status, the current state of any system of record.
If no, you don't. A self-contained task — one where everything the model needs is already in the prompt — needs no retrieval at all. Summarize this email. Classify this ticket. Rewrite this paragraph in a plainer register. Extract the dates from this contract. The input carries its own facts; there is nothing to go fetch. Bolting RAG onto a self-contained task adds latency, cost, and a new failure mode (bad retrieval) in exchange for nothing. The discipline mirrors the one for agents: reach for retrieval only when the answer genuinely lives outside the prompt, and not one step sooner.
The vocabulary
This subcategory leans on a small set of terms. Here they are once, plainly; you will meet each in depth as the subcategory goes on.
- Chunk — a passage of a source document, sized to be a unit of retrieval. Too big and it dilutes; too small and it loses context.
- Embedding — a vector that encodes a chunk's meaning, so that semantically similar text sits close together in vector space. Produced by an embedding model, which is a different model from the one that generates the answer.
- Vector store / vector index — the database that holds the embeddings and answers nearest-neighbour searches over them quickly. Sometimes a dedicated service, sometimes a feature of a database you already run.
- Semantic search — retrieval by meaning: embed the query, find the chunks whose vectors are nearest. It catches paraphrases and synonyms that exact-word search misses.
- Keyword / BM25 search — retrieval by term overlap, the classic lexical approach. It is strongest at exact matches (names, codes, error strings, IDs), which is exactly where semantic search can go soft.
- Hybrid search — running semantic and keyword retrieval together and merging the results, to get meaning and exact matches. In production this is usually the right default rather than either alone.
- Top-k — how many chunks you retrieve per query. Pull the top-5 and you hand the model five passages; pull too many and you drown the signal and burn context budget (principle 10); pull too few and you miss the chunk that held the answer.
- Re-ranking — a second pass that takes the retrieved candidates and reorders them by relevance with a more careful (and more expensive) model, so the best chunks land first and the merely-nearby ones fall away.
- Context window — the finite space in the prompt the retrieved chunks have to share with the instructions and the question. Retrieval doesn't get the whole window; it gets what's left after everything else, which is why top-k and re-ranking matter.
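Three of these terms interact at query time: hybrid search produces two ranked lists, something has to merge them, and top-k has to fit the context budget. The sketch below assumes reciprocal rank fusion as the merge step, which is one common choice and not something this page prescribes. The chunk IDs, the constant `c = 60`, and the word-count budget (a rough proxy for tokens) are all illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Merge ranked lists of chunk IDs: each list contributes 1 / (c + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

def take_top_k(ranked_ids: list[str], chunks: dict[str, str], k: int, budget_words: int) -> list[str]:
    """Keep at most k chunks, stopping early once the word budget would be exceeded."""
    selected: list[str] = []
    used = 0
    for cid in ranked_ids[:k]:
        cost = len(chunks[cid].split())  # ASSUMPTION: words as a stand-in for tokens
        if used + cost > budget_words:
            break
        selected.append(cid)
        used += cost
    return selected

# Hypothetical results from a semantic retriever and a keyword retriever.
semantic_hits = ["a", "b", "c"]
keyword_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])

chunks = {
    "a": "refund window explained",
    "b": "return shipping",
    "c": "error code E42 meaning",
    "d": "warranty terms",
}
selected = take_top_k(fused, chunks, k=3, budget_words=7)
```

Chunks that both retrievers agree on ("a" and "c") rise to the top of `fused`, and the budget check drops the third candidate even though k would have allowed it. A re-ranker, if used, would sit between the two functions.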
How this ties to agents
An agent reads the world through retrieval the same way a single grounded prompt does — its tools fetch, its answers stand on what comes back. Every grounding failure in this subcategory is therefore an agent failure too: when an agent is confidently wrong, the cause is far more often what retrieval handed it than the model that reasoned over it. That exact failure is gotcha 8 in agent gotchas — a clever prompt over empty or wrong retrieval is a confident hallucination that sounds identical to a correct answer. Grounding is upstream of it. Get retrieval right and a whole class of "the model is wrong" bugs disappears, because they were never the model's fault.
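One way to keep empty or weak retrieval from turning into a confident answer is to check what came back before building the prompt. This is a hedged sketch, not a prescribed pattern: the `Hit` type, the function name, and the `min_score` threshold of 0.3 are hypothetical, and a real threshold would need tuning against your own retriever's score scale.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hit:
    text: str
    score: float  # similarity score from the retriever; higher means closer

def build_grounded_prompt(question: str, hits: list[Hit], min_score: float = 0.3) -> Optional[str]:
    """Return an augmented prompt, or None when nothing usable was retrieved."""
    usable = [h for h in hits if h.score >= min_score]
    if not usable:
        # Nothing to ground on: the caller should say so instead of generating.
        return None
    context = "\n".join(f"[{i}] {h.text}" for i, h in enumerate(usable, start=1))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

hits = [
    Hit("Refunds are issued within 14 days.", 0.82),
    Hit("Shipping takes 3 to 5 days.", 0.12),
]
prompt = build_grounded_prompt("What is the refund window?", hits)
empty = build_grounded_prompt("What is the refund window?", [])
```

Returning `None` pushes the decision to the caller, whether that is an agent or a single prompt, so the "no grounding available" case is handled explicitly instead of being papered over by a fluent guess.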
This is why the subcategory exists as a peer to agents rather than a footnote inside it. The model supplies fluency; grounding supplies the facts; and the line between an AI that is impressive in a demo and one that is correct on Monday morning runs straight through the retrieval pipeline.
Where to go next
From here, there are two directions into the depth. The ahead-of-time half (how to cut documents and what embeddings actually encode) is chunking and embeddings. The query-time half (why the wrong chunks come back and how to measure and fix it) is retrieval quality. The failure modes worth knowing before you hit them in production, the grounding equivalent of the agent gotchas, are in grounding gotchas.
The platform side splits the same way agents do: into complementary tools an engineer composes, not rivals to choose between. Inside the Salesforce security model, grounding is built with Agentforce retrievers over the Data 360 profile; a clean, governed data model is the precondition, which is the Data 360 agent-readiness check. Outside it, grounding is a RAG pipeline over your own vector store, wired with LangChain and the Claude API. Both build the same pipeline this page walks; they differ only in where the data lives and who governs it.
Related
- Chunking and embeddings — the ahead-of-time half: how to cut documents, and what an embedding actually encodes
- Retrieval quality — the query-time half: why the wrong chunks come back, and how to measure and fix it
- Grounding gotchas — the failure modes — empty retrieval, stale index, confident hallucination — and how to catch them
- What is an agent — the system whose answers stand on grounding
- Agent gotchas — gotcha 8 is the grounding failure, seen from the agent side
- Grounding Style Guide — the bar a grounded system clears before it ships
- Debugging grounding — tracing a bad answer to its retrieval
- AI Engineering principles — ground before you generate (2), the system is the job (4), context is a budget (10)