Chunking and embeddings: the inputs retrieval quality depends on
The two upstream levers retrieval quality stands on: chunking and embeddings. How you split a document — fixed-size, structural, semantic — and the chunk-size and overlap trade-offs that either preserve or destroy meaning, including Anthropic's Contextual Retrieval. What an embedding is, why the embedding model is a real choice (dimension, cost, latency, domain fit), and why query and document must share one model. Anthropic ships no first-party embedding model — you pair Claude with a provider. Get these wrong and no retrieval tuning saves you.
Retrieval gets the attention, but it can only ever return what you put in front of it. Two steps happen before a single query runs, and both decide the ceiling on everything after: you chunk the documents — split them into pieces — and you embed each piece — turn it into a vector that captures its meaning. Get either wrong and there is no retrieval tuning, no reranker, no clever prompt that recovers it. A sharp query over badly-chunked, badly-embedded data is lipstick on a miss (principle 2 — ground before you generate; principle 4 — the model is the easy part). This page is about those two upstream levers, in order, because they are where retrieval quality is won or lost.
This is a reference, not a recipe. The techniques below are the ones that hold up; the right dial settings depend on your documents, and finding them is what an eval set is for (see retrieval quality).
Why you chunk at all
You chunk because you cannot embed a whole document as one vector and expect it to be useful. An embedding compresses its input into a single point in space; the longer and more varied the input, the more meaning gets averaged away until the vector means nothing in particular. A query asking about one paragraph buried in a forty-page contract will not land near a vector that smeared all forty pages together. So you split the document into pieces small enough that each one is about something, and you embed those.
The whole game of chunking is one trade-off, run at the boundary of every chunk:
- Too big dilutes the embedding and wastes context. A chunk spanning three unrelated topics produces a vector that is near none of them, and when it does get retrieved it spends your context budget on the two-thirds the query did not ask about (principle 10 — context is a budget, not a bucket).
- Too small loses the context a chunk needs to make sense. A two-sentence fragment that says "the limit is 500 per day" is useless if the document never repeats what has that limit; the chunk no longer carries enough to answer anything on its own.
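The dilution half of the trade-off can be illustrated with a toy calculation. This is a minimal sketch under a loud assumption: real embedding models do not literally average topic vectors, and the three orthogonal "topic" vectors below are hypothetical stand-ins chosen to make the arithmetic visible.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Component-wise mean, used here as a crude model of a mixed-topic chunk."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Hypothetical, mutually unrelated topics as orthogonal unit vectors.
billing = [1.0, 0.0, 0.0]
shipping = [0.0, 1.0, 0.0]
security = [0.0, 0.0, 1.0]

focused_chunk = billing                                     # chunk about one topic
mixed_chunk = mean_vector([billing, shipping, security])    # chunk spanning three

query = billing
focused_score = cosine(query, focused_chunk)  # 1.0
mixed_score = cosine(query, mixed_chunk)      # 1 / sqrt(3), about 0.577
```

In this toy model the three-topic chunk scores about 0.58 against a query that the single-topic chunk matches exactly, which is the "near none of them" effect in miniature.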
Most teams discover this trade-off by shipping the first number that came to mind — split every 1000 characters — and then wondering why retrieval is mediocre. The fix is not a magic chunk size. It is matching the split to the shape of your documents.
Chunking strategies
Three strategies, from blunt to sharp. They are complementary — most real pipelines mix them — not a ladder where the last one wins.
- Fixed-size — split every N tokens, usually with an overlap so a sentence cut at a boundary still appears whole in the next chunk. It is the simplest to build and the least aware of meaning: it will happily slice a table in half or end a chunk mid-clause. Fine as a baseline, and sometimes all that prose-shaped documents need.
- Structural — split on the document's own structure: by heading, by section, by paragraph, by the rows of a table. This respects the boundaries the author already drew, so a chunk tends to be about one thing because the writer made it about one thing. It is the strategy that pays off most on documents with real structure — docs, knowledge-base articles, contracts with numbered clauses.
- Semantic — split where the meaning shifts, detected by measuring when consecutive sentences stop being similar to each other. More expensive to compute and more involved to run, it can find boundaries neither fixed-size nor structure can see — a topic change mid-section that no heading marks. Reach for it when structure is thin and the fixed-size baseline is leaving meaning on the floor.
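The semantic strategy can be sketched as follows. This is an illustration only: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, and the `threshold` of 0.2 is an arbitrary value picked for this example, not a recommended setting.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk whenever consecutive sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    previous = embed(sentences[0])
    for sentence in sentences[1:]:
        current = embed(sentence)
        if cosine(previous, current) < threshold:
            chunks.append([])  # meaning shifted: open a new chunk
        chunks[-1].append(sentence)
        previous = current
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "Refunds are issued within five days.",
    "Refunds are sent to the original card.",
    "Passwords must be rotated every ninety days.",
    "Passwords must be at least twelve characters.",
]
chunks = semantic_chunks(sentences)  # two chunks: refunds, then passwords
```

With a real embedding model the similarity scores and a sensible threshold will differ, so the threshold is something to tune against an eval set rather than copy from here.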
Overlap deserves its own line because it is the cheapest fix for the most common chunking bug. With zero overlap, a fact that straddles a boundary — a sentence whose subject is in chunk one and predicate in chunk two — is retrievable from neither. A modest overlap (a sentence or two repeated at each edge) buys back most of that loss for a small storage cost. It is not free, and stuffing huge overlaps to be safe just reintroduces the too-big problem, but a little is almost always right.
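A minimal sketch of fixed-size splitting with overlap, assuming whitespace-separated words as a rough stand-in for tokens (a real pipeline would use the tokenizer that matches its embedding model). The function name and parameters are illustrative.

```python
def fixed_size_chunks(tokens, size, overlap):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks

# Whitespace split is an approximation of tokenization, used only for illustration.
words = "Order creation is rate limited. The limit is 500 per day.".split()
chunks = fixed_size_chunks(words, size=6, overlap=2)
```

Here the second chunk begins with the last two words of the first, so a sentence cut at the boundary still appears whole in one of them.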
Contextual Retrieval: the technique worth knowing
Even a well-placed chunk loses the context of the document it came from. "The limit is 500 per day" embeds the same whether it is about API calls or login attempts, because the chunk alone does not say. Contextual Retrieval, a technique Anthropic published, fixes this directly: before you embed each chunk, you prepend a short, chunk-specific summary that situates it in its source document. The chunk is no longer a naked fragment; it carries its own context into the vector.
The contextual prefix is generated per chunk — typically 50 to 100 tokens — describing where the chunk sits and what it refers to. A bare chunk and its contextualized form make the idea concrete:
# Bare chunk (what naive splitting embeds)
The limit is 500 per day.
# Contextualized chunk (Contextual Retrieval prepends a summary, then embeds this)
This chunk is from the Rate Limits section of the Orders API reference,
describing the per-key ceiling on order-creation calls.
The limit is 500 per day.
The contextualized version embeds to a far more useful point in space: a query about API rate limits now lands near it, where the bare version was ambiguous. Anthropic reported that contextual embeddings alone reduced the top-20-chunk retrieval failure rate by 35 percent (and by 49 percent when combined with contextual BM25 keyword search), which is a large win for a preprocessing step you run once. It costs an extra generation per chunk at index time: a real cost, but a one-time one, and prompt caching over the source document keeps it modest. When retrieval is missing chunks that are clearly relevant, this is one of the highest-leverage fixes available. See the reference link below for Anthropic's full write-up and numbers.
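The index-time step might look roughly like the sketch below. Everything named here is an assumption for illustration: `generate` is a hypothetical callable (prompt in, text out) standing in for a model call, and the prompt wording is a paraphrase of the idea, not Anthropic's published prompt.

```python
def build_context_prompt(document, chunk):
    """Assemble a prompt asking a model to situate one chunk within its source document."""
    return (
        "<document>\n" + document + "\n</document>\n"
        "Here is a chunk from that document:\n"
        "<chunk>\n" + chunk + "\n</chunk>\n"
        "Write a short context that situates this chunk within the document "
        "to improve search retrieval. Answer with the context only."
    )

def contextualize(document, chunk, generate):
    """Return the text to embed: generated context first, then the original chunk."""
    context = generate(build_context_prompt(document, chunk)).strip()
    return context + "\n" + chunk

document = (
    "Orders API reference. Rate Limits: each API key may create orders. "
    "The limit is 500 per day."
)
chunk = "The limit is 500 per day."

def fake_generate(prompt):
    # Canned reply so the sketch runs offline; a real pipeline would call a model here.
    return "This chunk is from the Rate Limits section of the Orders API reference."

text_to_embed = contextualize(document, chunk, fake_generate)
```

Because the document portion of the prompt is identical for every chunk from the same source, it is the part that prompt caching can reuse across calls.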
What an embedding actually is
An embedding is a vector (a list of numbers, often a few hundred to a couple thousand of them) that represents a piece of text as a single point in a high-dimensional space. The model that produces it is trained so that text with similar meaning lands near other similar text, and unrelated text lands far away. "Cancel my subscription" and "how do I end my plan" land in nearly the same place despite sharing almost no words; "cancel my subscription" and "the weather is nice" land far apart. That is the whole trick retrieval runs on: embed the query, embed every chunk, and the chunks whose vectors sit nearest the query vector are your candidates (the top-k; see retrieval quality).
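The nearest-vector lookup can be sketched in a few lines. The three-dimensional vectors below are hand-made for illustration, not output from any real model, and the chunk ids are hypothetical; cosine similarity is used here as one common choice of nearness measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [chunk_id for _, chunk_id in scored[:k]]

# Hand-made toy vectors; a real system would get these from an embedding model.
chunk_vecs = {
    "cancel-plan": [0.9, 0.1, 0.0],
    "billing-faq": [0.6, 0.6, 0.1],
    "weather": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

candidates = top_k(query_vec, chunk_vecs, k=2)  # ["cancel-plan", "billing-faq"]
```

This brute-force scan compares the query against every chunk; at scale a vector index replaces the loop, but the ranking idea is the same.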
The number of dimensions — call it around 512, though models offer a range — is one knob. More dimensions can capture finer distinctions but cost more to store and compare across millions of chunks. It is a real engineering trade-off, not a "bigger is better" dial.
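The storage side of that trade-off is simple arithmetic. A rough sketch, assuming each value is stored as a 4-byte float and ignoring index overhead and metadata; the corpus size of one million chunks is a hypothetical figure.

```python
def index_size_bytes(num_chunks, dimensions, bytes_per_value=4):
    """Raw vector storage: one value per dimension per chunk (4 bytes assumes float32)."""
    return num_chunks * dimensions * bytes_per_value

NUM_CHUNKS = 1_000_000  # hypothetical corpus size

sizes_gib = {
    dims: index_size_bytes(NUM_CHUNKS, dims) / 2**30
    for dims in (512, 1024, 2048)
}
# 512 dimensions comes to about 1.9 GiB; each doubling of dimensions doubles the total.
```

Brute-force comparison cost scales the same way, since each similarity computation touches every dimension of both vectors.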
The embedding model is a choice
Picking an embedding model is a design decision with the same trade-offs as any other, and four axes carry most of it:
- Dimension — how long each vector is. Higher can mean finer semantic resolution; it also means more storage and slower similarity search at scale. Some models let you truncate to a shorter dimension when you would rather have speed than the last increment of accuracy.
- Cost — you pay per token embedded, and you embed your entire corpus once plus every query forever. On a large corpus this is a line item, not a rounding error (principle 6 — cost and latency are features).
- Latency — query embedding sits directly in your user-facing path. The query has to be embedded before retrieval can even start, so a slow embedding model is latency every user feels.
- Domain fit — a model trained on general web text may under-resolve a corpus thick with legal, medical, or domain jargon, where the distinctions that matter are exactly the ones general text blurs. Some providers ship domain-tuned models for precisely this.
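For the dimension axis, truncation might look like the sketch below. It assumes a model that was trained to support shortened vectors; truncating the output of a model that was not trained this way can discard meaning, so check the provider's documentation first. The function name is illustrative.

```python
import math

def truncate_and_normalize(vector, dims):
    """Keep the first `dims` values and rescale to unit length so cosine scores stay comparable."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector is all zeros")
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0]                   # toy vector, not real model output
short = truncate_and_normalize(full, 2)   # [0.6, 0.8]
```

Documents and queries would both need to be truncated to the same length for their distances to mean anything.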
One rule is not a trade-off but a hard constraint: the query and the documents must be embedded with the same model. Two different models produce vectors in two different spaces, and a distance computed across spaces is noise. If you change the embedding model, you re-embed the whole corpus — there is no mixing old document vectors with new query vectors. That re-embedding cost is part of the decision.
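One way to keep this constraint from being violated silently is to record the model alongside the index and check it on every write and query. A minimal sketch: the `VectorIndex` class and the model names are hypothetical, not any library's API.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Toy in-memory index that remembers which embedding model produced its vectors."""
    model_name: str
    dimensions: int
    vectors: dict = field(default_factory=dict)

    def add(self, chunk_id, vector, model_name):
        """Store a document vector, rejecting any produced by a different model."""
        self._check(vector, model_name)
        self.vectors[chunk_id] = vector

    def check_query(self, vector, model_name):
        """Raise before searching if the query was embedded with a different model."""
        self._check(vector, model_name)

    def _check(self, vector, model_name):
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}; "
                "re-embed the corpus before switching models"
            )
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

# Hypothetical model names, used only to show the guard firing.
index = VectorIndex(model_name="embed-model-v1", dimensions=3)
index.add("chunk-1", [0.1, 0.2, 0.3], model_name="embed-model-v1")

try:
    index.check_query([0.1, 0.2, 0.3], model_name="embed-model-v2")
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
```

Note that the dimension check alone is not enough: two different models can share a dimension count and still produce incompatible spaces, which is why the model name is compared first.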
The throughline
Chunking and embeddings are upstream of retrieval, and upstream problems do not announce themselves — they show up downstream as "retrieval is bad" and send you tuning the wrong layer. A chunk split so it lost its context, or embedded by a model that does not resolve your domain, was a miss before the query ever ran. That is why this page comes before retrieval quality: you tune retrieval after the inputs are right, not instead of getting them right. When retrieval underperforms, walk it back to the source — are the chunks the right size and shape, do they carry their context, is the embedding model a fit for this corpus — before you touch top-k, reranking, or the prompt. The clever query is the last lever, not the first.
Related
- What is grounding — why an agent answers from what you hand it, not what it memorized
- Retrieval quality — the layer these inputs feed: top-k, reranking, and measuring whether retrieval works
- Grounding gotchas — the failure modes, including empty and wrong-chunk retrieval
- Tools and actions — retrieval as a tool an agent calls, and keeping its blast radius small
- Grounding Style Guide — the bar a grounded system clears before it ships
- Debugging grounding — tracing a bad answer to its retrieval
- AI Engineering principles — ground before you generate (2), the model is the easy part (4), cost and latency are features (6), context is a budget (10)
Reference: