
Retrieval quality: measuring and improving what the model gets

Retrieval quality is a separate thing from answer quality, and you have to measure it on its own. This page splits the two — how to score whether the right chunk came back at all and how high it ranked, with a small retrieval eval set built query-to-chunk — then walks the levers that improve it: hybrid search, re-ranking, metadata filtering, query rewriting, and chunk/k tuning, each with where it fits and what it costs. The throughline: a grounded system is only as good as its retrieval, and retrieval is only trustworthy once it's measured.

Reference · Last updated 2026-06-03 · Drafted by Lira · Edited by German Medina

When a grounded answer is wrong, there are two suspects, and most teams interrogate the wrong one first. They edit the prompt. But the answer can be wrong because the model reasoned badly or because retrieval handed it the wrong chunk to reason over — and those are different failures with different fixes. A model reasoning faithfully over the wrong three chunks will defend a wrong answer beautifully (principle 2). So before you touch the prompt, you check what retrieval actually surfaced. That check only exists if you measure retrieval as its own thing, separate from the answer it feeds.

This is the core discipline of grounding: retrieval quality is separate from answer quality, and you must measure it on its own. This page is in two halves. First, how to measure retrieval — the metrics that tell you whether the right chunk came back and how high it ranked, and the small eval set that makes those metrics real. Then the levers that improve it, each with where it fits and what it costs, in the shape the toolkit is meant to be read: complementary instruments you compose, not a camp to pick (principle 7).

Measure retrieval separately

Answer quality is downstream of retrieval quality, so a single end-to-end score hides which half is failing. You can have a flawless prompt and a useless answer because retrieval missed; you can have a mediocre prompt that still answers well because retrieval was clean. Score them apart, and a wrong answer immediately splits into a question with a clear next move: did the right chunk come back at all, and did it come back high enough for the model to use.

Two metrics carry most of that.

  • Recall: did the chunk that should answer the query get retrieved at all, anywhere in the returned set? This is the first failure to rule out, because nothing downstream can recover from a chunk that was never surfaced. If recall is the problem, no amount of prompt work helps; the right fact never reached the model.
  • Precision and position (precision@k, plus the rank of the right chunk): of what came back, how much is relevant, and where did the right chunk land? Precision@k answers the first question; the rank of the correct chunk answers the second. A correct chunk retrieved at position 9 of a 10-chunk window is barely better than a miss: eight distractors rank above it, and in a long context it can be ignored entirely. Position matters because the model weights what it sees unevenly; a right chunk buried low is a near-miss, not a hit.
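As a minimal sketch of the two metrics above, the functions below score a single query. The chunk IDs and function names are hypothetical, and the example reuses the "position 9 of 10" case to show that recall alone can look perfect while position is poor.

```python
from typing import Optional, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def first_relevant_rank(retrieved: Sequence[str], relevant: Set[str]) -> Optional[int]:
    """1-based position of the first relevant chunk, or None if it never came back."""
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return position
    return None


# Hypothetical example: the right chunk ("c7") comes back, but at position 9 of 10.
retrieved = ["c1", "c2", "c3", "c4", "c5", "c6", "c8", "c9", "c7", "c10"]
relevant = {"c7"}

print(recall_at_k(retrieved, relevant, k=10))     # 1.0: it was retrieved
print(precision_at_k(retrieved, relevant, k=10))  # 0.1: one relevant chunk in ten
print(first_relevant_rank(retrieved, relevant))   # 9: buried low
```

Reading the three numbers together is the point: recall says the chunk arrived, while precision and rank say it arrived surrounded by noise.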

The instrument that makes both measurable is a retrieval eval set: a small fixed list of query → the chunk(s) that should come back. You write the queries from real questions — the first handful that bit you in testing are worth more than any synthetic set, because they encode how your corpus actually fails — and you label, by hand, which chunk is the correct answer for each. Then you run retrieval over the set and score recall and position automatically. This is principle 3 applied to retrieval specifically: if you can't evaluate retrieval, you can't improve it, you can only guess. And it runs before you tune the prompt (principle 2) — score the retrieval first, because a prompt tuned against broken retrieval just memorizes the breakage.
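A retrieval eval set can be as small as a dictionary. The sketch below is one possible shape, not a prescribed one: the queries, chunk IDs, and the `fake_retrieve` stand-in are all hypothetical, and in a real system `retrieve` would call your actual index. Here "recall" is scored per query as "did any expected chunk come back in the top k".

```python
from typing import Callable, Dict, List, Optional

# Hypothetical hand-labeled eval set: query -> chunk IDs that should come back.
EVAL_SET: Dict[str, List[str]] = {
    "how do I reset my password": ["auth-03"],
    "what does error E-4012 mean": ["errors-17"],
    "refund window for annual plans": ["billing-08"],
}


def evaluate(retrieve: Callable[[str, int], List[str]], k: int = 5) -> dict:
    """Run every eval query through `retrieve` and score recall and position."""
    rows = []
    for query, expected in EVAL_SET.items():
        results = retrieve(query, k)
        ranks = [results.index(c) + 1 for c in expected if c in results]
        best_rank: Optional[int] = min(ranks) if ranks else None
        rows.append({"query": query, "rank": best_rank})
    found = [r["rank"] for r in rows if r["rank"] is not None]
    return {
        # Share of queries where an expected chunk was retrieved at all.
        "recall_at_k": len(found) / len(rows),
        # Average of 1/rank, counting a miss as 0; higher means ranked nearer the top.
        "mean_reciprocal_rank": sum(1 / r for r in found) / len(rows),
        # The queries to look at first.
        "misses": [r["query"] for r in rows if r["rank"] is None],
    }


def fake_retrieve(query: str, k: int) -> List[str]:
    """Stand-in retriever with canned results, so the sketch runs on its own."""
    canned = {
        "how do I reset my password": ["auth-03", "auth-01", "faq-02"],
        "what does error E-4012 mean": ["faq-02", "errors-02", "errors-17"],
        "refund window for annual plans": ["billing-01", "faq-09", "legal-04"],
    }
    return canned.get(query, [])[:k]


report = evaluate(fake_retrieve, k=5)
print(report)
```

With the canned results, two of three queries find their chunk (one at rank 1, one at rank 3) and the refund query misses outright, so the report separates a recall failure from a position problem without involving the prompt at all.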

Improve retrieval

Once you can measure it, the levers below each move a specific number — recall, precision, or position. None is free, and reaching for the most elaborate one first is the slide-deck instinct; the production instinct is to find which metric is failing and pull the cheapest lever that moves it. Each lever here comes with where it fits and what it costs.

Hybrid search

Run semantic (embedding) search and keyword search — BM25 — together, and merge the results. Semantic search matches on meaning and misses exact strings; a product SKU, an error code, a person's surname, a rare acronym are the things a pure embedding search quietly drops because they're semantically unremarkable. BM25 catches the literal term. Hybrid gets both.

When it fits: any corpus with exact terms that have to match literally — codes, IDs, names, jargon, product numbers — which is most real corpora. If your recall misses cluster on queries containing a specific token, this is usually the first lever to pull.

Trade-off: you now run and maintain two retrieval paths and a merge step (commonly a fusion that interleaves both rankings), so there's more to build and tune than a single semantic index. The cost is engineering effort and a little latency, not much runtime money, and when exact terms are in play it's typically the first lever to try because it tends to move the number most.
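One common way to do the merge step is reciprocal rank fusion (RRF), shown below as a sketch. The page does not say which fusion method to use, so treat RRF as an assumption; the constant `c = 60` is a widely used default, and the chunk IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Merge ranked lists: each chunk scores sum(1 / (c + rank)) across lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (c + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results for the query "error E-4012 on checkout".
semantic_hits = ["checkout-overview", "payments-faq", "errors-17"]
keyword_hits = ["errors-17", "changelog-44", "payments-faq"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

In this example the chunk containing the literal error code ranks third semantically but first on keywords, and fusion lifts it to the top because it appears in both lists. RRF needs only ranks, not comparable scores, which is why it is a convenient first merge to try.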

Re-ranking

A second pass: retrieve a wider top-N candidate set cheaply, then run a re-ranker — a model that scores each candidate against the query for relevance — and reorder, keeping only the best few. The first pass optimizes for recall (cast wide, don't miss); the re-ranker optimizes for precision and position (put the right chunk on top). This is the lever that most directly fixes a "right chunk came back, but ranked too low" failure.

When it fits: when recall is fine but position is not — the correct chunk is in your top-N but not near the top, so the model under-weights it. It's high-value precisely because position is where many grounded systems quietly lose: the fact was there, just not where the model would use it.

Trade-off: the re-ranker is another model call on every query, so it adds latency and cost to each retrieval. You spend that on purpose for the precision gain. Anthropic's Contextual Retrieval work reports the size of the prize: combining contextual embeddings with contextual BM25 cut the top-20-chunk retrieval failure rate by 49%, and adding re-ranking on top took the cut to 67%. That is substantially fewer failed retrievals, paid for in the extra pass.
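The two-pass shape can be sketched as follows. Both `toy_first_pass` and `toy_score` are hypothetical stand-ins: a real system would use its own index for the first pass and a re-ranker model for scoring, and the word-overlap scorer here exists only so the example runs by itself.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    first_pass: Callable[[str, int], List[str]],
    score: Callable[[str, str], float],
    wide_n: int = 50,
    keep: int = 5,
) -> List[Tuple[str, float]]:
    """Fetch wide_n candidates cheaply, score each against the query, keep the best."""
    candidates = first_pass(query, wide_n)
    scored = [(text, score(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def toy_first_pass(query: str, n: int) -> List[str]:
    """Stand-in for a cheap vector or keyword search; returns chunk texts."""
    corpus = [
        "Annual plans renew automatically each year.",
        "Contact support to change your billing email.",
        "Refunds for annual plans are available within 30 days of purchase.",
    ]
    return corpus[:n]


def toy_score(query: str, text: str) -> float:
    """Stand-in for a re-ranker model: fraction of query words found in the chunk."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().replace(".", "").split())
    return len(query_words & text_words) / len(query_words)


top = retrieve_then_rerank("refunds for annual plans", toy_first_pass, toy_score, keep=2)
print(top[0][0])
```

The first pass returned the refund chunk last of three; the second pass moves it to position one and drops the weakest candidate. The per-candidate `score` call is where the extra latency and cost come from, so `wide_n` is worth tuning against the eval set rather than setting generously.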

Metadata filtering

Before semantic search runs, narrow the candidate pool with structured filters — source, date, document type, language, and especially permission. Don't search the whole corpus and hope the right chunk wins on similarity; constrain to the slice that's even eligible, then search within it. This is also where access control belongs: a chunk the current user isn't allowed to see should be filtered out before retrieval, not retrieved and then hopefully suppressed.

When it fits: any corpus with structure worth exploiting, such as multiple sources of differing trust, time-sensitive content where stale documents are wrong rather than just old, or per-user permissions. It cuts the search space, which improves precision (fewer irrelevant candidates to rank against) and often lowers latency.

Trade-off: the filter is only as good as the metadata, so the cost is upstream — chunks have to be tagged correctly at ingestion, and a wrong or missing tag silently filters the right chunk out. A filter that's too aggressive becomes a recall problem you'll only catch with the eval set. Filtering on permission is non-negotiable regardless; the rest you add where the corpus has structure to lean on.
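A sketch of filtering before search, under assumed field names: the `Chunk` dataclass, its `allowed_groups` field, and the `eligible` helper are all hypothetical. The permission check is applied unconditionally; the source and date filters are the optional ones.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source: str
    updated: date
    allowed_groups: FrozenSet[str]


def eligible(
    chunks: List[Chunk],
    user_groups: Set[str],
    sources: Set[str],
    not_before: date,
) -> List[Chunk]:
    """Narrow the pool to chunks this user may see, from the wanted sources and dates."""
    return [
        c for c in chunks
        if c.allowed_groups & user_groups  # permission: always applied
        and c.source in sources            # structural filter
        and c.updated >= not_before        # freshness filter
    ]


chunks = [
    Chunk("hr-12", "handbook", date(2026, 1, 10), frozenset({"hr"})),
    Chunk("eng-03", "wiki", date(2026, 3, 2), frozenset({"eng", "hr"})),
    Chunk("eng-01", "wiki", date(2023, 5, 1), frozenset({"eng"})),
]

pool = eligible(chunks, user_groups={"eng"}, sources={"wiki"}, not_before=date(2025, 1, 1))
print([c.chunk_id for c in pool])  # semantic search would then run over this pool only
```

Only `eng-03` survives: `hr-12` fails the permission and source checks, and `eng-01` is too old. Many vector stores can apply such filters inside the query itself; this in-memory version only illustrates the ordering of filter first, search second.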

Query rewriting and expansion

Fix the query before it hits the index. Real user queries are vague, under-specified, full of pronouns ("what about the second one?"), or phrased nothing like the documents that answer them. A rewrite step — usually a small fast model — turns the raw query into something retrieval can actually match: resolving references from the conversation, expanding an acronym, adding synonyms, or splitting a compound question into parts you retrieve for separately.

When it fits: conversational systems where queries carry context from earlier turns, and any corpus where users phrase questions in language far from how the source is written. If your recall misses cluster on short or vague queries rather than on the corpus itself, the query is the thing to fix, not the index.

Trade-off: another model call before retrieval — latency and cost — and a rewrite can drift from what the user meant, turning a vague-but-answerable query into a confident-but-wrong one. Keep the rewriter cheap and narrow, and put it in the same eval set: a rewrite step you can't score is one more place the system fails silently.
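To make the rewrite step concrete, here is a deliberately rule-based sketch of two of the operations named above: acronym expansion and splitting a compound question. The acronym table and function names are hypothetical, and in practice a small model usually does this work; the rules below are a stand-in that keeps the example runnable and easy to score.

```python
import re
from typing import Dict, List

# Hypothetical acronym table for illustration only.
ACRONYMS: Dict[str, str] = {
    "sso": "single sign-on",
    "mfa": "multi-factor authentication",
}


def expand_acronyms(query: str) -> str:
    """Append the long form after each known acronym so either spelling can match."""
    def repl(match: "re.Match[str]") -> str:
        word = match.group(0)
        long_form = ACRONYMS.get(word.lower())
        return f"{word} ({long_form})" if long_form else word

    return re.sub(r"[A-Za-z]+", repl, query)


def split_compound(query: str) -> List[str]:
    """Naively split 'X and Y' questions into parts to retrieve for separately."""
    parts = [p.strip(" ?") for p in re.split(r"\band\b", query, flags=re.IGNORECASE)]
    return [p for p in parts if p]


def rewrite(query: str) -> List[str]:
    """Return one rewritten sub-query per part of the original question."""
    return [expand_acronyms(part) for part in split_compound(query)]


print(rewrite("How do I enable SSO and reset MFA?"))
# ['How do I enable SSO (single sign-on)', 'reset MFA (multi-factor authentication)']
```

Splitting on "and" is crude and would wrongly break a phrase like "terms and conditions", which is a small instance of the drift risk described above. Whatever does the rewriting, run the eval set once with raw queries and once with rewritten ones and compare the scores.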

Chunk-size and k tuning

The two knobs that sit underneath all of the above: how big each chunk is, and how many (k) you retrieve. Chunk size is set when you build the index, k when you make the retrieval call, and both are covered in depth in chunking and embeddings. Here the point is that they're the baseline you tune first, because every other lever operates on top of whatever these produce. Chunks too large bury the answer in noise; too small and the answer is split across chunks that don't all get retrieved. k too low risks missing the right chunk; k too high drags distractors into the window.

When it fits: always, as the first thing to get roughly right before adding machinery. If recall is failing, try a larger k or different chunk boundaries before you build a re-ranker; the cheap knob often moves the number the expensive lever was meant to move.

Trade-off: the k knob is where the temptation to over-retrieve lives, and it backfires. Raising k "to be safe" floods the context with marginally relevant chunks, and past a point that degrades the answer — the signal the model needs drowns in the chunks you added just in case. That's context rot (principle 10): context is a budget, not a bucket, and precision matters as much as recall. The right k is the smallest one that clears your recall bar on the eval set, not the largest one you can afford.
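The "smallest k that clears the bar" rule can be sketched as a sweep over candidate values. The function name, the candidate list, the 0.9 bar, and the canned rankings are all assumptions for illustration; the eval set has the same query-to-chunk shape described earlier on this page.

```python
from typing import Callable, Dict, List, Optional, Sequence


def smallest_k_clearing_bar(
    eval_set: Dict[str, List[str]],
    retrieve: Callable[[str, int], List[str]],
    candidate_ks: Sequence[int] = (3, 5, 10, 20),
    recall_bar: float = 0.9,
) -> Optional[int]:
    """Return the smallest k whose hit rate on the eval set meets recall_bar."""
    for k in sorted(candidate_ks):
        hits = sum(
            1 for query, expected in eval_set.items()
            if set(expected) & set(retrieve(query, k))
        )
        if hits / len(eval_set) >= recall_bar:
            return k
    return None  # No candidate clears the bar, so k is not the failing knob.


# Hypothetical canned rankings: the right chunk sits at rank 1, 4, and 2.
ranked = {
    "q1": ["a", "x", "y", "z", "w"],
    "q2": ["x", "y", "z", "b", "w"],
    "q3": ["x", "c", "y", "z", "w"],
}
eval_set = {"q1": ["a"], "q2": ["b"], "q3": ["c"]}


def canned_retrieve(query: str, k: int) -> List[str]:
    """Stand-in retriever that returns the top k of a fixed ranking."""
    return ranked[query][:k]


best_k = smallest_k_clearing_bar(eval_set, canned_retrieve, candidate_ks=(3, 5, 10))
print(best_k)  # 5: k=3 misses q2, and k=10 would add nothing but distractors
```

Stopping at the first k that clears the bar encodes the budget view directly. A `None` result is also informative: if no tried k reaches the bar, the fix lies in chunking, hybrid search, or the query, not in retrieving more.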

The throughline

A grounded system is only as good as its retrieval, and retrieval is only trustworthy if it's measured. Those two sentences are the whole page. The measurement half (principle 3) is what turns "the answer's wrong, let's edit the prompt" into "recall is fine, position is bad, add a re-ranker" — a diagnosis instead of a guess. The improvement half is a set of levers you pull against a number, not a stack you assemble for completeness. And the trap underneath all of it is over-retrieval: stuffing the window to feel safe is the same mistake as stuffing the prompt, and it costs you the same way (principle 10). Precision earns its place next to recall — a grounded answer needs the right chunk and needs it where the model will actually use it.

Related

  • What is grounding — why retrieval is the thing that makes an answer true, and where it sits in a grounded system
  • Chunking and embeddings — the index these levers operate on: chunk boundaries, embedding choice, and the k knob
  • Grounding gotchas — the failure modes retrieval quality trades against, and the costs of getting them wrong
  • Debugging agents — the same measure-the-failing-layer-first habit, applied to a full agent run
  • Agent gotchas — grounding gaps as one of the ways an agent dies after the demo
  • Grounding Style Guide — the bar a grounded system clears before it ships
  • Debugging grounding — tracing a bad answer to its retrieval
  • AI Engineering principles — ground before you generate (2), if you can't evaluate it you can't ship it (3), context is a budget (10)
