Skip to main content

Prompt caching: stop re-paying for the same prefix

The large static preamble — system prompt, tool defs, examples, a long grounding doc — re-sent on every call gets re-processed every time, and at production volume that is a bill nobody approved. Prompt caching fixes it: mark a stable prefix with cache_control and later calls reuse the model's processing instead of re-paying for it. The mechanic — caching runs in order over tools, then system, then messages, up to and including the marked block, with up to 4 breakpoints, automatic or explicit. The economics stated honestly: a cache read costs about 0.1x a base input token while a cache write costs more than one, so caching wins on a reused prefix and can lose on a one-shot call. And the design rule that makes it pay — stable content first, the variable input last.

Reference·Last updated 2026-06-03·Drafted by Lira · Edited by German Medina

A long system prompt, a fat set of tool definitions, a block of fixed examples, a static grounding document — sent once in a demo, that preamble is free. Sent on every call at production volume, it is the bill: the model re-processes the identical leading tokens on every single request, and at a million calls a month the prompt that was fine in the demo is a cost line nobody approved. That is gotcha 10 in prompting gotchas, and prompt caching is the fix for it. This page is the mechanic — what caching actually does, the economics stated as arithmetic rather than a free win, and the structural rule that decides whether the cache pays off at all (principle 6 — cost and latency are features).

This is a reference for the Claude API surface. The same idea exists across model platforms under different names; the discipline below transfers, the exact knobs and ratios are Anthropic's.

What prompt caching is

You mark a stable prefix of the prompt with cache_control. The model caches its processing of everything up to that breakpoint, and any later call that shares the same prefix reuses the cache instead of re-processing those tokens from scratch. The leading content — the part that does not change from call to call — is processed once and then read back, rather than paid for again on every request.

The thing being cached is the model's processing of the prefix, not just the text. That is why the saving is real and not bookkeeping: re-sending ten thousand identical tokens means the model would otherwise redo the work of reading them every time. Caching means it did that work once, and subsequent calls that open with the same ten thousand tokens skip straight past them.

The mechanic

Caching covers the prompt in order — tools first, then system, then messages — up to and including the block you mark with cache_control. Everything before the breakpoint is cached as one contiguous prefix; everything after it is processed fresh on every call. You can set up to 4 cache breakpoints, which lets you cache more than one stable segment (for example, the tool definitions and a long static document as separate cached regions).

There are two ways to place the breakpoint:

  • Automatic caching — the breakpoint moves forward as a conversation grows, so the stable, accumulated history stays cached without you re-marking it on every turn. Useful for multi-turn conversations where the prefix keeps extending.
  • Explicit breakpoints — you place cache_control on specific blocks for fine control over exactly what gets cached and where the cached region ends. Useful when you know which segment is stable and want to pin the boundary yourself.

A cached entry has a lifetime: a minimum of 5 minutes (standard) or 1 hour (extended), and using the cache refreshes that window. Let it go idle past the lifetime and the entry expires; the next call re-processes the prefix and writes a fresh cache. So caching pays when calls that share a prefix arrive close enough together to keep the entry warm — a busy endpoint, not one request an hour.

The economics — read it as arithmetic

Caching is not a free win, and treating it as one is how teams end up paying more. The ratios decide it:

  • A cache read costs about 0.1x the base input token price — roughly a tenth, a saving of about 90 percent on the cached portion. This is the win, and it is large.
  • A cache write costs more than a normal input token — about 1.25x for the 5-minute cache, about 2x for the 1-hour cache. Writing the cache is a premium over just sending the tokens once.

So the trade is plain: you pay a premium once to write the prefix into the cache, and then pay about a tenth each time you read it back. Caching wins when a stable prefix is reused enough to amortize that write — the read savings have to add up past the write premium. On a one-shot call you never hit again, caching loses: you paid the write premium and never collected the read discount. The decision is arithmetic — how many times will this exact prefix be reused before it goes cold — not a switch you flip because caching sounds like savings. (This is principle 6 made concrete: cost is a number you model against real volume, not an afterthought.)

Cache-friendly structure: the design rule

The mechanic above only pays if the prompt is shaped for it, and that shape is one rule: put the stable content first and the variable content last. Caching covers a contiguous prefix, so the longest unchanging run has to be at the front for the cache to cover it.

  • Stable, first: the system prompt, the tool definitions, fixed examples, a static knowledge or grounding block — everything that is identical across calls. This is the prefix you mark and reuse.
  • Variable, last: the user's input, the per-request data, anything that changes call to call. This sits after the breakpoint and is processed fresh every time, as it must be.

Put a single per-request token early — a timestamp, a user id, the question itself — and you have broken the prefix: nothing after that point is shared across calls, so nothing after it can be cached. The cache only sees an identical leading run, so the engineering is to make that run as long as possible by ordering everything stable ahead of everything variable.

[ tools ]            ← stable: same every call
[ system prompt ]    ← stable
[ fixed examples ]   ← stable
[ static doc ]       ← stable        ──► cache_control breakpoint here
─────────────────────────────────────────  (cached prefix ends; fresh below)
[ user's question ]  ← variable: changes every call

This is the structural payoff of treating the prompt as engineered context rather than a string you assemble in whatever order is convenient. The same instinct that orders a prompt for salience (see system prompts and instructions) also orders it for the cache: stable scaffolding up top, the one thing that changes at the bottom. You get the salience and the caching from the same decision.

The throughline

Prompt caching is the answer to a specific failure — a large static preamble re-paid on every call — and it is an arithmetic decision, not a free switch. The read discount is real (about a tenth of the base price) and the write premium is real (more than one token), so caching pays when a stable prefix is reused enough to clear the write, and costs you when it isn't. What makes it pay at all is structure: stable content first so the cacheable prefix is long, the variable input last so the cache is never broken by a per-request token near the top. Get the order right and a system that was correct-but-uneconomical becomes correct-and-cheap; ignore it and the cache has nothing contiguous to hold onto. Cost is a feature you design in, the same way you design grounding and evals — model it against your real volume before you build, not after the bill arrives.

Related

Reference: