Context windows: the token budget every call spends
The context window is every token the model can reference for a call — including its own response — and it is two things at once: finite and ordered. This is the operational depth under context engineering: the window is one budget that the system prompt, instructions, examples, retrieved facts, history, tool definitions, and user input all draw from, so more is not better; and it is ordered, so a key fact stranded in the middle gets under-weighted. How to manage a window that grows, why a million-token ceiling is still not a license to fill it, and the levers — summarize, drop, retrieve on demand, server-side compaction — that keep it honest. Principle 10: context is a budget, not a bucket.
The context window is everything the model can reference for one call — the system prompt, the instructions, the examples, the retrieved facts, the conversation so far, the tool definitions, the user's input, and the model's own response as it writes it. It is the model's entire field of view for that call, and it has two properties that govern everything you do with it: it is finite, and it is ordered. Finite means there is a hard ceiling on how many tokens fit, and every part above draws from that one ceiling. Ordered means position carries weight — the same fact lands differently depending on where in the window it sits. What is context engineering frames the window as the unit of control; this page is the operational layer beneath it — how the budget is spent, how it is managed as it grows, and why a big ceiling changes less than it looks.
This is a reference, not a recipe. The right numbers — how much to keep, when to compact, how big a window to provision — depend on your task and your model. What does not change is the shape of the trade-off: every token you add to the window costs something, in space, in money, in latency, and in the model's attention. The job is spending that budget where it earns its place (principle 10 — context is a budget, not a bucket).
The window is a budget, not a bucket
Picture the window as a fixed amount of room. Every part of the call moves into it, and they all share the same floor space:
- The system prompt — the standing role, task, and rules. Set once, holds the whole call. The depth of writing it is system prompts and instructions.
- Instructions — the specific ask for this turn, in this format, under these constraints.
- Few-shot examples — worked input-output pairs that show the shape you want instead of describing it.
- Retrieved context — facts pulled in at query time so the model answers from your data, not its training. Its own discipline is what is grounding.
- Conversation history / state — what came before, plus any scratch state carried forward. The part that quietly grows.
- Tool definitions — the typed functions the model may call, each with a name and description it reads before it can choose.
- The user input — the actual question for this turn.
- The model's response — the output counts too. The window has to hold room for the answer it is about to write, so a near-full input window leaves no room to respond.
The reflex is to treat the window as a bucket: pour in everything that might help and let the model sort it out. It does not work that way. Past a point, more context makes the answer worse — the signal the model needs drowns in the context you added "just in case," and the model attends to the noise. This is context rot, and it is the same trap grounding names as over-retrieval one layer down: stuffing the top fifty chunks degrades reasoning exactly as stuffing the history does (see retrieval quality). The fix is the same on both sides — curate, do not accumulate. Spend the budget on the parts that earn it, and cut the parts that don't, because every token competes with every other for the model's attention.
The window is ordered: lost in the middle
The second property is the one that surprises people. The window is not a flat bag where every token weighs the same. Position carries weight, and the pattern is consistent: models attend most reliably to the start and the end of the window, and least reliably to the middle. A critical instruction or a key fact stranded in the middle of a long window gets under-weighted — present in the tokens, but effectively faint. This is the "lost in the middle" effect, and it means where you put something is part of how well it works.
The practical move is to put what matters where the model weights it. Anthropic's long-context guidance is concrete about this: place the bulk material — the long document, the retrieved corpus, the history — toward the top, and put the actual instruction and the question near the end, closest to where the model starts writing. The same facts arranged two ways produce two different answers; the arrangement is a lever, not a formatting detail.
Managing a window that grows
A single call is easy to fit. The problem is the call that is part of a conversation, or an agent loop that runs for many turns — the history grows every turn, and left alone it grows until it crowds out the instruction that matters or hits the ceiling outright. A window that grows unbounded is a system that gets slower, costlier, and less reliable the longer it runs. So a growing window has to be managed, and there are three honest moves:
- Summarize — replace a long stretch of history with a short summary of it. You keep the meaning and pay a fraction of the tokens. The cost is that summarizing is lossy: detail the summary dropped is detail the model no longer has.
- Drop — discard the turns that no longer matter. The oldest exchanges in a long conversation are often dead weight; cutting them is the cheapest reclaim there is, as long as you cut what is genuinely spent and not what the current turn still leans on.
- Retrieve on demand — keep the bulk out of the window and pull only the relevant piece back in when a turn needs it, rather than carrying everything forward just in case. This is grounding applied to the conversation's own history.
For a long-running or agentic loop, the managed approach is server-side compaction — the platform summarizes and prunes the accumulating context for you as the loop proceeds, so a multi-turn agent does not blow its window on turn forty. Treat it as the primary strategy for loops that run long: it is the same summarize-and-drop discipline above, automated and applied continuously, so you are not hand-managing the history of every long conversation.
A bigger ceiling is not a license to fill it
The ceilings have grown, and they are genuinely large now. Claude Sonnet 4 and Sonnet 4.5 support a one-million-token context window — available in beta, for higher usage tiers — which is room for a substantial corpus, a long codebase, or a deep conversation in a single call. It is tempting to read a ceiling that big as permission to stop thinking about the budget. It is not.
Principle 10 still holds at a million tokens, because the costs that make the budget real do not disappear when the ceiling rises — they scale with what you put in. A larger window you actually fill costs more per call (you pay for the tokens), runs slower (the model processes more before it answers), and still suffers context rot and lost-in-the-middle (a relevant fact competes with more distractors, and the middle is bigger). A big ceiling buys you the option to include more when more genuinely earns its place; it does not change the arithmetic that more usually doesn't. Provision the window the task needs, and fill it with what the task needs — not with everything the ceiling allows.
One thing that does help is that the model can now watch its own budget. Newer models — Sonnet 4.6, Sonnet 4.5, and Haiku 4.5 — have context awareness: they can track the tokens remaining through a conversation and pace accordingly, rather than running blind toward the ceiling. It is a real aid for long agentic loops, where a model that knows it is running low can wrap up or compact instead of hitting a wall. It does not replace your budgeting — it is the model holding up its end of the same discipline you are managing from the outside.
The throughline
The window is finite and it is ordered, and almost everything operational about prompting follows from those two facts. Finite is why you curate instead of accumulate, manage history instead of letting it grow, and read a million-token ceiling as an option rather than an instruction — context is a budget, not a bucket (principle 10). Ordered is why you put the instruction where the model weights it, near the question, instead of trusting a fact buried mid-window to carry. Get both right and the window works for the model; get either wrong and you have spent tokens, latency, and money to make the answer worse. This is the depth under what is context engineering — that page names the window as the unit; this one is how you spend it. And when the part you keep is large and stable across calls, the lever that makes re-sending it affordable is prompt caching, where the budget meets the bill.
Related
- What is context engineering — the frame this page operationalizes: the window is the unit of control
- System prompts and instructions — the highest-leverage tokens in the window, and where to place them
- Prompt caching — the cost lever for a large, stable context you send repeatedly
- Structured output — constraining the response, which spends from the same budget
- Prompting gotchas — the context-shaped failures, including the buried instruction and the bloated history
- Retrieval quality — over-retrieval is context rot on the grounding side: the same budget, one layer down
- AI Engineering principles — context is a budget not a bucket (10), the model is the easy part (4), cost and latency are features (6)
Reference: