Cost and latency: the levers, in order of force

A demo serves one request and the cost is rounding error. The same system at a million calls a month turns cost and latency into numbers someone in finance and someone on call both have to answer for — and by then the cheap moves are harder to retrofit than they were to design in. This page is the lever board for the two, ordered by how much force each one has. None of it is exotic; all of it is the difference between an AI feature that pencils out and one that quietly doesn't (principle 6 — cost and latency are features, designed in, not discovered on the invoice).

This is a reference for the Claude API surface — the off-platform levers, where you control the model, the request, and the budget directly. Agentforce abstracts some of these (it picks models and manages context for you), but the discipline below is identical: the cheapest, fastest request is the one shaped to be cheap and fast on purpose. Where a lever has its own mechanic page, this page is the production framing and points there for the how.

Pick the model first — it's the biggest lever

Before caching, before batching, before any request-level tuning: which model you call is the single largest lever on both cost and latency. A smaller model is faster and cheaper per token by design; a larger one reasons harder and costs more. Get this decision right and the rest is fine-tuning. Get it wrong — reach for the most capable model on a task a smaller one handles — and you overpay on every call, forever, for reasoning you didn't need.

The three tiers, and what each is for:

Model	Relative speed	Relative cost	What it fits
Haiku	Fastest	Lowest	Simple, high-volume work — classification, extraction, short summaries, routing. The default for anything you run a lot where the task is well-defined.
Sonnet	Middle	Middle	Most production workloads — the balanced default when a task needs real capability but not the hardest reasoning. Where most agents and assistants land.
Opus	Slowest	Highest	The hardest reasoning — multi-step analysis, complex code, the problems where a weaker answer is worse than a slower one. Reach for it where the task earns it, not by reflex.

The rule Anthropic states plainly: Haiku for simple tasks, Sonnet for most production workloads, Opus for the most complex reasoning. The trap is defaulting up "to be safe" — the safe-feeling choice is the expensive one, multiplied by your call volume. Match the model to the task, measure, and move down a tier wherever the smaller model still passes your evals (which is why evaluation comes before this — you can only downgrade a model safely if you can prove the smaller one still works).

Prompt caching — stop re-paying for the static prefix

If your prompt carries a large stable preamble — a system prompt, tool definitions, fixed examples, a long grounding document — and you send it on every call, the model re-processes those identical leading tokens every single time. Prompt caching marks that stable prefix once and reuses the processing on later calls: up to 90 percent off the cached portion and up to 80 percent lower latency on it, because the model reads the prefix back instead of re-reading it from scratch.

As a production lever the framing is: caching is the answer to a large, reused, static context, and it pays when calls that share that prefix arrive close enough together to keep the cache warm. The mechanic — cache_control, the breakpoint, the read-vs-write economics that decide whether it pays on a given workload — lives in prompt caching. Here it's enough to know it's the first request-level lever to reach for once the model is chosen, and that it stacks with the others.

Batch processing — about half off, when you don't need the answer now

A lot of AI work doesn't need an immediate response: overnight enrichment, bulk classification, a large eval run, generating summaries across a dataset. For that work, the Message Batches API processes requests asynchronously at about a 50 percent discount on both input and output tokens, with most batches finishing in under an hour (and a longer ceiling for the rest).

The decision is about latency tolerance, not capability — it's the same models, the same quality, half the price, in exchange for "soon" instead of "now." So the production question is simply: does a human or a downstream system need this response in real time? If not, batch it. The slip teams make is running everything through the live, synchronous path out of habit and paying full freight on work that could have waited an hour.

`max_tokens` — a hard cap and a runaway guard

max_tokens sets a hard limit on the length of the generated response. It does two jobs at once. First, output tokens cost more than input tokens, so capping output is a direct cost control on the most expensive part of a response. Second, it's a runaway guard: it bounds the worst case so a prompt that goes off the rails — a model that won't stop, a loop that keeps generating — can't run up an unbounded bill or hang a request indefinitely.

It's a blunt instrument: when a response hits the cap it's cut off where it lands, possibly mid-sentence, so set it to the real ceiling your output needs rather than a number that truncates good answers. But every production call should carry one. An uncapped max_tokens is an open-ended liability on both the cost and the latency of a single request.

Streaming — same cost, far less perceived latency

Streaming doesn't change what you pay or how long the full response actually takes — it changes what the user feels. With streaming on, the model sends tokens as it generates them, so the user sees the answer begin almost immediately instead of staring at a spinner until the whole response lands. The total generation time is the same; the time-to-first-token is what drops, and for anything a human reads in real time that's the number that governs whether the thing feels responsive or broken.

So streaming is the lever for perceived latency, and it's nearly free to flip on for any user-facing, conversational surface. It does nothing for a batch job or a backend call no human watches — there, total latency is the only latency that matters. Don't confuse the two: streaming makes a slow response feel fast, it does not make it cheap or actually faster.

Token budgets and the Usage and Cost API — you can't hold a ceiling you don't measure

Every lever above is a guess until you measure the result. The Usage and Cost API is the cost-management surface: it gives programmatic, granular access to your organization's token usage and spend — broken down by model, workspace, and service tier, with cost reported in USD — so you can reconcile against the bill, attribute spend, and set alerts on spending thresholds instead of finding out at month-end.

The discipline this enables is the one that closes the loop: set a token budget — an explicit ceiling per feature, per tenant, per run — measure actual consumption against it, and alert when you approach the line. This is the direct answer to the "no cost ceiling" failure (production gotchas covers it as a named trap): a system shipped without a budget and without alerting can 10x its bill on a traffic spike or a prompt regression and the first signal anyone gets is the invoice. A measured ceiling turns cost from a surprise into a number you steer.

The throughline

Cost and latency are not things that happen to a production AI — they're things you design, in roughly this order of force: pick the right model (the biggest single lever on both), cache the stable prefix you re-send, batch the work that can wait, cap output with max_tokens, stream where a human is watching, and measure all of it against a token budget with alerting so the ceiling is real. Each lever stacks; none of them is exotic. The team that treats these as features — modeled against real volume before launch — ships an AI that pencils out. The team that treats them as an afterthought ships one that works in the demo and gets switched off the first month the bill arrives. The off-platform Claude API gives you every lever directly; Agentforce hands you some of them pre-pulled — either way the question is the same, asked before you ship rather than after: what does this cost at volume, and how fast does it feel?

Prompt caching — the mechanic behind the caching lever: cache_control, breakpoints, and the read-vs-write economics
Context windows — what fills the window and why the static part is the part worth caching
What is evaluation — why you can only move down a model tier safely if you can prove the smaller model still passes
Tracing and monitoring — the observability side of the same measure-it discipline
AI Engineering principles — cost and latency are features (6), context is a budget (10)
Production Style Guide — the pre-ship gate a cost ceiling and latency budget are rows of
Deploying to production — where the model-tier and rollout choices get made

Reference: