Prompt Caching
Anthropic's prompt cache lets repeated long contexts be billed at one-tenth the input price. The mechanism is dead simple from the outside: tag a content block with cache_control, and on subsequent calls those tokens count as cache_read_input_tokens instead of input_tokens. Knowing when, where, and how long to cache turns this from a nice optimization into a major cost lever.
What it is
A Claude request is, at the model level, a flat stream of tokens. The cache lets Anthropic memoize the prefix of that stream. The first time you send a tagged prefix, the model processes it and Anthropic writes the resulting attention state to an ephemeral cache. Subsequent requests that share the same prefix skip the compute and pay a small read fee instead.
The pricing model is straightforward:
| Token category | Relative cost |
|---|---|
| Standard input | 1.0× |
| Cache write (5-minute) | 1.25× (one-time) |
| Cache write (1-hour) | 2.0× (one-time) |
| Cache read | 0.1× |
| Output | 5.0× (varies by model) |
You pay a small premium on the first call, then 90% off forever (well, until the TTL expires). The arithmetic is favorable any time the cached prefix is reused at least twice within its TTL window.
cache_control directives byte-for-byte and bills based on the cached/non-cached token counts the upstream returns — it does not impose any additional cache surcharge.How cache_control works
You annotate one or more content blocks with cache_control. The marker says "everything from the start of the request up to and including this block is a candidate cache prefix". Anthropic builds a hash of that prefix and looks it up.
{
"model": "claude-haiku-4-5-20251001",
"max_tokens": 512,
"system": [
{
"type": "text",
"text": "<30 KB of stable instructions and context>",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": "Question about the context above"}
]
}
Two important rules:
- Cache breakpoints are positional. The cached span runs from byte zero to the marked block. If you change anything earlier in the prompt, the cache misses. Stable, unchanging content goes first.
- You can mark up to four breakpoints. They form a layered cache, longest-match-wins. Use this to cache a stable system prompt, then a stable middle section, then a section that updates per-session, etc.
Cache hits show up in the response under usage:
"usage": {
"input_tokens": 12,
"cache_creation_input_tokens": 0,
"cache_read_input_tokens": 30000,
"output_tokens": 240
}
That cache_read_input_tokens: 30000 is the win. Those 30,000 tokens cost one-tenth what they would have cost as standard input.
5-minute vs 1-hour TTL
Anthropic offers two cache durations. The decision is almost always driven by inter-request gap.
| TTL | Write premium | Use when |
|---|---|---|
| 5 minutes (default) | 1.25× | Interactive sessions where the same context is hit multiple times within a few minutes (chat conversations, agent tool loops, IDE assistants). |
| 1 hour | 2.0× | Periodic batch jobs, scheduled report generation, or anything where the same context is reused across requests separated by minutes-to-an-hour. |
Specify the longer TTL like this:
"cache_control": {"type": "ephemeral", "ttl": "1h"}
The simple decision rule: if your typical re-use gap is under five minutes, take the default. If your re-use gap is between five minutes and an hour, the 1-hour cache is cheaper than re-creating a 5-minute cache. Above an hour, you are creating fresh caches every time and should re-think whether caching helps at all.
How BUZZ forwards cache_control
BUZZ does not interpret cache_control. It is forwarded byte-for-byte to the upstream channel. The cache itself lives in Anthropic's infrastructure, not in BUZZ. This has three implications worth understanding:
- Cache hit rate is identical to direct. Because BUZZ does not reshape the request, the prefix hash Anthropic computes is the same it would compute on a direct call.
- The cache is per-channel. A request routed through Bedrock and a request routed through Anthropic First-Party use independent caches. If your group has multiple channels for the same model, sticky routing is necessary to keep cache hit rate high. BUZZ supports priority pinning for exactly this reason.
- BUZZ cannot leak your cache. The cache is bound to the upstream account. Other BUZZ users do not share a cache namespace with you.
Practical patterns
Long stable system prompt. If your assistant has a 20K-token instruction and example block, mark it with cache_control. Every conversation turn after the first becomes near-free on the input side.
RAG with a stable retrieval set. If a session retrieves the same documents repeatedly, place those documents in the system prompt and cache them. The user message changes; the cache prefix does not.
Tool definition reuse. Long tools[] arrays can be cached. Place the cache breakpoint after the last tool definition so the entire tool catalog becomes a single cached prefix.
Multi-step agent loops. Each tool round-trip is a new request. With caching, the conversation history (often the longest part of the prompt) is paid once and read back on every iteration. The savings compound with conversation length.
See also
POST /v1/messages— wherecache_controlis sent- Transparent Forwarding — why hit rate matches direct
- Tool Use — caching tool catalogs
- Anthropic prompt caching playbook — production patterns and pitfalls