Anthropic Prompt Caching in Production: A Practical Cost-Reduction Playbook
Prompt caching is the largest single cost lever available on the Claude API, and most teams are still leaving it on the floor. Not because it is hard to enable, but because the patterns that make it work are different from the patterns that make a normal LLM call work. This playbook is the version we wish someone had handed us before our first invoice cycle.
Why prompt caching is the biggest cost lever you have
If you look at a real production Claude workload, the input side of the bill is almost always dominated by content that does not change between calls. A system prompt with policies and tone instructions. A tool schema with a dozen function definitions. Few-shot examples. A retrieved document that the user is asking three follow-up questions about. The user turn at the bottom of the request is often the smallest part of the payload, and yet without caching you pay full input rate for every byte above it on every single call.
Prompt caching flips that. Instead of treating each request as independent, the server stores the prefix you have marked as cacheable and lets subsequent requests skip the recomputation. The economics are not subtle. A cache read costs 0.1x the base input rate. That is not a discount, that is a different order of magnitude. A team that goes from no caching to well-placed caching on a high-volume agent typically sees input spend drop by 70 to 90 percent without changing the model, the prompt content, or the output quality.
The lever is real. The reason it goes unused is that prompt caching only pays off when the prefix is byte-identical across calls, and most application code does not naturally produce byte-identical prefixes. Once you internalize that constraint, the rest of the playbook is mechanical.
What prompt caching actually does
The Anthropic Messages API exposes prompt caching through a single mechanism: a cache_control field that you attach to a content block. When the request reaches the server, the system looks at every block up to and including the marker, hashes the byte representation of that prefix, and either:
- creates a new cache entry containing the encoded form of that prefix, billed at the cache write rate, or
- finds an existing entry that matches the prefix exactly, skips the work, and bills at the cache read rate.
The cache_control field takes a single type today, with an optional ttl:
{"type": "ephemeral"} # default 5-minute TTL
{"type": "ephemeral", "ttl": "1h"} # 1-hour TTL
You can place markers in three regions of the request: on entries inside the tools array, on blocks inside the system array, and on content blocks inside messages. The prefix is assembled in that order (tools, then system, then messages), so a change to the tools invalidates everything that follows them. A request can carry up to four markers, and each one creates a separate breakpoint. On a subsequent call the server picks the longest matching breakpoint, so it is fine to place markers conservatively, knowing that the longest stable prefix wins.
Two non-obvious facts shape every decision below:
- Prefix match is byte-exact. A trailing space, a key reordering, a different floating-point literal in the system prompt, or a model identifier swap all produce cache misses. Whatever generates the cacheable prefix has to be deterministic.
- There is a minimum prefix size. Below roughly 1024 tokens (2048 for Haiku) the marker is ignored and you are billed as if no caching were requested. Caching very small system prompts is a no-op.
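The four-marker limit is easy to exceed once several helpers each attach their own cache_control. Below is a minimal sketch of a pre-flight check, assuming the request is the plain dict you would pass to messages.create. The name count_cache_markers is a hypothetical helper, not part of the SDK.

```python
MAX_MARKERS = 4  # limit stated in the text above


def count_cache_markers(request: dict) -> int:
    """Count cache_control markers across tools, system, and messages."""
    count = 0
    # Tool definitions carry the marker at the top level of each entry.
    for tool in request.get("tools", []):
        if "cache_control" in tool:
            count += 1
    # A plain-string system prompt cannot carry a marker; only block lists can.
    system = request.get("system", [])
    if isinstance(system, list):
        count += sum(1 for block in system if "cache_control" in block)
    # Message content is either a plain string or a list of content blocks.
    for message in request.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):
            count += sum(
                1 for block in content
                if isinstance(block, dict) and "cache_control" in block
            )
    return count


# Illustrative request shape; the contents are placeholders.
request = {
    "tools": [
        {"name": "a", "input_schema": {"type": "object"}},
        {"name": "b", "input_schema": {"type": "object"},
         "cache_control": {"type": "ephemeral"}},
    ],
    "system": [
        {"type": "text", "text": "policies...", "cache_control": {"type": "ephemeral"}},
    ],
    "messages": [
        {"role": "user", "content": "plain string turn"},
        {"role": "assistant", "content": [
            {"type": "text", "text": "earlier answer",
             "cache_control": {"type": "ephemeral"}},
        ]},
    ],
}

markers = count_cache_markers(request)
within_limit = markers <= MAX_MARKERS
```

Running a check like this in tests or just before the call turns a silent misconfiguration into an explicit failure.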
Pricing math: 50K-token system prompt, 100 calls
The fastest way to internalize the savings is to do the arithmetic on a realistic shape. Take a 50,000-token system prompt (policies, tone, schema description, few-shot examples) on Claude Sonnet, and assume the agent gets called 100 times in a 5-minute window. Anthropic's published rates apply:
| Model | Base input | 5-min write (1.25x) | 1-hour write (2x) | Cache read (0.1x) |
|---|---|---|---|---|
| claude-opus-4-8 | $5.00 | $6.25 | $10.00 | $0.50 |
| claude-sonnet-4-6 | $3.00 | $3.75 | $6.00 | $0.30 |
| claude-haiku-4-5 | $1.00 | $1.25 | $2.00 | $0.10 |
All values are per million input tokens. Live, current numbers including any gateway discounts are at /api/pricing; this section uses Anthropic's first-party rates so the multipliers are visible.
No caching, Sonnet, 100 calls, 50K tokens each.
100 calls x 50,000 tokens = 5,000,000 input tokens
5,000,000 / 1,000,000 x $3.00 = $15.00
5-minute cache, Sonnet, 100 calls. The first call writes the cache; the next 99 read it.
Write: 50,000 / 1,000,000 x $3.75 = $0.1875
Reads: 99 x 50,000 / 1,000,000 x $0.30 = $1.4850
Total: $1.6725
That is an 89 percent reduction on the cached prefix. The user-turn portion of each request is unaffected and still bills at the standard input rate, but in most workloads the user turn is a small minority of total input tokens.
1-hour cache, same shape.
Write: 50,000 / 1,000,000 x $6.00 = $0.30
Reads: 99 x 50,000 / 1,000,000 x $0.30 = $1.485
Total: $1.785
Inside a 5-minute window, the 1-hour TTL is strictly more expensive because of the higher write multiplier. Its value comes entirely from amortizing across longer time gaps. Which leads directly to the next question.
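The arithmetic above can be reproduced with a few lines of Python, which makes it easy to plug in your own prefix size and call volume. This is a sketch under the same assumptions as the worked example: one write, every later call is a hit, and only the cached prefix is priced. The function name prefix_cost is made up for illustration.

```python
def prefix_cost(tokens, calls, base_rate, write_mult=None, read_mult=0.1):
    """Dollar cost of sending a `tokens`-long prefix `calls` times.

    base_rate is dollars per million input tokens.
    write_mult=None models the no-caching case.
    """
    per_call = tokens / 1_000_000 * base_rate
    if write_mult is None:
        return calls * per_call
    # First call writes the cache; the remaining calls read it.
    return per_call * write_mult + (calls - 1) * per_call * read_mult


SONNET_BASE = 3.00  # dollars per million input tokens, from the table above

no_cache = prefix_cost(50_000, 100, SONNET_BASE)
five_min = prefix_cost(50_000, 100, SONNET_BASE, write_mult=1.25)
one_hour = prefix_cost(50_000, 100, SONNET_BASE, write_mult=2.0)
reduction = 1 - five_min / no_cache

print(f"no cache ${no_cache:.4f}, 5-min ${five_min:.4f}, 1-hour ${one_hour:.4f}")
```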
5-minute vs 1-hour: where the break-even sits
The choice between TTLs is a function of two variables: how many reads the prefix will get before it expires, and how long the gaps between reads typically are.
Compare the cost difference between the two writes, and the cost difference between a read and a fresh full-input billing:
- Extra write cost for choosing 1-hour over 5-minute: 0.75x base input per cached token.
- Cost saved by a cache read versus a full input billing: 0.9x base input per cached token.
So the 1-hour TTL pays for itself if you get one additional read beyond the 5-minute window that the 5-minute cache could not have served. In the 50K-token Sonnet example, the extra write cost is 50,000 / 1,000,000 x ($6.00 - $3.75) = $0.1125. A single 1-hour read at $0.015 against a missed full-input call at $0.15 saves $0.135. The 1-hour cache is in the black after one extra hit.
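The break-even can be expressed as a small function. The sketch below follows the comparison in the text: each extra hit is priced against a full-input billing. If the missed call would instead have re-written a 5-minute entry at 1.25x, the 1-hour TTL comes out further ahead than this shows. The name one_hour_net_saving is hypothetical.

```python
def one_hour_net_saving(tokens, base_rate, extra_hits):
    """Net dollars saved by a 1-hour TTL over a 5-minute TTL.

    extra_hits counts the reads that arrive after a 5-minute entry
    would already have expired. base_rate is dollars per million tokens.
    """
    per_call = tokens / 1_000_000 * base_rate
    extra_write = per_call * (2.0 - 1.25)   # 0.75x base input
    saved_per_hit = per_call * (1.0 - 0.1)  # 0.9x base input
    return extra_hits * saved_per_hit - extra_write


zero_hits = one_hour_net_saving(50_000, 3.00, extra_hits=0)
one_hit = one_hour_net_saving(50_000, 3.00, extra_hits=1)
```

With no extra hits the result is negative (the extra write is wasted). With one extra hit it is already positive.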
Practical defaults that work for most workloads:
- Interactive chat or coding agents with bursts of activity inside a few minutes: 5-minute TTL on the system prompt and tools.
- Background pipelines, document review, batch jobs where calls are spaced out by tens of minutes but the same prefix recurs across hours: 1-hour TTL.
- Long retrieved documents a user is iterating on across a working session: 1-hour TTL on the document block, 5-minute TTL on conversation turns.
The mistake to avoid is choosing the 1-hour TTL "to be safe" on a high-frequency workload. You pay the larger write multiplier on every fresh entry, which on chronically cache-cold prefixes (per-user state, request-specific data) actively raises the bill.
Code patterns that work
The Anthropic Python SDK accepts cache_control on the structured content blocks. The patterns below assume you have set base_url="https://buzzai.cc" on the client; the API surface is identical to first-party Anthropic, and every cache directive is forwarded unmodified.
1. System prompt and tools, single session
The most common starter pattern. The system prompt is a single text block, the tools array is short and stable, and you place one marker at the end of each. Two breakpoints, both stable across the agent's lifetime.
from anthropic import Anthropic

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key="sk-...",  # BUZZ key
)

SYSTEM_PROMPT = """You are an internal compliance assistant.
[... 50K tokens of policies, tone rules, schema definitions ...]"""

TOOLS = [
    {
        "name": "lookup_customer",
        "description": "Fetch customer record by id.",
        "input_schema": {
            "type": "object",
            "properties": {"id": {"type": "string"}},
            "required": ["id"],
        },
    },
    # ... more tool definitions ...
]

# Mark the last tool with cache_control to cap the tools-array prefix.
TOOLS[-1] = {**TOOLS[-1], "cache_control": {"type": "ephemeral"}}

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOLS,
    messages=[
        {"role": "user", "content": "Look up customer 12345 and summarize their last 3 tickets."}
    ],
)

print(resp.usage)
# Usage(input_tokens=42, cache_creation_input_tokens=51234,
#       cache_read_input_tokens=0, output_tokens=187)
On the second call within the TTL, with the same system prompt and tools array, you will see cache_creation_input_tokens=0 and cache_read_input_tokens=51234. The user turn at the bottom continues to bill at the standard input rate.
2. Multi-turn conversation
For a chat agent, the prefix you want to cache grows over time: turn 1 is stable by turn 2, turns 1-3 are stable by turn 4, and so on. The right pattern is to attach cache_control to the last assistant turn from the prior round, which makes everything up through that point cacheable.
def build_messages(history, new_user_input):
    messages = []
    for i, turn in enumerate(history):
        block = {"type": "text", "text": turn["text"]}
        # Mark the most recent assistant turn as a cache breakpoint.
        if i == len(history) - 1 and turn["role"] == "assistant":
            block["cache_control"] = {"type": "ephemeral"}
        messages.append({"role": turn["role"], "content": [block]})
    messages.append({"role": "user", "content": new_user_input})
    return messages

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=build_messages(history, "follow-up question..."),
)
This gives you up to three of the four available markers (system, last assistant turn, current user turn if needed) and lets the server pick the longest matching prefix on each call. As the conversation grows, the cache savings grow with it, because each new turn extends a prefix that was already cached on the prior call.
3. Long document, full content cached
For a document Q&A or code review workload, the document is the cacheable asset and the user turn is what changes. Put the document in its own content block at the start of the user message and mark it directly.
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[{"type": "text", "text": "You are a careful technical reviewer."}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": large_document_text,  # 80K tokens
                    "cache_control": {"type": "ephemeral", "ttl": "1h"},
                },
                {
                    "type": "text",
                    "text": "Summarize the security implications in section 4.",
                },
            ],
        }
    ],
)
The first turn writes the document at the 1-hour rate. Every subsequent question against the same document, for an hour, reads at 0.1x. This is the pattern that produces the most dramatic dollar savings on long-context workloads, because the cache write is amortized across an entire reading session.
Through a gateway: what changes, what does not
If you call BUZZ AI Gateway instead of api.anthropic.com, prompt caching behaves identically from the application's point of view. The gateway is a transparent forwarder. Every byte of the request body, including cache_control markers, ordering of system blocks, ordering of tool definitions, and message content, is sent upstream without modification. The response, including the usage object with cache_creation_input_tokens and cache_read_input_tokens, is returned to the client as-is.
Concretely, the contract is:
- BUZZ does not normalize whitespace, reorder JSON keys, strip null fields, or rewrite system prompts. Byte-exact prefix matching is preserved.
- BUZZ does not inject its own system content, tool definitions, or guidance preamble. Prefix length stays whatever you sent.
- BUZZ does not silently substitute models. claude-sonnet-4-6 reaches Anthropic as claude-sonnet-4-6, which keeps cache entries scoped correctly.
- BUZZ operates with zero data retention. Bodies are not persisted; only billing metadata (model, token counts, timestamps) is kept. Caching state lives upstream at Anthropic, not at the gateway.
The point worth repeating: a gateway that rewrites requests breaks caching. A gateway that forwards bytes does not. The full lineup of supported models, including the cache-eligible ones used in this article, is at /models, and the live cost numbers including cache multipliers are at /api/pricing.
For Claude Code specifically, the install script at /sh/claudecode.sh configures the CLI to route through the gateway with one command:
curl -fsSL https://buzzai.cc/sh/claudecode.sh | bash
Claude Code's own caching of system prompts and tools continues to work because the request shape it sends is preserved end to end.
Common mistakes that quietly break caching
Most cache-related bugs are not visible: the request succeeds, the response is correct, the bill is just higher than it should be. The patterns below cover the failures we see most often.
Non-deterministic system prompts
If your system prompt is built by string-formatting in the current date, a request id, a user id, or any other per-request value above the cache marker, every call gets a fresh cache entry. Move volatile values below the cache marker, into the user turn, or into a separate non-cached block. A useful sanity check is to log a hash of the cached prefix on every call: if the hash changes between calls that should reuse cache, you have found the leak.
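One way to implement that sanity check is sketched below. It assumes the cacheable prefix consists of the model name, the tools array, and the system blocks. The fingerprint is a local proxy for debugging, not the hash the server computes, and prefix_fingerprint is a hypothetical helper name.

```python
import hashlib
import json


def prefix_fingerprint(model, tools, system):
    """Return a short, stable hash of the parts that form the cached prefix."""
    # No sort_keys: a change in key order should show up as a different hash,
    # because it can also change the bytes that go over the wire.
    payload = json.dumps(
        {"model": model, "tools": tools, "system": system},
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


stable_text = "You are an internal compliance assistant."

fp_first = prefix_fingerprint(
    "claude-sonnet-4-6", [], [{"type": "text", "text": stable_text}]
)
fp_second = prefix_fingerprint(
    "claude-sonnet-4-6", [], [{"type": "text", "text": stable_text}]
)
# A per-request value formatted into the prompt changes the fingerprint.
fp_leaky = prefix_fingerprint(
    "claude-sonnet-4-6", [],
    [{"type": "text", "text": stable_text + " Today is 2026-05-22."}],
)
```

Logging the fingerprint next to the usage counters lets you match a drop in cache reads to the exact deploy or code path that changed the prefix.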
Reordered tool arrays or system blocks
Tools are often loaded from a registry that returns them in whatever order plugins happened to be discovered, or from a set, whose iteration order is not stable across processes. Sort the tool array deterministically before sending and always place the cache marker on the last element. Multi-block system arrays need the same discipline: keep the block order fixed.
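A minimal sketch of that normalization step follows. It assumes tool names are unique and that sorting by name is acceptable for your agent. prepare_tools is a hypothetical helper, not an SDK function.

```python
def prepare_tools(tools, ttl=None):
    """Return a name-sorted copy of `tools` with one cache marker on the last entry."""
    marker = {"type": "ephemeral"}
    if ttl is not None:
        marker["ttl"] = ttl
    # Copy each tool and drop stale markers so exactly one breakpoint remains.
    ordered = [
        {key: value for key, value in tool.items() if key != "cache_control"}
        for tool in sorted(tools, key=lambda tool: tool["name"])
    ]
    if ordered:
        ordered[-1]["cache_control"] = marker
    return ordered


lookup = {"name": "lookup_customer", "input_schema": {"type": "object"}}
refund = {"name": "issue_refund", "input_schema": {"type": "object"}}

# Two processes that discovered the tools in different orders...
from_process_a = prepare_tools([lookup, refund])
from_process_b = prepare_tools([refund, lookup])
# ...still produce the same array, with the marker on the last element.
```

The inputs are left unmodified, so the registry can keep handing out the same tool dicts to other callers.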
Missing the marker on later calls
Caching is opt-in per call. If only the first request includes cache_control and later requests omit it, the server has nothing to look up against and bills at the full input rate even though an entry exists. Place the marker in shared code, not in the first call site.
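One way to keep the marker in shared code is a single request builder that every call site goes through. This is a sketch only: build_request is a hypothetical name and SYSTEM_PROMPT here is a short placeholder for the real prompt.

```python
SYSTEM_PROMPT = "You are an internal compliance assistant."  # placeholder text


def build_request(user_text, model="claude-sonnet-4-6", max_tokens=1024):
    """Build keyword arguments for messages.create with the cache marker always set."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # The marker lives here, so no call site can forget it.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_text}],
    }


first_call = build_request("Summarize ticket 1.")
later_call = build_request("Summarize ticket 2.")
```

Each call site then becomes client.messages.create(**build_request(text)), assuming a configured client, and the cached portion of every request is identical by construction.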
Prefix below the minimum length
Markers on prefixes shorter than 1024 tokens (2048 for Haiku) are ignored. If you cannot see cache_creation_input_tokens on the first call after enabling caching, your prefix is probably too short. Either consolidate more content above the marker or accept that this prompt is not large enough to benefit.
Wrong order of blocks
The marker caches everything above it, not below. If you place cache_control on the user's current question, you are asking the server to cache content that changes every call, which never produces a hit. Markers belong on stable blocks: system, tools, prior turns, and the early portion of a long document.
TTL drift
Reads against a 1-hour entry from a request that asks for a 5-minute TTL still work, but you cannot create a new 1-hour entry by mixing TTLs. Pick a TTL per breakpoint and stick to it across the lifetime of the cache.
Model swaps
Cache entries are scoped per model. Switching from claude-sonnet-4-6 to claude-opus-4-8 for "harder" requests is a complete cache miss for those requests, which means the higher per-token rate is paid against an uncached prefix. If you do A/B between models, expect both sides to maintain their own caches.
An observability checklist
Caching that you cannot see is caching you cannot tune. The minimum useful instrumentation is three counters per call, all available from the usage object on every response:
- input_tokens: uncached prefix plus the new turn, billed at full input rate.
- cache_creation_input_tokens: tokens written to a new cache entry, billed at the write multiplier.
- cache_read_input_tokens: tokens served from cache, billed at the 0.1x read rate.
Aggregate these by route or feature, then plot the ratio cache_read / (cache_read + input + cache_creation) over time. A healthy production agent settles above 0.7 once warmed. Anything below 0.3 on a high-volume route is a tuning opportunity, almost always traceable to one of the mistakes in the previous section.
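A minimal in-process aggregator for that ratio might look like the sketch below. It assumes you pass in the usage object from each response. The CacheStats class and the route labels are hypothetical, and a production setup would more likely emit the three counters to an existing metrics system.

```python
from collections import defaultdict
from types import SimpleNamespace


class CacheStats:
    """Accumulate the three usage counters per route and report the hit ratio."""

    def __init__(self):
        self.totals = defaultdict(lambda: {"input": 0, "write": 0, "read": 0})

    def record(self, route, usage):
        totals = self.totals[route]
        # `or 0` tolerates fields that are missing or None.
        totals["input"] += getattr(usage, "input_tokens", 0) or 0
        totals["write"] += getattr(usage, "cache_creation_input_tokens", 0) or 0
        totals["read"] += getattr(usage, "cache_read_input_tokens", 0) or 0

    def hit_ratio(self, route):
        totals = self.totals[route]
        denominator = totals["input"] + totals["write"] + totals["read"]
        return totals["read"] / denominator if denominator else 0.0


# Stand-ins for response.usage, using the token counts from the earlier example.
write_call = SimpleNamespace(
    input_tokens=42, cache_creation_input_tokens=51234, cache_read_input_tokens=0
)
read_call = SimpleNamespace(
    input_tokens=42, cache_creation_input_tokens=0, cache_read_input_tokens=51234
)

stats = CacheStats()
stats.record("support-agent", write_call)
for _ in range(10):
    stats.record("support-agent", read_call)

ratio = stats.hit_ratio("support-agent")
```

With one write followed by ten reads of the same prefix, the ratio lands above 0.9, which is the behavior a warmed-up route should show.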
BUZZ exposes the same usage fields in the response body without modification, so the metric pipeline you would build against direct Anthropic usage works unchanged when you switch base_url to the gateway.
FAQ
Do I need to do anything special on the BUZZ side to enable caching?
No. Caching is an upstream feature governed entirely by the request body. If you set cache_control correctly, the gateway forwards it and Anthropic handles the rest.
Will caching change model behavior?
It should not. The encoded prefix is replayed as if it had been computed fresh; the model sees the same context. If you observe behavior drift after enabling caching, the most likely cause is an unrelated content change introduced at the same time.
What happens if Anthropic evicts the cache entry early?
The next request that would have been a hit is billed as a fresh write. There is no error, no warning, just a slightly higher bill on that call. Eviction is rare within the stated TTLs but not guaranteed.
Can I cache across users?
Yes, as long as the prefix is byte-identical and the calls are made under the same account. A shared system prompt and tools array used by every user benefits from a single cache entry, regardless of which user triggered the write.
Does caching change how rate limits work?
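Caching does not change RPM (requests per minute). For input-token limits, pricing and rate limiting are separate questions: on Anthropic's first-party API, cache reads are excluded from the input-tokens-per-minute count on most current models, so a high hit rate also raises effective throughput. Check the rate-limit documentation for your model and tier before relying on this.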
Is caching available on every Claude model?
It is supported on the current Sonnet, Opus, and Haiku families, including claude-opus-4-8, claude-sonnet-4-6, and claude-haiku-4-5. Minimum prefix sizes vary; Haiku requires 2048 tokens before a marker takes effect.
Can I combine caching with the OpenAI-compatible endpoint?
Caching is an Anthropic Messages API feature. To use it, call Claude through the native Anthropic surface (BUZZ at https://buzzai.cc with the Anthropic SDK). The OpenAI-compatible endpoint at https://buzzai.cc/v1 is useful for portability but does not expose cache_control.
Should I cache few-shot examples?
Almost always yes. Few-shot examples are typically the longest stable region of a prompt and are reused identically across calls. Place them in the system block above the cache marker.
What if my system prompt is just barely under the minimum?
Pad it with content that is genuinely useful: explicit format examples, edge-case instructions, or a glossary. Padding for the sake of padding does not improve quality and often hurts it.
Conclusion
Prompt caching is the rare optimization that costs almost nothing to enable, requires no model change, and routinely cuts input spend by an order of magnitude. The reason it goes unused is operational, not technical: producing byte-identical prefixes across calls demands a small amount of discipline in how prompts are assembled, and most code does not start out that way.
The playbook is short. Mark the system prompt and tools array. Pick a TTL based on how often the prefix will recur. Keep volatile values below the marker. Watch cache_read_input_tokens and tune until it dominates. If you call Claude through BUZZ, every cache directive is forwarded unchanged, every usage field comes back unmodified, and the savings show up on the gateway bill at /api/pricing the same way they would on a direct integration. Models supported, including all cache-eligible variants, are listed at /models; Claude Code users can adopt the gateway in a single shell command from /sh/claudecode.sh.
If you do nothing else after reading this, instrument the three usage counters on your highest-volume Claude route. The gap between what your bill is and what it could be will be visible inside an hour.
Published: 2026-05-22
Last reviewed: 2026-05-22