What does Anthropic prompt caching actually cache?

Anthropic caches the prefix of a request on the server side. When you mark a block with cache_control, the system stores the encoded representation of everything up to and including that block. A subsequent request that begins with the exact same bytes can skip recomputation for that prefix and pay the cache read rate, which is 0.1x the base input rate, instead of the full input rate. The cache is keyed on byte-exact prefix match, model, and account.

How much does prompt caching save in real numbers?

On Sonnet at $3 per million input tokens, a 50,000-token system prompt costs $0.15 per call without caching. With a 5-minute cache, the first call costs $0.1875 (1.25x write multiplier) and each subsequent call within the TTL costs $0.015 (0.1x read multiplier). Across 100 calls in a 5-minute window, the bill drops from $15.00 to $1.6725, a 89 percent reduction on that prefix.

When should I pick the 1-hour cache instead of 5 minutes?

Use the 1-hour TTL when the time gap between requests is longer than 5 minutes but the prefix is reused often enough to amortize the higher write cost. The break-even is roughly: pay 0.75x extra base input on the write (2x minus 1.25x), then save 0.9x base input on each read versus paying the input rate. For a 50K-token prefix on Sonnet, the 1-hour write costs $0.225 more than the 5-minute write, which pays back after about 2 to 3 additional reads beyond the 5-minute window.

Does cache_control survive when calling Claude through a gateway?

It does when the gateway is transparent. BUZZ AI Gateway forwards the request body byte for byte, including every cache_control marker on system blocks, tool definitions, and message content. The response is also forwarded unchanged, so cache_creation_input_tokens and cache_read_input_tokens reach the client exactly as Anthropic returned them. Gateways that rewrite or normalize the request body can silently break caching by changing whitespace, reordering keys, or trimming fields.

Why did my cache_read_input_tokens stay at zero on the second call?

The most common reasons are byte-exact prefix mismatch, missing cache_control on the second call, exceeding the TTL, or a model change. Even a single character difference in the system prompt, a different tool ordering, or a swap from claude-sonnet-4-6 to claude-sonnet-4-5 will produce a cache miss. The cache is keyed on the full prefix, so anything before the marker that changes between calls invalidates the entry.

How many cache_control markers can I place in one request?

Anthropic allows up to four cache_control markers per request. A typical layout uses one on the system prompt, one at the end of the tools array, and one or two on early message turns to keep multi-turn conversations cache-friendly. Each marker creates a separate cache breakpoint, and the system picks the longest matching prefix on subsequent calls.

Does prompt caching work with streaming and tool use?

Yes. Prompt caching is transparent to the streaming protocol. The first chunk of an SSE stream includes the usage block with cache_creation_input_tokens or cache_read_input_tokens populated, and the rest of the stream behaves identically to a non-cached call. Tool use is fully compatible: the tools array is one of the recommended places to set cache_control, since tool schemas are usually stable across calls.

Can I cache user data, or only system prompts?

You can cache anything that appears as a stable prefix. That includes long retrieved documents, prior turns in a conversation, and few-shot examples. The constraint is that everything before the marker must be byte-identical on subsequent calls. For RAG, a common pattern is to put the retrieved chunk in a content block early in the user message and place cache_control there, so refinement turns reuse the document.

Is there a minimum size for a cache entry?

Yes. Cache entries have a minimum prefix length, currently 1024 tokens for most Claude models and 2048 for Haiku. A cache_control marker on a shorter prefix is silently ignored and the request is billed at the full input rate. This is why caching small system prompts often shows no savings even when the marker is present.

How do I see whether caching actually fired?

Inspect the usage object in the response. cache_creation_input_tokens shows tokens written to a new cache entry, cache_read_input_tokens shows tokens served from cache, and input_tokens shows the remaining uncached prefix plus the new turn. A successful cached call will have a low input_tokens value and a high cache_read_input_tokens value. Logging these three counters per call is the simplest cache observability you can deploy.

Home › Blog › Anthropic Prompt Caching Playbook

Anthropic Prompt Caching in Production: A Practical Cost-Reduction Playbook

Prompt caching is the largest single cost lever available on the Claude API, and most teams are still leaving it on the floor. Not because it is hard to enable, but because the patterns that make it work are different from the patterns that make a normal LLM call work. This playbook is the version we wish someone had handed us before our first invoice cycle.

Engineering playbook · Anthropic Messages API · cache_control, TTLs, and gateway transparency

Why prompt caching is the biggest cost lever you have

If you look at a real production Claude workload, the input side of the bill is almost always dominated by content that does not change between calls. A system prompt with policies and tone instructions. A tool schema with a dozen function definitions. Few-shot examples. A retrieved document that the user is asking three follow-up questions about. The user turn at the bottom of the request is often the smallest part of the payload, and yet without caching you pay full input rate for every byte above it on every single call.

Prompt caching flips that. Instead of treating each request as independent, the server stores the prefix you have marked as cacheable and lets subsequent requests skip the recomputation. The economics are not subtle. A cache read costs 0.1x the base input rate. That is not a discount, that is a different order of magnitude. A team that goes from no caching to well-placed caching on a high-volume agent typically sees input spend drop by 70 to 90 percent without changing the model, the prompt content, or the output quality.

The lever is real. The reason it goes unused is that prompt caching only pays off when the prefix is byte-identical across calls, and most application code does not naturally produce byte-identical prefixes. Once you internalize that constraint, the rest of the playbook is mechanical.

What prompt caching actually does

The Anthropic Messages API exposes prompt caching through a single mechanism: a cache_control field that you attach to a content block. When the request reaches the server, the system looks at every block up to and including the marker, hashes the byte representation of that prefix, and either:

creates a new cache entry containing the encoded form of that prefix, billed at the cache write rate, or
finds an existing entry that matches the prefix exactly, skips the work, and bills at the cache read rate.

The cache_control field takes a single shape today:

{"type": "ephemeral"}             # default 5-minute TTL
{"type": "ephemeral", "ttl": "1h"} # 1-hour TTL

You can place markers in three regions of the request: on the system array, on entries inside the tools array, and on content blocks inside messages. A request can carry up to four markers, and each one creates a separate breakpoint. On a subsequent call the server picks the longest matching breakpoint, so it is fine to place markers conservatively, knowing that the longest stable prefix wins.

Two non-obvious facts shape every decision below:

Prefix match is byte-exact. A trailing space, a key reordering, a different floating-point literal in the system prompt, or a model identifier swap all produce cache misses. Whatever generates the cacheable prefix has to be deterministic.
There is a minimum prefix size. Below roughly 1024 tokens (2048 for Haiku) the marker is ignored and you are billed as if no caching were requested. Caching very small system prompts is a no-op.

Pricing math: 50K-token system prompt, 100 calls

The fastest way to internalize the savings is to do the arithmetic on a realistic shape. Take a 50,000-token system prompt (policies, tone, schema description, few-shot examples) on Claude Sonnet, and assume the agent gets called 100 times in a 5-minute window. Anthropic's published rates apply:

Model	Base input	5-min write (1.25x)	1-hour write (2x)	Cache read (0.1x)
claude-opus-4-8	$5.00	$6.25	$10.00	$0.50
claude-sonnet-4-6	$3.00	$3.75	$6.00	$0.30
claude-haiku-4-5	$1.00	$1.25	$2.00	$0.10

All values are per million input tokens. Live, current numbers including any gateway discounts are at /api/pricing; this section uses Anthropic's first-party rates so the multipliers are visible.

No caching, Sonnet, 100 calls, 50K tokens each.

100 calls x 50,000 tokens = 5,000,000 input tokens
5,000,000 / 1,000,000 x $3.00 = $15.00

5-minute cache, Sonnet, 100 calls. The first call writes the cache; the next 99 read it.

Write:  50,000 / 1,000,000 x $3.75 = $0.1875
Reads:  99 x 50,000 / 1,000,000 x $0.30 = $1.4850
Total:  $1.6725

That is an 89 percent reduction on the cached prefix. The user-turn portion of each request is unaffected and still bills at the standard input rate, but in most workloads the user turn is a small minority of total input tokens.

1-hour cache, same shape.

Write:  50,000 / 1,000,000 x $6.00 = $0.30
Reads:  99 x 50,000 / 1,000,000 x $0.30 = $1.485
Total:  $1.785

Inside a 5-minute window, the 1-hour TTL is strictly more expensive because of the higher write multiplier. Its value comes entirely from amortizing across longer time gaps. Which leads directly to the next question.

5-minute vs 1-hour: where the break-even sits

The choice between TTLs is a function of two variables: how many reads the prefix will get before it expires, and how long the gaps between reads typically are.

Compare the cost difference between the two writes, and the cost difference between a read and a fresh full-input billing:

Extra write cost for choosing 1-hour over 5-minute: 0.75x base input per cached token.
Cost saved by a cache read versus a full input billing: 0.9x base input per cached token.

So the 1-hour TTL pays for itself if you get one additional read beyond the 5-minute window that the 5-minute cache could not have served. In the 50K-token Sonnet example, the extra write cost is 50,000 / 1,000,000 x ($6.00 - $3.75) = $0.1125. A single 1-hour read at $0.015 against a missed full-input call at $0.15 saves $0.135. The 1-hour cache is in the black after one extra hit.

Practical defaults that work for most workloads:

Interactive chat or coding agents with bursts of activity inside a few minutes: 5-minute TTL on the system prompt and tools.
Background pipelines, document review, batch jobs where calls are spaced out by tens of minutes but the same prefix recurs across hours: 1-hour TTL.
Long retrieved documents a user is iterating on across a working session: 1-hour TTL on the document block, 5-minute TTL on conversation turns.

The mistake to avoid is choosing the 1-hour TTL "to be safe" on a high-frequency workload. You pay the larger write multiplier on every fresh entry, which on chronically cache-cold prefixes (per-user state, request-specific data) actively raises the bill.

Code patterns that work

The Anthropic Python SDK accepts cache_control on the structured content blocks. The patterns below assume you have set base_url="https://buzzai.cc" on the client; the API surface is identical to first-party Anthropic, and every cache directive is forwarded unmodified.

1. System prompt and tools, single session

The most common starter pattern. The system prompt is a single text block, the tools array is short and stable, and you place one marker at the end of each. Two breakpoints, both stable across the agent's lifetime.

from anthropic import Anthropic

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key="sk-...",  # BUZZ key
)

SYSTEM_PROMPT = """You are an internal compliance assistant.
[... 50K tokens of policies, tone rules, schema definitions ...]"""

TOOLS = [
    {
        "name": "lookup_customer",
        "description": "Fetch customer record by id.",
        "input_schema": {
            "type": "object",
            "properties": {"id": {"type": "string"}},
            "required": ["id"],
        },
    },
    # ... more tool definitions ...
]

# Mark the last tool with cache_control to cap the tools-array prefix.
TOOLS[-1] = {**TOOLS[-1], "cache_control": {"type": "ephemeral"}}

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=TOOLS,
    messages=[
        {"role": "user", "content": "Look up customer 12345 and summarize their last 3 tickets."}
    ],
)

print(resp.usage)
# Usage(input_tokens=42, cache_creation_input_tokens=51234,
#       cache_read_input_tokens=0, output_tokens=187)

On the second call within the TTL, with the same system prompt and tools array, you will see cache_creation_input_tokens=0 and cache_read_input_tokens=51234. The user turn at the bottom continues to bill at the standard input rate.

2. Multi-turn conversation

For a chat agent, the prefix you want to cache grows over time: turn 1 is stable by turn 2, turns 1-3 are stable by turn 4, and so on. The right pattern is to attach cache_control to the last assistant turn from the prior round, which makes everything up through that point cacheable.

def build_messages(history, new_user_input):
    messages = []
    for i, turn in enumerate(history):
        block = {"type": "text", "text": turn["text"]}
        # Mark the most recent assistant turn as a cache breakpoint.
        if i == len(history) - 1 and turn["role"] == "assistant":
            block["cache_control"] = {"type": "ephemeral"}
        messages.append({"role": turn["role"], "content": [block]})
    messages.append({"role": "user", "content": new_user_input})
    return messages

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=build_messages(history, "follow-up question..."),
)

This gives you up to three of the four available markers (system, last assistant turn, current user turn if needed) and lets the server pick the longest matching prefix on each call. As the conversation grows, the cache savings grow with it, because each new turn extends a prefix that was already cached on the prior call.

3. Long document, full content cached

For a document Q&A or code review workload, the document is the cacheable asset and the user turn is what changes. Put the document in its own content block at the start of the user message and mark it directly.

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system=[{"type": "text", "text": "You are a careful technical reviewer."}],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": large_document_text,  # 80K tokens
                    "cache_control": {"type": "ephemeral", "ttl": "1h"},
                },
                {
                    "type": "text",
                    "text": "Summarize the security implications in section 4.",
                },
            ],
        }
    ],
)

The first turn writes the document at the 1-hour rate. Every subsequent question against the same document, for an hour, reads at 0.1x. This is the pattern that produces the most dramatic dollar savings on long-context workloads, because the cache write is amortized across an entire reading session.

Through a gateway: what changes, what does not

If you call BUZZ AI Gateway instead of api.anthropic.com, prompt caching behaves identically from the application's point of view. The gateway is a transparent forwarder. Every byte of the request body, including cache_control markers, ordering of system blocks, ordering of tool definitions, and message content, is sent upstream without modification. The response, including the usage object with cache_creation_input_tokens and cache_read_input_tokens, is returned to the client as-is.

Concretely, the contract is:

BUZZ does not normalize whitespace, reorder JSON keys, strip null fields, or rewrite system prompts. Byte-exact prefix matching is preserved.
BUZZ does not inject its own system content, tool definitions, or guidance preamble. Prefix length stays whatever you sent.
BUZZ does not silently substitute models. claude-sonnet-4-6 reaches Anthropic as claude-sonnet-4-6, which keeps cache entries scoped correctly.
BUZZ operates with zero data retention. Bodies are not persisted; only billing metadata (model, token counts, timestamps) is kept. Caching state lives upstream at Anthropic, not at the gateway.

The point worth repeating: a gateway that rewrites requests breaks caching. A gateway that forwards bytes does not. The full lineup of supported models, including the cache-eligible ones used in this article, is at /models, and the live cost numbers including cache multipliers are at /api/pricing.

For Claude Code specifically, the install script at /sh/claudecode.sh configures the CLI to route through the gateway with one command:

curl -fsSL https://buzzai.cc/sh/claudecode.sh | bash

Claude Code's own caching of system prompts and tools continues to work because the request shape it sends is preserved end to end.

Common mistakes that quietly break caching

Most cache-related bugs are not visible: the request succeeds, the response is correct, the bill is just higher than it should be. The patterns below cover the failures we see most often.

Non-deterministic system prompts

If your system prompt is built by string-formatting in the current date, a request id, a user id, or any other per-request value above the cache marker, every call gets a fresh cache entry. Move volatile values below the cache marker, into the user turn, or into a separate non-cached block. A useful sanity check is to log a hash of the cached prefix on every call: if the hash changes between calls that should reuse cache, you have found the leak.

Reordered tool arrays or system blocks

Tools are often loaded from a registry that returns them in dictionary order, which is not stable across Python versions or across processes. Sort the tool array deterministically before sending and always slot the cache marker on the last element. The same applies to multi-block system arrays.

Missing the marker on later calls

Caching is opt-in per call. If only the first request includes cache_control and later requests omit it, the server has nothing to look up against and bills at the full input rate even though an entry exists. Place the marker in shared code, not in the first call site.

Prefix below the minimum length

Markers on prefixes shorter than 1024 tokens (2048 for Haiku) are ignored. If you cannot see cache_creation_input_tokens on the first call after enabling caching, your prefix is probably too short. Either consolidate more content above the marker or accept that this prompt is not large enough to benefit.

Wrong order of blocks

The marker caches everything above it, not below. If you place cache_control on the user's current question, you are asking the server to cache content that changes every call, which never produces a hit. Markers belong on stable blocks: system, tools, prior turns, and the early portion of a long document.

TTL drift

Reads against a 1-hour entry from a request that asks for a 5-minute TTL still work, but you cannot create a new 1-hour entry by mixing TTLs. Pick a TTL per breakpoint and stick to it across the lifetime of the cache.

Model swaps

Cache entries are scoped per model. Switching from claude-sonnet-4-6 to claude-opus-4-8 for "harder" requests is a complete cache miss for those requests, which means the higher per-token rate is paid against an uncached prefix. If you do A/B between models, expect both sides to maintain their own caches.

An observability checklist

Caching that you cannot see is caching you cannot tune. The minimum useful instrumentation is three counters per call, all available from the usage object on every response:

input_tokens — uncached prefix plus the new turn, billed at full input rate.
cache_creation_input_tokens — tokens written to a new cache entry, billed at the write multiplier.
cache_read_input_tokens — tokens served from cache, billed at the 0.1x read rate.

Aggregate these by route or feature, then plot the ratio cache_read / (cache_read + input + cache_creation) over time. A healthy production agent settles above 0.7 once warmed. Anything below 0.3 on a high-volume route is a tuning opportunity, almost always traceable to one of the mistakes in the previous section.

BUZZ exposes the same usage fields in the response body without modification, so the metric pipeline you would build against direct Anthropic usage works unchanged when you switch base_url to the gateway.

FAQ

Do I need to do anything special on the BUZZ side to enable caching?

No. Caching is an upstream feature governed entirely by the request body. If you set cache_control correctly, the gateway forwards it and Anthropic handles the rest.

Will caching change model behavior?

It should not. The encoded prefix is replayed as if it had been computed fresh; the model sees the same context. If you observe behavior drift after enabling caching, the most likely cause is an unrelated content change introduced at the same time.

What happens if Anthropic evicts the cache entry early?

The next request that would have been a hit is billed as a fresh write. There is no error, no warning, just a slightly higher bill on that call. Eviction is rare within the stated TTLs but not guaranteed.

Can I cache across users?

Yes, as long as the prefix is byte-identical and the calls are made under the same account. A shared system prompt and tools array used by every user benefits from a single cache entry, regardless of which user triggered the write.

Does caching change how rate limits work?

Cache reads still count against TPM (tokens per minute) for input, just at the much lower priced rate. RPM (requests per minute) is unaffected by caching.

Is caching available on every Claude model?

It is supported on the current Sonnet, Opus, and Haiku families, including claude-opus-4-8, claude-sonnet-4-6, and claude-haiku-4-5. Minimum prefix sizes vary; Haiku requires 2048 tokens before a marker takes effect.

Can I combine caching with the OpenAI-compatible endpoint?

Caching is an Anthropic Messages API feature. To use it, call Claude through the native Anthropic surface (BUZZ at https://buzzai.cc with the Anthropic SDK). The OpenAI-compatible endpoint at https://buzzai.cc/v1 is useful for portability but does not expose cache_control.

Should I cache few-shot examples?

Almost always yes. Few-shot examples are typically the longest stable region of a prompt and are reused identically across calls. Place them in the system block above the cache marker.

What if my system prompt is just barely under the minimum?

Pad it with content that is genuinely useful: explicit format examples, edge-case instructions, or a glossary. Padding for the sake of padding does not improve quality and often hurts it.

Conclusion

Prompt caching is the rare optimization that costs almost nothing to enable, requires no model change, and routinely cuts input spend by an order of magnitude. The reason it goes unused is operational, not technical: producing byte-identical prefixes across calls demands a small amount of discipline in how prompts are assembled, and most code does not start out that way.

The playbook is short. Mark the system prompt and tools array. Pick a TTL based on how often the prefix will recur. Keep volatile values below the marker. Watch cache_read_input_tokens and tune until it dominates. If you call Claude through BUZZ, every cache directive is forwarded unchanged, every usage field comes back unmodified, and the savings show up on the gateway bill at /api/pricing the same way they would on a direct integration. Models supported, including all cache-eligible variants, are listed at /models; Claude Code users can adopt the gateway in a single shell command from /sh/claudecode.sh.

If you do nothing else after reading this, instrument the three usage counters on your highest-volume Claude route. The gap between what your bill is and what it could be will be visible inside an hour.

Published: 2026-05-22

Last reviewed: 2026-05-22