Why Your cache_creation_input_tokens Is Zero: 7 Prompt Cache Antipatterns
You added cache_control to the system prompt. You re-deployed. You opened the dashboard expecting input spend to fall off a cliff. Instead, cache_creation_input_tokens reads 0 on every call, cache_read_input_tokens reads 0 on every call, and your bill looks the same. The cache was never built. Here are the seven reasons that happens, in roughly the order you should check them.
Quick recap of Anthropic Prompt Cache
The Anthropic prompt cache is one knob: a cache_control field attached to a content block. The server hashes the byte representation of every block up to and including the marker, and either creates a new entry or matches an existing one. There is one type today, ephemeral, with two TTL choices.
{"type": "ephemeral"} # 5-minute TTL (default)
{"type": "ephemeral", "ttl": "1h"} # 1-hour TTL
You can set up to four markers per request, across the system array, the tools array, and content blocks inside messages. The server picks the longest matching prefix on subsequent calls and bills the matched portion at 0.1x the base input rate. The unmatched prefix is billed at the cache write rate (1.25x for 5 minutes, 2x for 1 hour).
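A toy model can make the lookup rule concrete. The sketch below is not Anthropic's implementation; it only assumes what the recap states, namely that entries are keyed by a hash of everything up to a marker and that a later call reuses the longest marked prefix that already has an entry. ToyPrefixCache and its call method are hypothetical names, and counting whole blocks instead of tokens is a simplification.

```python
import hashlib
import json


class ToyPrefixCache:
    """Toy model of prefix caching; not the real server logic."""

    def __init__(self):
        self.entries = set()

    @staticmethod
    def _key(blocks):
        # Key on the block texts only (a modelling assumption).
        texts = [b["text"] for b in blocks]
        return hashlib.sha256(json.dumps(texts).encode()).hexdigest()

    def call(self, blocks):
        """Return (blocks_read, blocks_written) for one request."""
        marked = [i for i, b in enumerate(blocks) if "cache_control" in b]
        # The longest marked prefix that already has an entry wins.
        read = 0
        for i in reversed(marked):
            if self._key(blocks[: i + 1]) in self.entries:
                read = i + 1
                break
        # Every marked prefix gets an entry; re-adding one is a no-op.
        for i in marked:
            self.entries.add(self._key(blocks[: i + 1]))
        longest = marked[-1] + 1 if marked else 0
        return read, longest - read


cache = ToyPrefixCache()
stable = {"text": "policy", "cache_control": {"type": "ephemeral"}}
print(cache.call([stable, {"text": "question 1"}]))  # (0, 1): write
print(cache.call([stable, {"text": "question 2"}]))  # (1, 0): read
```

The second call reads because the marked prefix is unchanged; the differing question sits after the marker and never enters the key.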
Two facts decide everything below:
- The cache key is a byte-exact hash of the prefix. Whitespace counts. Order counts. Floating-point literals count.
- The minimum cacheable prefix is 1024 tokens on Sonnet and Opus, 2048 tokens on Haiku. A marker on a prefix shorter than the threshold is silently ignored.
If your usage object reads cache_creation_input_tokens=0 and cache_read_input_tokens=0, one of the seven antipatterns below is in the request you just sent.
Antipattern 1: cache_control on dynamic content
The marker caches everything above and including the block it is attached to. Putting it on a block whose content is different on every call is the most common cause of zero hits. The marker is not a hint; it is the boundary. If the block at the boundary is volatile, the entire prefix that ends there is volatile.
Broken
# The system prompt is stable, but the marker is on the user turn,
# which is a fresh string every call. Cache key changes every time.
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[{"type": "text", "text": LONG_STABLE_SYSTEM_PROMPT}],
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": user_question, # changes per call
"cache_control": {"type": "ephemeral"},
}
],
}
],
)
# usage.cache_creation_input_tokens = 51234 first call
# usage.cache_creation_input_tokens = 51234 second call (still 0 reads)
Each call writes a brand-new entry that no future call will ever match, because the user question is part of the cached prefix. You are paying the 1.25x write multiplier on every request.
Fixed
# Move the marker to the stable region. The system prompt becomes
# the cached prefix. The user turn lives below the marker and bills
# at the standard input rate, which is what you want.
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": LONG_STABLE_SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": user_question}],
)
# usage.cache_creation_input_tokens = 51234 first call
# usage.cache_read_input_tokens = 51234 subsequent calls
Rule of thumb: never attach cache_control to a block whose contents you cannot reproduce byte-for-byte on the next call.
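One way to enforce this rule is a unit test on the prompt builder itself. The sketch below assumes a hypothetical build_prefix() callable that returns the blocks up to and including the marker; it serializes several independent builds and compares the bytes. The two builders are made up for illustration.

```python
import itertools
import json


def prefix_bytes(blocks):
    # Fixed separators and no key sorting, so identical blocks always
    # serialize to identical bytes.
    return json.dumps(blocks, ensure_ascii=False, separators=(",", ":")).encode("utf-8")


def is_reproducible(build_prefix, runs=3):
    """True if build_prefix() yields byte-identical output on every run."""
    first = prefix_bytes(build_prefix())
    return all(prefix_bytes(build_prefix()) == first for _ in range(runs - 1))


# Hypothetical builders for illustration.
def stable_builder():
    return [{"type": "text", "text": "You are a support agent.",
             "cache_control": {"type": "ephemeral"}}]


_request_counter = itertools.count(1)


def volatile_builder():
    # Embeds a per-call value, so no two builds match.
    return [{"type": "text", "text": f"Request #{next(_request_counter)}",
             "cache_control": {"type": "ephemeral"}}]


print(is_reproducible(stable_builder))    # True
print(is_reproducible(volatile_builder))  # False
```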
Antipattern 2: breakpoint at the most volatile location
A subtler version of the same mistake. The block the marker sits on is stable, but a block above it in the prefix changes between calls. The cache key spans the entire prefix, so anything that is part of the prefix and changes invalidates the entry, even if the marker itself looks fine.
Broken
# Three system blocks. The first is the stable policy. The second
# carries today's date, which changes every midnight at minimum.
# The marker sits on the third, stable block. The prefix as a whole
# is volatile because of block 2, so no two calls share a cache entry
# unless they happen on the same day.
system_blocks = [
{"type": "text", "text": POLICY_DOCUMENT}, # stable, large
{"type": "text", "text": f"Today is {datetime.utcnow().date().isoformat()}"}, # volatile: changes at midnight UTC
{
"type": "text",
"text": OUTPUT_FORMAT_SPEC, # stable
"cache_control": {"type": "ephemeral"},
},
]
Every time the date block changes, the next call writes a new entry (cache_creation_input_tokens > 0), and nothing written before the change can ever be read again. The marker is in the right place; the volatility is upstream.
Fixed
# Move volatile content below the cached prefix. Either push it
# into the first user turn, or keep it in a system block placed
# after the block that carries the marker.
system_blocks = [
{"type": "text", "text": POLICY_DOCUMENT},
{
"type": "text",
"text": OUTPUT_FORMAT_SPEC,
"cache_control": {"type": "ephemeral"},
},
# Anything below the marker is uncached. Put dynamic context here.
{"type": "text", "text": f"Today is {datetime.utcnow().isoformat()}"},
]
The cached prefix is now genuinely stable across calls within the TTL. The dynamic block still ships to the model, just outside the cached region.
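When the volatility is upstream of the marker, it helps to find which block moved. The sketch below compares the cached prefix of two requests and returns the index of the first block that differs. The helper names are hypothetical, and it assumes the cached prefix is everything up to the last block carrying cache_control.

```python
def cached_prefix(blocks):
    """Blocks up to and including the last one carrying cache_control."""
    marked = [i for i, b in enumerate(blocks) if "cache_control" in b]
    return blocks[: marked[-1] + 1] if marked else []


def first_divergent_block(previous, current):
    """Index of the first differing block in the cached prefix, or None."""
    a, b = cached_prefix(previous), cached_prefix(current)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Same content so far; a length mismatch means the marker moved.
    return None if len(a) == len(b) else min(len(a), len(b))


policy = {"type": "text", "text": "POLICY"}
spec = {"type": "text", "text": "SPEC", "cache_control": {"type": "ephemeral"}}
monday = {"type": "text", "text": "Today is 2026-05-25"}
tuesday = {"type": "text", "text": "Today is 2026-05-26"}

# Date block above the marker: block 1 breaks the prefix.
print(first_divergent_block([policy, monday, spec], [policy, tuesday, spec]))  # 1
# Date block below the marker: the cached prefixes match.
print(first_divergent_block([policy, spec, monday], [policy, spec, tuesday]))  # None
```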
Antipattern 3: 4 breakpoints used wrong
Anthropic gives you four markers per request. The temptation is to spend all four. The trap is that markers do not magically stack savings; they create breakpoints, and only the longest matching prefix counts on a hit. Spending four markers near each other on the same stable block produces one effective breakpoint and three wasted slots.
Broken
# Four markers, all clustered near the end of the system prompt.
# Three of them break the prefix into segments that no real call
# pattern uses, and the fourth is the only one that matters.
system_blocks = [
{"type": "text", "text": POLICY_PART_1, "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": POLICY_PART_2, "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": POLICY_PART_3, "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": POLICY_PART_4, "cache_control": {"type": "ephemeral"}},
]
# Effectively only the last marker contributes; the budget is exhausted
# before you can place one on tools or on a prior conversation turn.
The pattern that pays for the budget is to put each marker at a different stability boundary, where stability boundary means: how often does the content above this point change relative to content below it?
Fixed
# One marker per natural boundary. Four breakpoints, four different
# reuse patterns. The server picks the longest matching prefix per call.
system_blocks = [
{
"type": "text",
"text": POLICY_DOCUMENT, # changes monthly
"cache_control": {"type": "ephemeral", "ttl": "1h"},
},
]
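tools = [
    *FUNCTION_DEFS,  # changes per release
]
# Tools come before system in the cached prefix, and a longer TTL must
# not follow a shorter one, so this marker uses 1h like the policy block.
tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral", "ttl": "1h"}}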
messages = [
# Few-shot examples, stable per workspace
{"role": "user", "content": [{"type": "text", "text": FEW_SHOT_EXAMPLES,
"cache_control": {"type": "ephemeral"}}]},
{"role": "assistant", "content": "Understood."},
# Prior conversation turn, stable for the rest of the session
{"role": "assistant", "content": [{"type": "text", "text": prior_turn_text,
"cache_control": {"type": "ephemeral"}}]},
{"role": "user", "content": current_user_input},
]
Now each marker corresponds to a different reuse horizon: policy across hours, tools across releases, few-shots across the workspace, and the prior turn across this session. The server picks the longest matching prefix and you get cumulative savings instead of redundant ones.
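A small linter can catch a blown marker budget before the request leaves the process. This is a sketch under two assumptions taken from the Anthropic documentation as best understood here: the prefix is assembled in the order tools, system, messages, and a 1h marker may not follow a 5m marker in that order. marker_positions and lint_markers are hypothetical helper names.

```python
def marker_positions(tools, system, messages):
    """Return (location, ttl) for each cache_control marker, in prefix order."""
    found = []

    def scan(blocks, label):
        for i, block in enumerate(blocks):
            if isinstance(block, dict) and "cache_control" in block:
                ttl = block["cache_control"].get("ttl", "5m")
                found.append((f"{label}[{i}]", ttl))

    scan(tools, "tools")
    scan(system, "system")
    for m, message in enumerate(messages):
        content = message["content"]
        if isinstance(content, list):  # plain-string content carries no marker
            scan(content, f"messages[{m}].content")
    return found


def lint_markers(tools, system, messages, max_markers=4):
    """Return a list of human-readable problems; empty means no findings."""
    found = marker_positions(tools, system, messages)
    problems = []
    if len(found) > max_markers:
        problems.append(f"{len(found)} markers, limit is {max_markers}")
    seen_5m = False
    for location, ttl in found:
        if ttl == "5m":
            seen_5m = True
        elif seen_5m:
            problems.append(f"{location}: {ttl} marker after a 5m marker")
    return problems
```

Running lint_markers in a test against the request builder keeps the four-marker budget and the TTL ordering from regressing silently.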
Antipattern 4: timestamps and UUIDs in the prompt
This is antipattern 1 in disguise, common enough to deserve its own section. Anything that looks like a request id, a session id, an ISO timestamp, a random nonce, or a request-bound user agent will silently invalidate the cache when it lands above the marker. The request still works. The bill goes up.
Broken
# A trace id is helpful for debugging, but if it lands inside the
# cached prefix it changes the byte hash on every call.
trace_id = uuid.uuid4().hex
system_text = (
f"# Trace: {trace_id}\n"
f"# Generated at: {datetime.utcnow().isoformat()}\n"
f"\n"
f"{POLICY_DOCUMENT}\n"
)
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[{"type": "text", "text": system_text,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": user_question}],
)
Every call gets a fresh trace id. Every call writes a fresh cache entry. The hit rate is structurally zero.
Fixed
# Keep the cached system prompt deterministic. Pass per-call
# metadata via metadata.user_id (which does not affect the cache key)
# or via the user turn (which is below the marker).
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": POLICY_DOCUMENT, # no trace id, no timestamp
"cache_control": {"type": "ephemeral"},
}
],
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": f"[trace={trace_id}]"},
{"type": "text", "text": user_question},
],
}
],
metadata={"user_id": user_id}, # never part of the cache key
)
If you cannot live without a trace marker visible to the model, put it in the user turn where it belongs. Cached blocks should be reproducible from a deterministic builder function with no datetime.now(), no uuid4(), and no environment-dependent values.
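A cheap guard is to scan the text destined for the cached region for values that look per-request. The patterns below are illustrative, not exhaustive; the function name and the pattern set are assumptions made for this sketch.

```python
import re

# Patterns for values that usually differ on every request.
VOLATILE_PATTERNS = {
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
    "hex32": re.compile(r"\b[0-9a-f]{32}\b"),  # e.g. uuid4().hex
    "iso_timestamp": re.compile(r"\b\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
}


def find_volatile_values(text):
    """Return (pattern_name, matched_text) pairs found in text."""
    hits = []
    for name, pattern in VOLATILE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


sample = (
    "# Trace: 3f2b8c1d9e4a4f6b8a7c5d2e1f0a9b8c\n"
    "# Generated at: 2026-05-26T10:00:00.123456\n"
    "Refunds are processed within 5 days.\n"
)
for name, value in find_volatile_values(sample):
    print(name, value)
```

Failing the build when this returns anything for the cached blocks is stricter than necessary for some prompts, so treat the findings as warnings to review.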
Antipattern 5: under the 1024-token threshold
Markers do not raise an error when the prefix is too short. They are silently ignored, and the request is billed at the standard input rate. Teams enable caching on a small system prompt, watch cache_creation_input_tokens stay at zero on the first call, and conclude that caching is broken. It is not broken; the prefix simply did not qualify.
Broken
# A short system prompt. Abbreviated here; assume the full text is
# ~600 tokens, still well below the 1024 minimum. The marker is set,
# but the server ignores it.
SHORT_SYSTEM = """You are a helpful assistant. Answer concisely.
Use markdown. Cite sources. Do not invent facts."""
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[{"type": "text", "text": SHORT_SYSTEM,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": user_question}],
)
# usage.cache_creation_input_tokens = 0
# usage.cache_read_input_tokens = 0
# usage.input_tokens = 612 + user turn (full price)
Two corrections work, depending on the actual workload.
Fixed (consolidate above the marker)
# If you have content scattered across several places that does belong
# in context, consolidate it above the marker until the prefix crosses
# the threshold. Few-shot examples, glossaries, and tool schemas are
# the usual sources.
SYSTEM = SHORT_SYSTEM + "\n\n" + GLOSSARY + "\n\n" + FEW_SHOT_EXAMPLES
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[{"type": "text", "text": SYSTEM,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": user_question}],
tools=TOOLS, # tools count toward the prefix
)
Fixed (accept that this prompt is not a candidate)
# If there is genuinely no content above 1024 tokens that needs to
# be cached, drop the marker. Padding for padding's sake hurts
# quality and does not help the bill.
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=512,
system=[{"type": "text", "text": SHORT_SYSTEM}],
messages=[{"role": "user", "content": user_question}],
)
Not every prompt is worth caching. The threshold presumably exists because the bookkeeping for very small entries would cost more than it saves. If your prompt is naturally short, leave it alone.
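A rough pre-flight check can tell you whether a marker has any chance of qualifying. The sketch below uses the thresholds stated in this article and a characters-per-token ratio of 4, which is only a coarse heuristic for English text and an assumption of this example; for a real decision, count tokens with the API's token counting support instead.

```python
# Thresholds as stated in this article, keyed by model family.
MIN_CACHEABLE_TOKENS = {"sonnet": 1024, "opus": 1024, "haiku": 2048}


def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; assumes ~4 characters per token."""
    return int(len(text) / chars_per_token)


def likely_cacheable(prefix_text, model_family):
    """True if the estimated prefix length reaches the family's minimum."""
    return estimate_tokens(prefix_text) >= MIN_CACHEABLE_TOKENS[model_family]


short_prompt = "You are a helpful assistant. Answer concisely."
medium_prompt = "x" * 6000  # ~1500 estimated tokens

print(likely_cacheable(short_prompt, "sonnet"))   # False
print(likely_cacheable(medium_prompt, "sonnet"))  # True
print(likely_cacheable(medium_prompt, "haiku"))   # False
```

Prefixes that land near the threshold are exactly where the heuristic is least trustworthy, so confirm those against cache_creation_input_tokens on a real call.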
Antipattern 6: cross-user system prompt with user-specific bits
The most painful version of antipattern 1, because it is invisible at small scale. A team builds a multi-tenant agent. The system prompt is shared. They sprinkle the user's name, role, or workspace id into the system prompt for personalization. With one user it looks fine. With N users there are N distinct cached prefixes: the shared policy is written once per user instead of once per TTL window, and no user ever reads an entry that another user paid to write.
Broken
# User-specific data above the cache marker. Each user creates
# their own entry. Cross-user reuse is impossible.
def build_system(user):
    return f"""You are an assistant for {user.name} ({user.role}).
The current workspace is {user.workspace_id}.
Their preferred output format is {user.format_pref}.
{COMMON_POLICY}
{COMMON_FEW_SHOTS}"""
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[{"type": "text", "text": build_system(user),
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": user_question}],
)
If 1000 users send a request inside the TTL, the server creates 1000 cache entries. If each user's first call is also their only call within the window, the 0.1x read rate never kicks in at all.
Fixed
# Two system blocks. The first is identical for every user and
# carries the cache marker. The second is per-user and lives below
# the marker, so it does not affect the cache key.
def build_system(user):
    return [
        {
            "type": "text",
            "text": COMMON_POLICY + "\n\n" + COMMON_FEW_SHOTS,
            "cache_control": {"type": "ephemeral"},
        },
        {
            "type": "text",
            "text": (
                f"User: {user.name} ({user.role})\n"
                f"Workspace: {user.workspace_id}\n"
                f"Format: {user.format_pref}"
            ),
        },
    ]
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=build_system(user),
messages=[{"role": "user", "content": user_question}],
)
One cache entry, shared across every user in the account. The personalization block ships uncached, but it is small relative to the policy and few-shot content. The first user inside the TTL pays the write; everyone else pays a read.
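The fragmentation is easy to measure offline. The sketch below counts how many distinct cached prefixes a system-prompt builder produces across a set of users; one is the goal. The User record, the placeholder policy text, and both builders are hypothetical stand-ins for the real ones.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class User:  # hypothetical user record
    name: str
    role: str


COMMON_POLICY = "Shared policy text (placeholder)."


def broken_system(user):
    text = f"You are an assistant for {user.name} ({user.role}).\n{COMMON_POLICY}"
    return [{"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}]


def fixed_system(user):
    return [
        {"type": "text", "text": COMMON_POLICY, "cache_control": {"type": "ephemeral"}},
        {"type": "text", "text": f"User: {user.name} ({user.role})"},
    ]


def distinct_cached_prefixes(build_system, users):
    """Count distinct cached prefixes (blocks up to the last marker)."""
    keys = set()
    for user in users:
        blocks = build_system(user)
        marked = [i for i, b in enumerate(blocks) if "cache_control" in b]
        prefix = blocks[: marked[-1] + 1] if marked else []
        keys.add(hashlib.sha256(json.dumps(prefix).encode()).hexdigest())
    return len(keys)


users = [User("Ana", "admin"), User("Ben", "viewer"), User("Cy", "editor")]
print(distinct_cached_prefixes(broken_system, users))  # 3
print(distinct_cached_prefixes(fixed_system, users))   # 1
```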
Antipattern 7: gateway or proxy strips cache_control
The application code is correct. The cache directive is on the right block. The prefix is large, stable, and deterministic. And cache_creation_input_tokens is still zero on every call. At this point the suspect is the network path.
Some gateways and observability proxies rewrite the request body before forwarding it: stripping unknown fields, normalizing JSON whitespace, sorting object keys, downcasing model identifiers, or injecting an extra system message. Any of these mutations can change the prefix that Anthropic hashes, and the cache loses its key. Worse, some gateways drop the cache_control field entirely on the assumption that it is provider-specific metadata.
Broken
# Calling through a proxy that helpfully reformats the request body.
# The application sends cache_control. The proxy strips it (or sorts
# the JSON keys, or trims trailing whitespace). Anthropic sees a
# different prefix on every call.
client = Anthropic(
base_url="https://my-proxy.internal", # rewrites the request body
api_key="sk-...",
)
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[{"type": "text", "text": LARGE_STABLE_PROMPT,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": user_question}],
)
# usage.cache_creation_input_tokens = 0
# usage.cache_read_input_tokens = 0
# usage.input_tokens = full prefix + user turn
The request succeeds. The response is correct. The bill is the same as before you added caching. This is the worst category of bug because nothing surfaces in the application.
Fixed
# Use a transparent gateway that forwards the request body byte for
# byte. BUZZ AI Gateway preserves cache_control, system block order,
# tool ordering, whitespace, and the model identifier. The usage
# object is returned unchanged.
client = Anthropic(
base_url="https://buzzai.cc",
api_key="sk-...", # BUZZ key
)
resp = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[{"type": "text", "text": LARGE_STABLE_PROMPT,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": user_question}],
)
# usage.cache_creation_input_tokens = 51234 first call
# usage.cache_read_input_tokens = 51234 subsequent calls within TTL
This is a deliberate BUZZ AI Gateway design choice. The gateway does not normalize whitespace, reorder JSON keys, strip unknown fields, or rewrite the model identifier. Every cache directive on system blocks, tool definitions, and message content is forwarded unmodified, and the response is returned as-is. Zero retention means request bodies are not persisted; only billing metadata is kept. The full lineup of cache-eligible models is at /models, and live cache-multiplier pricing is at /api/pricing.
If you suspect your current path is rewriting requests, the diagnostic is simple: send the same request twice and check the usage. If cache_read_input_tokens is non-zero on the second call, the path is transparent. If it stays at zero on a request that should hit, the path is rewriting something.
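The double-send check is short enough to keep as a script. In the sketch below, send is a hypothetical callable that performs the API call through the path under test and returns the usage fields as a dict; the two fake paths exist only so the example runs without a network.

```python
def second_call_reads_cache(send, request):
    """Send the same request twice; True if the second call hit the cache."""
    send(request)           # first call should write the entry
    second = send(request)  # second call should read it
    return second.get("cache_read_input_tokens", 0) > 0


# Stand-ins for two network paths, for illustration only.
def make_transparent_path():
    seen = set()

    def send(request):
        key = repr(request)
        hit = key in seen
        seen.add(key)
        return {
            "cache_creation_input_tokens": 0 if hit else 51234,
            "cache_read_input_tokens": 51234 if hit else 0,
        }

    return send


def stripping_path(request):
    # Models a proxy that drops cache_control: nothing written, nothing read.
    return {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}


request = {"system": "LARGE_STABLE_PROMPT", "question": "same both times"}
print(second_call_reads_cache(make_transparent_path(), request))  # True
print(second_call_reads_cache(stripping_path, request))           # False
```

In real use, send would wrap client.messages.create and return the fields from resp.usage. Run the two calls back to back so TTL expiry cannot explain a miss, and use a prefix that is comfortably above the minimum.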
Diagnostic: reading cache behavior from usage fields
Every Messages API response carries a usage object. Three of its fields tell you everything about caching for that call.
| Field | Meaning | Billed at |
|---|---|---|
| input_tokens | Tokens not served from cache and not written to cache. The user turn, plus any prefix that fell outside the matched cache region. | 1.0x base input |
| cache_creation_input_tokens | Tokens written to a new cache entry on this call. | 1.25x (5m) or 2x (1h) |
| cache_read_input_tokens | Tokens served from an existing cache entry. | 0.1x base input |
The minimum useful instrumentation logs all three per call, plus the model and a hash of the cached prefix. A working cache settles into a recognizable pattern.
# First call after deploy
input_tokens = 142
cache_creation_input_tokens = 51234
cache_read_input_tokens = 0
# Second call, same prefix, within TTL
input_tokens = 158
cache_creation_input_tokens = 0
cache_read_input_tokens = 51234
# 100th call, same prefix
input_tokens = 173
cache_creation_input_tokens = 0
cache_read_input_tokens = 51234
Failure modes look like:
# Antipatterns 1, 2, 4, 6: prefix changes every call
input_tokens = 142
cache_creation_input_tokens = 51234 # every call
cache_read_input_tokens = 0 # every call
# Antipattern 5: prefix below threshold
input_tokens = 612 + user turn
cache_creation_input_tokens = 0
cache_read_input_tokens = 0
# Antipattern 7: gateway rewriting body
input_tokens = 51234 + user turn
cache_creation_input_tokens = 0
cache_read_input_tokens = 0
The shape of the failure narrows the cause. Persistent cache_creation_input_tokens > 0 with zero reads means a volatile prefix. Both at zero with a large input_tokens means the marker is not taking effect, either because the prefix is too short to qualify or because the network path is stripping or mutating the request.
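Those shapes can be turned into a small classifier for logs or alerts. This is a sketch: diagnose is a hypothetical helper, and it assumes the list holds usage dicts from consecutive calls that were supposed to share one prefix within the TTL.

```python
def diagnose(usages):
    """Classify usage dicts from calls that should share a cached prefix."""
    if len(usages) < 2:
        return "need at least two calls"
    any_read = any(u["cache_read_input_tokens"] > 0 for u in usages)
    any_write = any(u["cache_creation_input_tokens"] > 0 for u in usages)
    if any_read:
        return "cache working"
    if any_write:
        # Writes without a single read: the prefix keeps changing.
        return "volatile prefix (antipatterns 1, 2, 4, 6)"
    # Neither written nor read: the marker never took effect.
    return "marker not taking effect (antipatterns 5, 7)"


healthy = [
    {"cache_creation_input_tokens": 51234, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 51234},
]
volatile = [
    {"cache_creation_input_tokens": 51234, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 51234, "cache_read_input_tokens": 0},
]
ignored = [
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 0},
]
for sample in (healthy, volatile, ignored):
    print(diagnose(sample))
```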
Hash logging closes the loop:
import hashlib
import json
import logging

logger = logging.getLogger(__name__)

def cache_prefix_hash(system_blocks, tools):
    # sort_keys=False keeps the key order exactly as the blocks were built.
    payload = json.dumps([system_blocks, tools], sort_keys=False)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# system and tools are the same objects passed to client.messages.create.
logger.info("claude_call", extra={
    "model": "claude-sonnet-4-6",
    "prefix_hash": cache_prefix_hash(system, tools),
    "input_tokens": resp.usage.input_tokens,
    "cache_creation": resp.usage.cache_creation_input_tokens,
    "cache_read": resp.usage.cache_read_input_tokens,
})
Calls that should reuse the cache will share a prefix_hash. If the hash differs across calls that should hit, antipattern 1, 2, 4, or 6 is at play. If the hash matches across calls but the cache still misses, antipattern 5 or 7 is at play.
Real numbers: 70 percent to 99 percent hit rate
The arithmetic on a synthetic but realistic workload is the cleanest way to see what fixing these antipatterns is worth. Take a customer-support agent on Sonnet with the following shape, and assume 10,000 calls per hour during business hours.
- System prompt: 38,000 tokens (policy, tone, tool descriptions in prose, format spec).
- Tools: 6,000 tokens of function definitions.
- Few-shot examples: 8,000 tokens.
- Average user turn: 250 tokens.
- Average output: 300 tokens.
The cacheable prefix is roughly 52,000 tokens. Without caching, hourly input cost is
10,000 calls x 52,250 tokens = 522,500,000 input tokens/hour
522,500,000 / 1,000,000 x $3.00 = $1,567.50/hour
With antipattern 6 in place (per-user data above the marker), each user gets their own entry. Assume 3,000 active users in the hour, each making roughly 3.3 calls close enough together that their entry stays warm. Each user's first call writes; the remaining 7,000 calls read.
Writes: 3,000 x 52,000 / 1,000,000 x $3.75 = $585.00
Reads: 7,000 x 52,000 / 1,000,000 x $0.30 = $109.20
User turns (uncached): 10,000 x 250 / 1M x $3 = $7.50
Total = $701.70/hour
Hit rate (read / (read + create + input)) = 364.0M / 522.5M ~= 70%
That is already a meaningful saving versus no caching, but 3,000 of the 10,000 calls still pay the 1.25x write rate on the full prefix. Now move the user-specific bits below the marker (antipattern 6 fix), so all 10,000 calls share one cache entry per TTL window. Assume, pessimistically, one write per 5-minute window: 12 writes per hour and 9,988 reads. Each read refreshes the TTL, so at this call rate the real number of writes is likely lower.
Writes: 12 x 52,000 / 1,000,000 x $3.75 = $2.34
Reads: 9,988 x 52,000 / 1,000,000 x $0.30 = $155.81
User turns (uncached): 10,000 x 250 / 1M x $3 = $7.50
Total = $165.65/hour
Hit rate = 519.4M / 522.5M ~= 99%
From $1,567.50/hour uncached, to $701.70/hour with the per-user-bits antipattern, to $165.65/hour after the fix. The fix is moving roughly fifty bytes of personalization from one position in the system array to another. The hit rate went from about 70 percent to about 99 percent without changing the model, the prompt content, or the output quality.
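The same arithmetic fits in a few lines if you want to plug in your own traffic. This is a sketch under the assumptions used above: $3.00 per million base input tokens, a 1.25x write multiplier, a 0.1x read multiplier, and every call that does not write is a read. The function names are hypothetical.

```python
def uncached_cost(calls, prefix_tokens, turn_tokens, base=3.00):
    """Hourly input cost in dollars with no caching at all."""
    return calls * (prefix_tokens + turn_tokens) / 1e6 * base


def cached_cost(calls, prefix_tokens, turn_tokens, writes,
                base=3.00, write_mult=1.25, read_mult=0.10):
    """Hourly input cost in dollars when `writes` calls create the entry."""
    reads = calls - writes
    write_cost = writes * prefix_tokens / 1e6 * base * write_mult
    read_cost = reads * prefix_tokens / 1e6 * base * read_mult
    turn_cost = calls * turn_tokens / 1e6 * base
    return write_cost + read_cost + turn_cost


def token_hit_rate(calls, prefix_tokens, turn_tokens, writes):
    """Share of input tokens served from cache: read / (read + create + input)."""
    read = (calls - writes) * prefix_tokens
    total = calls * (prefix_tokens + turn_tokens)
    return read / total


print(f"{uncached_cost(10_000, 52_000, 250):.2f}")
print(f"{cached_cost(10_000, 52_000, 250, writes=3_000):.2f}")
print(f"{cached_cost(10_000, 52_000, 250, writes=12):.2f}")
```

The three printed values correspond to the uncached, per-user-entry, and shared-entry scenarios; only the writes argument differs between the last two.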
This is why the seven antipatterns matter. Each one looks small in isolation. Each one quietly compounds against the multiplier on every call.
FAQ
What does cache_creation_input_tokens equal to zero actually mean?
Either the request reused an existing entry (in which case cache_read_input_tokens is non-zero) or the cached prefix was below the 1024-token minimum and the marker was silently ignored. If both cache_creation_input_tokens and cache_read_input_tokens are zero, caching did not engage at all.
Does cache_control: ephemeral mean the cache disappears immediately?
No. ephemeral is the type identifier, not a duration. The actual lifetime is governed by ttl, defaulting to 5 minutes and configurable to 1h. Within the TTL the entry behaves like any server cache: present until evicted.
Why does my cache fail when I add a timestamp to the system prompt?
The cache key is a byte-exact hash of the prefix up to the marker. A timestamp changes the prefix every call, so every call writes a fresh entry and no call ever reads. Move dynamic values below the marker, into the user turn, or into a non-cached block placed after the cached one.
How many cache_control breakpoints can I use in one request?
Up to four. Each marker creates a breakpoint, and on subsequent calls the server picks the longest matching prefix. Spending all four near each other in the same stable region wastes the budget; place them at distinct stability boundaries.
Is the 1024-token minimum the same on every model?
No. Most current Claude models use 1024 tokens as the minimum cacheable prefix. Haiku raises the threshold to 2048. A marker on a shorter prefix is silently ignored and the call is billed at the standard input rate.
Will a gateway break my prompt cache?
It can. Cache hits depend on byte-exact prefix match, so any gateway that rewrites the request body, normalizes JSON, reorders keys, or trims whitespace destroys the key. BUZZ AI Gateway forwards the body unchanged, which is why cache_control directives survive end to end.
Can I cache user-specific content in a shared system prompt?
Not without giving up the shared cache. If the user id, role, or preferences live above the marker, every user has their own entry. Keep the cached region user-agnostic and pass user-specific data as a non-cached system block placed after the cached one or in the first user turn.
How do I diagnose a low cache hit rate quickly?
Log input_tokens, cache_creation_input_tokens, and cache_read_input_tokens per call, plus a SHA-256 hash of the cached prefix. If cache_creation_input_tokens is consistently non-zero on a route that should reuse the prefix, the prefix is changing between calls. Hash mismatches identify the volatile region.
Conclusion
Prompt caching looks like a one-line change because it is. The reason real workloads see cache_creation_input_tokens=0 after enabling it is that the surrounding code is not yet writing byte-deterministic prefixes. Each of the seven antipatterns above is a different way the cache fails to engage or fails to be reused: a marker on dynamic content, a volatile block upstream of the marker, breakpoints spent in the wrong places, timestamps and UUIDs in the prefix, prompts below the threshold, per-user personalization that fragments the cache, or a network path that rewrites bodies in transit.
The fix in every case is the same shape: stable content above the marker, dynamic content below, one breakpoint per real stability boundary, and a transparent path to Anthropic. Once those four conditions hold, the usage object turns honest, the hit rate climbs above 0.7, and the bill finally reflects the multiplier you thought you were buying.
Cache directives forwarded byte-for-byte
BUZZ AI Gateway is a transparent forwarder for the Anthropic Messages API. cache_control on system blocks, tools, and messages reaches Anthropic unmodified. cache_creation_input_tokens and cache_read_input_tokens come back unmodified. Zero retention on request bodies. Switch base_url to https://buzzai.cc and your existing caching code keeps working.
Published: 2026-05-26
Last reviewed: 2026-05-26