BUZZ AI Gateway
Docs · Concepts · Prompt Caching

Prompt Caching

Anthropic's prompt cache lets repeated long contexts be billed at one-tenth the input price. The mechanism is dead simple from the outside: tag a content block with cache_control, and on subsequent calls those tokens count as cache_read_input_tokens instead of input_tokens. Knowing when, where, and how long to cache turns this from a nice optimization into a major cost lever.

What it is

A Claude request is, at the model level, a flat stream of tokens. The cache lets Anthropic memoize the prefix of that stream. The first time you send a tagged prefix, the model processes it and Anthropic writes the resulting attention state to an ephemeral cache. Subsequent requests that share the same prefix skip the compute and pay a small read fee instead.

The pricing model is straightforward:

Token categoryRelative cost
Standard input1.0×
Cache write (5-minute)1.25× (one-time)
Cache write (1-hour)2.0× (one-time)
Cache read0.1×
Output5.0× (varies by model)

You pay a small premium on the first call, then 90% off forever (well, until the TTL expires). The arithmetic is favorable any time the cached prefix is reused at least twice within its TTL window.

These multipliers are Anthropic-published pricing, applied upstream. BUZZ forwards cache_control directives byte-for-byte and bills based on the cached/non-cached token counts the upstream returns — it does not impose any additional cache surcharge.

How cache_control works

You annotate one or more content blocks with cache_control. The marker says "everything from the start of the request up to and including this block is a candidate cache prefix". Anthropic builds a hash of that prefix and looks it up.

{
  "model": "claude-haiku-4-5-20251001",
  "max_tokens": 512,
  "system": [
    {
      "type": "text",
      "text": "<30 KB of stable instructions and context>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "Question about the context above"}
  ]
}

Two important rules:

  1. Cache breakpoints are positional. The cached span runs from byte zero to the marked block. If you change anything earlier in the prompt, the cache misses. Stable, unchanging content goes first.
  2. You can mark up to four breakpoints. They form a layered cache, longest-match-wins. Use this to cache a stable system prompt, then a stable middle section, then a section that updates per-session, etc.

Cache hits show up in the response under usage:

"usage": {
  "input_tokens": 12,
  "cache_creation_input_tokens": 0,
  "cache_read_input_tokens": 30000,
  "output_tokens": 240
}

That cache_read_input_tokens: 30000 is the win. Those 30,000 tokens cost one-tenth what they would have cost as standard input.

5-minute vs 1-hour TTL

Anthropic offers two cache durations. The decision is almost always driven by inter-request gap.

TTLWrite premiumUse when
5 minutes (default)1.25×Interactive sessions where the same context is hit multiple times within a few minutes (chat conversations, agent tool loops, IDE assistants).
1 hour2.0×Periodic batch jobs, scheduled report generation, or anything where the same context is reused across requests separated by minutes-to-an-hour.

Specify the longer TTL like this:

"cache_control": {"type": "ephemeral", "ttl": "1h"}

The simple decision rule: if your typical re-use gap is under five minutes, take the default. If your re-use gap is between five minutes and an hour, the 1-hour cache is cheaper than re-creating a 5-minute cache. Above an hour, you are creating fresh caches every time and should re-think whether caching helps at all.

How BUZZ forwards cache_control

BUZZ does not interpret cache_control. It is forwarded byte-for-byte to the upstream channel. The cache itself lives in Anthropic's infrastructure, not in BUZZ. This has three implications worth understanding:

Practical patterns

Long stable system prompt. If your assistant has a 20K-token instruction and example block, mark it with cache_control. Every conversation turn after the first becomes near-free on the input side.

RAG with a stable retrieval set. If a session retrieves the same documents repeatedly, place those documents in the system prompt and cache them. The user message changes; the cache prefix does not.

Tool definition reuse. Long tools[] arrays can be cached. Place the cache breakpoint after the last tool definition so the entire tool catalog becomes a single cached prefix.

Multi-step agent loops. Each tool round-trip is a new request. With caching, the conversation history (often the longest part of the prompt) is paid once and read back on every iteration. The savings compound with conversation length.

See also