BUZZ AI Gateway

Document Q&A

Pour a long document into Claude's 200K-token context, mark it as cached, and stream answers. Every question after the first reads the document from cache at 1/10 the input price.

POST https://buzzai.cc/v1/messages
When to use this pattern. A single document of 5K-200K tokens that is asked many questions over a session: contracts, RFCs, manuals, codebases dumped to text, transcripts. If your corpus is millions of tokens, use retrieval to slice it down and apply this pattern to the slice.
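
A rough size pre-check for the bounds above could look like the sketch below. It is only a heuristic: the 4-characters-per-token ratio is an assumption for English prose, not a tokenizer, and `estimate_tokens`, `fits_in_context`, and the `overhead_tokens` allowance are hypothetical names and values chosen for illustration. The point it makes is that the document has to share the 200K window with the instructions, the question, and the answer.

```python
# Rough sizing check. The chars-per-token ratio is an assumption, not a tokenizer.
CONTEXT_WINDOW = 200_000  # context size quoted in this recipe
CHARS_PER_TOKEN = 4       # crude heuristic for English prose (assumption)


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character length."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(doc_text: str, max_tokens: int = 1024,
                    overhead_tokens: int = 500) -> bool:
    """True if the document plausibly fits alongside the answer budget.

    overhead_tokens is a hypothetical allowance for instructions + question.
    """
    budget = CONTEXT_WINDOW - max_tokens - overhead_tokens
    return estimate_tokens(doc_text) <= budget
```

If `fits_in_context` returns False, retrieval is the safer route; if it returns True near the limit, an exact token count is worth getting before relying on the estimate.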

Request shape

Three layers, in this exact order:

  1. System block: instructions about how to answer (cached).
  2. System block: the document itself, wrapped in clear delimiters (cached).
  3. User message: the actual question (not cached).

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "stream": true,
  "system": [
    {
      "type": "text",
      "text": "You answer questions strictly from the document below. If the answer is not in the document, say so. Quote the source span verbatim when relevant."
    },
    {
      "type": "text",
      "text": "<DOCUMENT name=\"contract.pdf\">\n... 25,000 tokens of document text ...\n</DOCUMENT>",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "What is the termination clause and how much notice is required?"}
  ]
}
Cache hits require an exact prefix match. The instructions block does not need its own cache_control: the breakpoint on the document block caches everything before it as well, so the instructions ride along as part of the prefix. What matters is that everything up to that breakpoint, instructions and document alike, is byte-identical across calls. If you concatenate user-specific data into the document, the cache misses.
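
One way to catch accidental prefix drift before it costs you a cache miss is to fingerprint the system blocks locally. This is a minimal sketch: `prefix_fingerprint` and `build_blocks` are hypothetical helpers, and the hash is only a local comparison aid, not the cache key the API actually uses.

```python
import hashlib
import json


def build_blocks(instructions: str, doc_name: str, doc_text: str) -> list[dict]:
    """Assemble the two system blocks in the order this recipe uses."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": f'<DOCUMENT name="{doc_name}">\n{doc_text}\n</DOCUMENT>',
            "cache_control": {"type": "ephemeral"},
        },
    ]


def prefix_fingerprint(system_blocks: list[dict]) -> str:
    """Hash a canonical JSON form of the blocks so two requests can be compared."""
    canonical = json.dumps(system_blocks, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = build_blocks("Answer from the document.", "contract.txt", "Clause 1 ...")
b = build_blocks("Answer from the document.", "contract.txt", "Clause 1 ...")
# Splicing per-user data into the document changes the prefix.
c = build_blocks("Answer from the document.", "contract.txt", "Clause 1 ...\nUser: alice")

print(prefix_fingerprint(a) == prefix_fingerprint(b))  # True
print(prefix_fingerprint(a) == prefix_fingerprint(c))  # False
```

Logging the fingerprint next to the usage counters makes it easy to see whether an unexpected cache_create came from a changed prefix or from an expired cache.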

Pick a model

Model                      Fit
claude-haiku-4-5-20251001  FAQ-style lookups, short documents (<10K tokens), high-volume Q&A.
claude-sonnet-4-6          Default. Long documents, multi-step reasoning over the text.
claude-opus-4-7            Hardest analytical questions: cross-section comparison, contradiction detection, multi-document synthesis. Enable thinking for the gnarly ones.

Full streaming example

"""
Document Q&A with streaming. The document is cached on the first call;
every subsequent question hits the cache.
Requires: pip install anthropic
"""
import os, pathlib
from anthropic import Anthropic

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key=os.environ["BUZZ_API_KEY"],
)

DOC_PATH = pathlib.Path("contract.txt")
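# Pin the encoding so the decoded text (and the cached prefix) is the same on every machine.
DOC_TEXT = DOC_PATH.read_text(encoding="utf-8")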

INSTRUCTIONS = (
    "You answer questions strictly from the document below. "
    "If the answer is not in the document, say so. "
    "Quote the source span verbatim when relevant."
)

def system_blocks():
    return [
        {"type": "text", "text": INSTRUCTIONS},
        {
            "type": "text",
            "text": f'<DOCUMENT name="{DOC_PATH.name}">\n{DOC_TEXT}\n</DOCUMENT>',
            "cache_control": {"type": "ephemeral"},
        },
    ]


def ask(question: str):
    print(f"\nQ: {question}\nA: ", end="", flush=True)
    usage = None
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_blocks(),
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        final = stream.get_final_message()
        usage = final.usage
    print()
    print(
        f"  [usage] input={usage.input_tokens} "
        f"cache_create={usage.cache_creation_input_tokens} "
        f"cache_read={usage.cache_read_input_tokens} "
        f"output={usage.output_tokens}"
    )


if __name__ == "__main__":
    ask("What is the termination clause and how much notice is required?")
    ask("Are there any non-compete restrictions, and how long do they last?")
    ask("Summarise the indemnification section in three bullets.")

// Node.js version: Document Q&A with streaming. Document cached on first call.
// Requires: npm i @anthropic-ai/sdk
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

const client = new Anthropic({
  baseURL: "https://buzzai.cc",
  apiKey: process.env.BUZZ_API_KEY,
});

const DOC_PATH = "contract.txt";
const DOC_TEXT = readFileSync(DOC_PATH, "utf8");

const INSTRUCTIONS =
  "You answer questions strictly from the document below. " +
  "If the answer is not in the document, say so. " +
  "Quote the source span verbatim when relevant.";

function systemBlocks() {
  return [
    { type: "text", text: INSTRUCTIONS },
    {
      type: "text",
      text: `<DOCUMENT name="${DOC_PATH}">\n${DOC_TEXT}\n</DOCUMENT>`,
      cache_control: { type: "ephemeral" },
    },
  ];
}

async function ask(question) {
  process.stdout.write(`\nQ: ${question}\nA: `);
  const stream = client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemBlocks(),
    messages: [{ role: "user", content: question }],
  });

  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      process.stdout.write(event.delta.text);
    }
  }
  const final = await stream.finalMessage();
  console.log(
    `\n  [usage] input=${final.usage.input_tokens} ` +
      `cache_create=${final.usage.cache_creation_input_tokens} ` +
      `cache_read=${final.usage.cache_read_input_tokens} ` +
      `output=${final.usage.output_tokens}`
  );
}

await ask("What is the termination clause and how much notice is required?");
await ask("Are there any non-compete restrictions, and how long do they last?");
await ask("Summarise the indemnification section in three bullets.");

What you'll see in usage

Q: What is the termination clause...
A: Either party may terminate with 60 days written notice...
  [usage] input=22 cache_create=24187 cache_read=0 output=98

Q: Are there any non-compete restrictions...
A: Section 8.2 imposes a 12-month non-compete...
  [usage] input=22 cache_create=0 cache_read=24187 output=84

Q: Summarise the indemnification section...
A: - Mutual indemnification for IP claims...
  [usage] input=22 cache_create=0 cache_read=24187 output=120

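The first call writes 24K tokens to cache. Every subsequent call reads the same 24K from cache at 10% of the input price. Counting reads alone, three calls cost about 1.2x a single uncached call's input (1 + 0.1 + 0.1) instead of 3x. Cache writes can carry a surcharge on top of that: at Anthropic's list price of 1.25x base input for a 5-minute write, the total comes to about 1.45x.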
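
The arithmetic can be sketched as a small function. `relative_input_cost` is a hypothetical helper: the 0.1 read multiplier comes from the text above, while the write multiplier is left as a parameter because the cache-write price depends on the price list in effect (Anthropic's list price for a 5-minute write is 1.25x base input). The handful of uncached question tokens is ignored.

```python
def relative_input_cost(calls: int, write_mult: float = 1.0,
                        read_mult: float = 0.1) -> float:
    """Input cost of `calls` questions over one cached document.

    The unit is one uncached read of the document. The first call pays the
    cache write; every later call pays a cache read.
    """
    if calls < 1:
        return 0.0
    return write_mult + (calls - 1) * read_mult


print(round(relative_input_cost(3), 2))                   # 1.2
print(round(relative_input_cost(3, write_mult=1.25), 2))  # 1.45
print(round(relative_input_cost(20, write_mult=1.25), 2)) # 3.15
```

Without caching the same 20 questions would cost 20 units, so the saving grows with every additional question in the session.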

Document preparation

Wrap with delimiters

Claude does well with explicit boundaries. Use a clear opening and closing tag with a name attribute so the model can cite it back to you:

<DOCUMENT name="employment-agreement-2026.pdf">
... text ...
</DOCUMENT>

Multiple documents

For two or three documents, concatenate them with separate tags. Each one stays inside the same cached block:

<DOCUMENT name="contract-a.pdf">...</DOCUMENT>

<DOCUMENT name="contract-b.pdf">...</DOCUMENT>

For dozens of documents, switch to a retrieval-then-cache approach: retrieve the top 3-5 chunks per query, format them in the system block, and accept that the cache will partially miss when the retrieved set changes.
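
For the two-or-three-document case, a small helper can keep the concatenated block stable. The sketch below sorts documents by name so the block text does not depend on the order they were loaded in; that ordering rule and the name `build_document_block` are assumptions made for illustration, not something the gateway requires.

```python
def build_document_block(docs: dict[str, str]) -> dict:
    """Join several documents into one cached system block.

    Sorting by name keeps the text identical regardless of input order,
    which is what the exact-prefix cache match needs.
    """
    parts = [
        f'<DOCUMENT name="{name}">\n{text}\n</DOCUMENT>'
        for name, text in sorted(docs.items())
    ]
    return {
        "type": "text",
        "text": "\n\n".join(parts),
        "cache_control": {"type": "ephemeral"},
    }


first = build_document_block({"contract-b.pdf": "B text", "contract-a.pdf": "A text"})
second = build_document_block({"contract-a.pdf": "A text", "contract-b.pdf": "B text"})
print(first == second)  # True
```

The same helper works for the retrieval case: sorting the retrieved chunks by a stable identifier at least gives a hit whenever the same set comes back in a different order.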

PDF and Office files

Convert to text first. Common pipelines: pdftotext, pandoc, or library calls like pypdf / pdfjs-dist. Strip headers/footers and page numbers — they pollute the context and confuse citations.
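
A minimal cleanup pass over extracted text might look like this. It assumes pages are separated by form-feed characters (as pdftotext output typically is) and treats any line that repeats on most pages as a header or footer; `clean_pages`, the 0.6 threshold, and the page-number pattern are illustrative choices, and the pattern will also drop body lines that consist only of a number.

```python
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "Page 7 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_pages(raw: str, min_repeat: float = 0.6) -> str:
    """Drop page numbers and lines that repeat across most pages."""
    pages = [p.splitlines() for p in raw.split("\f") if p.strip()]

    # Count on how many pages each distinct non-blank line appears.
    seen: Counter = Counter()
    for lines in pages:
        for line in {ln.strip() for ln in lines if ln.strip()}:
            seen[line] += 1

    threshold = max(2, int(len(pages) * min_repeat))
    kept = []
    for lines in pages:
        for line in lines:
            stripped = line.strip()
            if PAGE_NUMBER.match(stripped):
                continue  # page number
            if stripped and seen[stripped] >= threshold:
                continue  # repeated header or footer
            kept.append(line)
    return "\n".join(kept)


raw = (
    "ACME Confidential\nClause one text.\nPage 1 of 3\f"
    "ACME Confidential\nClause two text.\nPage 2 of 3\f"
    "ACME Confidential\nClause three text.\n3"
)
print(clean_pages(raw))
```

Run the cleanup once and store the result, so that every request sends exactly the same text.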

Streaming the answer

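Set stream: true in the raw request (the SDK's stream() helper sets it for you). Claude emits SSE events; the Python SDK exposes a text_stream iterator that yields just the visible text, and in the Node SDK you can iterate the events and pick out the text_delta payloads, as the example above does. Final usage arrives at the end via get_final_message() / finalMessage().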

Raw SSE shape, for clients that don't use the SDK:

event: message_start
data: {"type":"message_start","message":{"id":"msg_...","usage":{"input_tokens":22,"cache_creation_input_tokens":24187,"cache_read_input_tokens":0,...}}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Either party"}}
...
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":98,...}}

event: message_stop
data: {"type":"message_stop"}
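
For clients that read the stream without the SDK, the events above can be folded into an answer string and a usage dictionary along these lines. This is a sketch only: `parse_sse` is a hypothetical helper that takes already-split lines, ignores every event type it does not recognise, and does no error or reconnect handling.

```python
import json


def parse_sse(lines):
    """Accumulate the answer text and usage counters from raw SSE lines."""
    text_parts, usage = [], {}
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):])
        kind = event.get("type")
        if kind == "message_start":
            # Input and cache counters arrive up front.
            usage.update(event["message"].get("usage", {}))
        elif kind == "content_block_delta" and event["delta"].get("type") == "text_delta":
            text_parts.append(event["delta"]["text"])
        elif kind == "message_delta":
            # Output token count arrives near the end.
            usage.update(event.get("usage", {}))
    return "".join(text_parts), usage


sample = [
    "event: message_start",
    'data: {"type":"message_start","message":{"id":"msg_1","usage":{"input_tokens":22,"cache_read_input_tokens":0}}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Either "}}',
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"party"}}',
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":98}}',
    'data: {"type":"message_stop"}',
]
text, usage = parse_sse(sample)
print(text)   # Either party
print(usage)  # {'input_tokens': 22, 'cache_read_input_tokens': 0, 'output_tokens': 98}
```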

Cache TTL

Anthropic's default ephemeral cache lives for 5 minutes after the last hit. For workloads where users sit and ask questions over an afternoon, that's usually fine — every fresh question refreshes the timer. For sessions with long pauses, opt into the 1-hour TTL:

"cache_control": {"type": "ephemeral", "ttl": "1h"}

The 1-hour TTL has a higher write cost (at Anthropic's list prices, 2x the base input price versus 1.25x for the 5-minute write), but the read cost is the same. It is worth it when you expect gaps of more than 5 minutes between questions.
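
The choice can be reduced to a one-line rule, sketched below. `cache_control_for` and its `expected_gap_seconds` parameter are hypothetical; how you estimate the gap between questions is up to your application, and for gaps longer than an hour neither TTL keeps the cache warm.

```python
def cache_control_for(expected_gap_seconds: float) -> dict:
    """Use the default 5-minute cache unless longer pauses are expected."""
    if expected_gap_seconds > 5 * 60:
        return {"type": "ephemeral", "ttl": "1h"}
    return {"type": "ephemeral"}


print(cache_control_for(90))       # {'type': 'ephemeral'}
print(cache_control_for(20 * 60))  # {'type': 'ephemeral', 'ttl': '1h'}
```

Pick the TTL once per session rather than per question: the value is part of the cached block, so the surest way to keep the prefix identical is not to change it mid-session.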

Keep answers grounded

Two prompt patterns that reliably reduce hallucination:

Pattern 1: refuse if not present

If the answer is not present in the document, reply with exactly:
"NOT_IN_DOCUMENT: <short reason>"

Do not draw on outside knowledge.

Pattern 2: cite spans

Every claim in your answer must be followed by a quoted span from the
document in <quote>...</quote> tags. If a claim has no supporting
quote, do not make it.

Combine both for high-stakes use (legal, compliance). For consumer-facing summaries, pattern 1 alone usually suffices.
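
Both patterns lend themselves to a cheap client-side check. The sketch below assumes the answer follows the formats shown above; `check_grounding` is a hypothetical helper that compares quotes to the document after collapsing whitespace, so it catches invented quotes but says nothing about whether a real quote actually supports the claim next to it.

```python
import re

QUOTE = re.compile(r"<quote>(.*?)</quote>", re.DOTALL)


def _normalise(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause false misses."""
    return " ".join(text.split())


def check_grounding(answer: str, document: str) -> dict:
    """Report refusals and any quoted spans that do not occur in the document."""
    if answer.strip().startswith("NOT_IN_DOCUMENT:"):
        return {"refused": True, "quotes": 0, "unsupported_quotes": []}
    doc = _normalise(document)
    quotes = QUOTE.findall(answer)
    missing = [q for q in quotes if _normalise(q) not in doc]
    return {"refused": False, "quotes": len(quotes), "unsupported_quotes": missing}


doc = "Either party may terminate with 60 days written notice."
good = "Notice is 60 days. <quote>terminate with 60 days written notice</quote>"
bad = "Notice is 90 days. <quote>terminate with 90 days written notice</quote>"

print(check_grounding(good, doc)["unsupported_quotes"])  # []
print(check_grounding(bad, doc)["unsupported_quotes"])   # ['terminate with 90 days written notice']
print(check_grounding("NOT_IN_DOCUMENT: no such clause", doc)["refused"])  # True
```

An answer with unsupported quotes, or with claims but zero quotes under pattern 2, is a reasonable trigger for a retry or a fallback message.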

When this pattern doesn't fit

Situation                                Better approach
Corpus > 200K tokens                     Retrieval (BM25 or embeddings) before this recipe. See /v1/rerank to refine candidates.
Document changes per request             Don't bother caching. The create cost will dwarf any read savings.
Same doc, different users, <5 min apart  Caching works across users: same cache key, same hit. Just keep the prefix byte-identical.
Document contains user PII               Splice user-specific data into the user message, not the cached system block. Keeps the cache key stable and the user data out of long-lived cache.
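
The last case can be sketched as a request builder that never touches the cached blocks. `build_request` and the `<USER_CONTEXT>` tag are hypothetical names chosen for illustration; the model and max_tokens values mirror the examples earlier in this recipe.

```python
def build_request(system_blocks: list[dict], question: str,
                  user_context: str = "") -> dict:
    """Keep the cached system blocks untouched; per-user data goes in the user turn."""
    content = question
    if user_context:
        # Hypothetical tag name; any clear delimiter works.
        content = f"<USER_CONTEXT>\n{user_context}\n</USER_CONTEXT>\n\n{question}"
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": system_blocks,
        "messages": [{"role": "user", "content": content}],
    }


shared = [{"type": "text", "text": "<DOCUMENT>...</DOCUMENT>",
           "cache_control": {"type": "ephemeral"}}]
alice = build_request(shared, "When does my notice period end?", "Employee: Alice, start date 2024-03-01")
bob = build_request(shared, "When does my notice period end?", "Employee: Bob, start date 2025-01-15")
print(alice["system"] == bob["system"])      # True
print(alice["messages"] == bob["messages"])  # False
```

Because the system blocks are identical for both users, both requests can hit the same cache entry while the personal details stay in the uncached part of the prompt.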
