Document Q&A
Pour a long document into Claude's 200K-token context, mark it as cached, and stream answers. Every follow-up question reads the cached document at 1/10 the input price.
Request shape
Three layers, in this exact order:
- System block — instructions about how to answer (cached).
- System block — the document itself, wrapped in clear delimiters (cached).
- User message — the actual question (not cached).
{
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"stream": true,
"system": [
{
"type": "text",
"text": "You answer questions strictly from the document below. If the answer is not in the document, say so. Quote the source span verbatim when relevant."
},
{
"type": "text",
"text": "<DOCUMENT name=\"contract.pdf\">\n... 25,000 tokens of document text ...\n</DOCUMENT>",
"cache_control": {"type": "ephemeral"}
}
],
"messages": [
{"role": "user", "content": "What is the termination clause and how much notice is required?"}
]
}
The instructions block doesn't need its own cache_control if it never changes (it gets cached as part of the prefix anyway, up to the deepest cache breakpoint). What matters: everything up to that breakpoint, the document block included, must be byte-identical across calls. If you concatenate user-specific data into the document, the cache misses.
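One way to catch an accidental cache miss before it costs anything is to fingerprint the cached block locally. The sketch below is illustrative only: `document_block` and `document_fingerprint` are hypothetical helper names, and the SHA-256 hash is just a local comparison aid, not something the API sees or uses.

```python
import hashlib


def document_block(name: str, text: str) -> dict:
    """Build the cached system block. Only document content belongs here."""
    return {
        "type": "text",
        "text": f'<DOCUMENT name="{name}">\n{text}\n</DOCUMENT>',
        "cache_control": {"type": "ephemeral"},
    }


def document_fingerprint(block: dict) -> str:
    """Hash the exact bytes of the block text (hypothetical helper)."""
    return hashlib.sha256(block["text"].encode("utf-8")).hexdigest()


doc = "Either party may terminate with 60 days written notice."

stable_a = document_block("contract.txt", doc)
stable_b = document_block("contract.txt", doc)
# Anti-pattern: user-specific data spliced into the cached block.
personalised = document_block("contract.txt", f"Reader: alice\n{doc}")

same_prefix = document_fingerprint(stable_a) == document_fingerprint(stable_b)
changed_prefix = document_fingerprint(stable_a) != document_fingerprint(personalised)
print(same_prefix, changed_prefix)
```

Logging the fingerprint next to the usage numbers makes it easy to tell whether a surprise `cache_create` came from a changed prefix or from an expired cache.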
Pick a model
| Model | Fit |
|---|---|
| claude-haiku-4-5-20251001 | FAQ-style lookups, short documents (<10K tokens), high-volume Q&A. |
| claude-sonnet-4-6 | Default. Long documents, multi-step reasoning over the text. |
| claude-opus-4-7 | Hardest analytical questions: cross-section comparison, contradiction detection, multi-document synthesis. Enable thinking for the gnarly ones. |
Full streaming example
"""
Document Q&A with streaming. The document is cached on the first call;
every subsequent question hits the cache.
Requires: pip install anthropic
"""
import os
import pathlib

from anthropic import Anthropic

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key=os.environ["BUZZ_API_KEY"],
)

DOC_PATH = pathlib.Path("contract.txt")
DOC_TEXT = DOC_PATH.read_text()

INSTRUCTIONS = (
    "You answer questions strictly from the document below. "
    "If the answer is not in the document, say so. "
    "Quote the source span verbatim when relevant."
)


def system_blocks():
    return [
        {"type": "text", "text": INSTRUCTIONS},
        {
            "type": "text",
            "text": f'<DOCUMENT name="{DOC_PATH.name}">\n{DOC_TEXT}\n</DOCUMENT>',
            "cache_control": {"type": "ephemeral"},
        },
    ]


def ask(question: str):
    print(f"\nQ: {question}\nA: ", end="", flush=True)
    usage = None
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=system_blocks(),
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
        final = stream.get_final_message()
        usage = final.usage
    print()
    print(
        f"  [usage] input={usage.input_tokens} "
        f"cache_create={usage.cache_creation_input_tokens} "
        f"cache_read={usage.cache_read_input_tokens} "
        f"output={usage.output_tokens}"
    )


if __name__ == "__main__":
    ask("What is the termination clause and how much notice is required?")
    ask("Are there any non-compete restrictions, and how long do they last?")
    ask("Summarise the indemnification section in three bullets.")
// Document Q&A with streaming. Document cached on first call.
// Requires: npm i @anthropic-ai/sdk
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";
const client = new Anthropic({
  baseURL: "https://buzzai.cc",
  apiKey: process.env.BUZZ_API_KEY,
});

const DOC_PATH = "contract.txt";
const DOC_TEXT = readFileSync(DOC_PATH, "utf8");

const INSTRUCTIONS =
  "You answer questions strictly from the document below. " +
  "If the answer is not in the document, say so. " +
  "Quote the source span verbatim when relevant.";

function systemBlocks() {
  return [
    { type: "text", text: INSTRUCTIONS },
    {
      type: "text",
      text: `<DOCUMENT name="${DOC_PATH}">\n${DOC_TEXT}\n</DOCUMENT>`,
      cache_control: { type: "ephemeral" },
    },
  ];
}

async function ask(question) {
  process.stdout.write(`\nQ: ${question}\nA: `);
  const stream = client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: systemBlocks(),
    messages: [{ role: "user", content: question }],
  });
  for await (const event of stream) {
    if (event.type === "content_block_delta" && event.delta.type === "text_delta") {
      process.stdout.write(event.delta.text);
    }
  }
  const final = await stream.finalMessage();
  console.log(
    `\n  [usage] input=${final.usage.input_tokens} ` +
      `cache_create=${final.usage.cache_creation_input_tokens} ` +
      `cache_read=${final.usage.cache_read_input_tokens} ` +
      `output=${final.usage.output_tokens}`
  );
}

await ask("What is the termination clause and how much notice is required?");
await ask("Are there any non-compete restrictions, and how long do they last?");
await ask("Summarise the indemnification section in three bullets.");
What you'll see in usage
Q: What is the termination clause...
A: Either party may terminate with 60 days written notice...
[usage] input=22 cache_create=24187 cache_read=0 output=98
Q: Are there any non-compete restrictions...
A: Section 8.2 imposes a 12-month non-compete...
[usage] input=22 cache_create=0 cache_read=24187 output=84
Q: Summarise the indemnification section...
A: - Mutual indemnification for IP claims...
[usage] input=22 cache_create=0 cache_read=24187 output=120
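The first call writes 24K tokens to cache. Every subsequent call reads the same 24K from cache at 10% of the input price. Cache writes carry a premium (1.25x the base input price at Anthropic's list pricing for the 5-minute TTL), so across three calls you pay roughly 1.45x the cost of a single uncached call's input, instead of 3x.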
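The exact multiple depends on what a cache write costs on your account. The sketch below works through the arithmetic under stated assumptions: `relative_input_cost` is a hypothetical helper, the default multipliers (1.25x to write, 0.10x to read) are assumptions to replace with your provider's actual price sheet, and the handful of uncached question tokens per call is ignored.

```python
def relative_input_cost(calls: int, write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of the cached document across `calls` requests, measured in
    units of one uncached read of the document.

    The first call writes the cache; every later call reads it.
    """
    if calls < 1:
        return 0.0
    return write_mult + read_mult * (calls - 1)


# With a write premium: 1.25 + 0.10 + 0.10
with_premium = relative_input_cost(3)
# If writes were billed like ordinary input: 1.00 + 0.10 + 0.10
no_premium = relative_input_cost(3, write_mult=1.0)

for n in (1, 3, 10):
    print(f"{n} calls: cached={relative_input_cost(n):.2f}x  uncached={n}.00x")
```

Under these assumptions caching costs more on a single call than skipping it, so the saving only appears from the second question onward.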
Document preparation
Wrap with delimiters
Claude does well with explicit boundaries. Use a clear opening and closing tag with a name attribute so the model can cite it back to you:
<DOCUMENT name="employment-agreement-2026.pdf">
... text ...
</DOCUMENT>
Multiple documents
For two or three documents, concatenate them with separate tags. Each one stays inside the same cached block:
<DOCUMENT name="contract-a.pdf">...</DOCUMENT>
<DOCUMENT name="contract-b.pdf">...</DOCUMENT>
For dozens of documents, switch to a retrieval-then-cache approach: retrieve the top 3-5 chunks per query, format them in the system block, and accept that the cache will partially miss when the retrieved set changes.
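A minimal sketch of the retrieval-then-cache idea follows. It makes several assumptions: plain term overlap stands in for BM25 or embeddings, the function names and the `chunk-N` naming are hypothetical, and the chunks are already split. The one detail worth copying is that the selected chunks are rendered in their original order, so two questions that retrieve the same set produce the same bytes and can share a cache entry.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric terms; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(question: str, chunks: list[str], k: int = 3) -> list[int]:
    """Indices of the k chunks sharing the most terms with the question."""
    q = tokenize(question)
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: (-len(q & tokenize(chunks[i])), i),
    )
    # Re-sort the winners by position so the same set always renders identically.
    return sorted(ranked[:k])


def retrieved_block(question: str, chunks: list[str], k: int = 3) -> dict:
    """Format the retrieved chunks as one cacheable system block."""
    body = "\n".join(
        f'<DOCUMENT name="chunk-{i}">\n{chunks[i]}\n</DOCUMENT>'
        for i in top_chunks(question, chunks, k)
    )
    return {"type": "text", "text": body, "cache_control": {"type": "ephemeral"}}


chunks = [
    "Termination requires 60 days notice.",
    "Payment is due in 30 days.",
    "Notice of termination must be written.",
    "Governing law is Delaware.",
]
first = retrieved_block("termination notice", chunks, k=2)
second = retrieved_block("notice for termination?", chunks, k=2)
print(first["text"] == second["text"])
```

When the retrieved set differs between questions, the block text differs and that call pays a fresh cache write, which is the partial miss described above.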
PDF and Office files
Convert to text first. Common pipelines: pdftotext, pandoc, or library calls like pypdf / pdfjs-dist. Strip headers/footers and page numbers — they pollute the context and confuse citations.
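The header and footer stripping can be sketched as below. This assumes the extracted text separates pages with form feed characters, as pdftotext output does; `clean_pages` and the 60% repetition threshold are hypothetical choices, not part of any library. It is a heuristic: a legitimate line consisting only of a number, or a heading that repeats on most pages, would also be dropped, so spot-check the result.

```python
import math
import re
from collections import Counter

# Matches lines such as "7", "Page 7", or "Page 7 of 12".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s+of\s+\d+)?\s*$", re.IGNORECASE)


def clean_pages(raw: str, min_repeat: float = 0.6) -> str:
    """Drop page-number lines and lines that repeat on most pages."""
    pages = [p.splitlines() for p in raw.split("\f") if p.strip()]

    # Count on how many pages each distinct non-empty line appears.
    seen = Counter()
    for lines in pages:
        seen.update({ln.strip() for ln in lines if ln.strip()})

    # Require at least two pages so single-page documents lose nothing.
    threshold = max(2, math.ceil(len(pages) * min_repeat))
    repeated = {ln for ln, count in seen.items() if count >= threshold}

    kept = []
    for lines in pages:
        for ln in lines:
            stripped = ln.strip()
            if stripped in repeated or PAGE_NUMBER.match(stripped):
                continue
            kept.append(ln)
    return "\n".join(kept)


raw = (
    "ACME Confidential\nClause 1 text\nPage 1 of 3\f"
    "ACME Confidential\nClause 2 text\nPage 2 of 3\f"
    "ACME Confidential\nClause 3 text\n3"
)
cleaned = clean_pages(raw)
print(cleaned)
```

Run the cleanup once and store the result, since the cached block has to be byte-identical on every call.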
Streaming the answer
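Set stream: true on raw requests; the SDKs' stream() helpers set it for you. Claude emits SSE events. The Python SDK exposes a text_stream iterator that yields just the visible text, while the TypeScript example above filters content_block_delta events for text_delta. Final usage arrives at the end via get_final_message() (Python) or finalMessage() (TypeScript).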
Raw SSE shape, for clients that don't use the SDK:
event: message_start
data: {"type":"message_start","message":{"id":"msg_...","usage":{"input_tokens":22,"cache_creation_input_tokens":24187,"cache_read_input_tokens":0,...}}}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Either party"}}
...
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":98,...}}
event: message_stop
data: {"type":"message_stop"}
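For a client without the SDK, the events above can be folded into an answer string and a usage dictionary roughly as follows. This is a sketch under assumptions: `consume_sse` is a hypothetical helper, it expects each `data:` payload to be a single line of JSON as in the sample, and it ignores event types it does not recognise. The message id and token counts in the demo input are made-up sample values.

```python
import json


def consume_sse(lines):
    """Fold raw SSE lines into (answer_text, usage_dict)."""
    text_parts = []
    usage = {}
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):].strip())
        kind = event.get("type")
        if kind == "message_start":
            # Input and cache token counts arrive with the first event.
            usage.update(event["message"].get("usage", {}))
        elif kind == "content_block_delta" and event["delta"].get("type") == "text_delta":
            text_parts.append(event["delta"]["text"])
        elif kind == "message_delta":
            # Output token count arrives near the end of the stream.
            usage.update(event.get("usage", {}))
    return "".join(text_parts), usage


sample = [
    "event: message_start",
    'data: {"type":"message_start","message":{"id":"msg_sample","usage":'
    '{"input_tokens":22,"cache_creation_input_tokens":0,"cache_read_input_tokens":24187}}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Either party"}}',
    "",
    "event: content_block_delta",
    'data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":" may terminate"}}',
    "",
    "event: message_delta",
    'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":98}}',
    "",
    "event: message_stop",
    'data: {"type":"message_stop"}',
]
answer, usage = consume_sse(sample)
print(answer, usage)
```

In a real client the `lines` argument would be the decoded lines of the HTTP response body, read incrementally so text can be shown as it arrives.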
Cache TTL
Anthropic's default ephemeral cache lives for 5 minutes after the last hit. For workloads where users sit and ask questions over an afternoon, that's usually fine — every fresh question refreshes the timer. For sessions with long pauses, opt into the 1-hour TTL:
"cache_control": {"type": "ephemeral", "ttl": "1h"}
The 1-hour TTL has a higher write cost (2x the base input price at Anthropic's list pricing, versus 1.25x for the 5-minute write), but the read cost is the same. It's worth it when you expect gaps of more than 5 minutes between questions.
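If the expected pause between questions is known ahead of time, the choice can be made in code. A small sketch, assuming a hypothetical helper `cache_control_for` and a 300-second cutoff that mirrors the default TTL:

```python
def cache_control_for(expected_gap_seconds: float) -> dict:
    """Pick a cache_control value from the longest pause expected between questions."""
    if expected_gap_seconds < 300:
        # Each question lands inside the default 5-minute window and refreshes it.
        return {"type": "ephemeral"}
    # Longer pauses would let the default cache expire, so opt into the 1-hour TTL.
    return {"type": "ephemeral", "ttl": "1h"}


chatty = cache_control_for(60)
slow = cache_control_for(20 * 60)
print(chatty, slow)
```

For pauses longer than an hour the cache expires under either setting, and the next question pays the write cost again.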
Keep answers grounded
Two prompt patterns that reliably reduce hallucination:
Pattern 1: refuse if not present
If the answer is not present in the document, reply with exactly:
"NOT_IN_DOCUMENT: <short reason>"
Do not draw on outside knowledge.
Pattern 2: cite spans
Every claim in your answer must be followed by a quoted span from the
document in <quote>...</quote> tags. If a claim has no supporting
quote, do not make it.
Combine both for high-stakes use (legal, compliance). For consumer-facing summaries, pattern 1 alone usually suffices.
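Both patterns produce output that can be checked mechanically before an answer reaches the user. The sketch below assumes the model followed the formats above; `check_answer` and its status labels are hypothetical names, and whitespace is normalised so that a quote spanning a line break in the document still matches.

```python
import re

QUOTE = re.compile(r"<quote>(.*?)</quote>", re.DOTALL)


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return " ".join(text.split())


def check_answer(answer: str, document: str) -> dict:
    """Classify an answer as refused, uncited, unsupported, or grounded."""
    if answer.strip().startswith("NOT_IN_DOCUMENT:"):
        return {"status": "refused", "unsupported": []}

    doc = normalize(document)
    quotes = [normalize(q) for q in QUOTE.findall(answer)]
    # A quote counts as supported only if it appears verbatim in the document.
    unsupported = [q for q in quotes if q not in doc]

    if not quotes:
        status = "uncited"
    elif unsupported:
        status = "unsupported"
    else:
        status = "grounded"
    return {"status": status, "unsupported": unsupported}


document = "Either party may terminate with 60 days\nwritten notice."
good = check_answer(
    "Notice is 60 days. <quote>terminate with 60 days written notice</quote>", document
)
bad = check_answer("Notice is 90 days. <quote>90 days written notice</quote>", document)
refused = check_answer("NOT_IN_DOCUMENT: no non-compete clause appears.", document)
print(good["status"], bad["status"], refused["status"])
```

A verbatim match shows only that the quote exists in the document, not that it supports the claim next to it, so this works as a first filter and not as a full grounding check.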
When this pattern doesn't fit
| Situation | Better approach |
|---|---|
| Corpus > 200K tokens | Retrieval (BM25 or embeddings) before this recipe. See /v1/rerank to refine candidates. |
| Document changes per request | Don't bother caching. The create cost will dwarf any read savings. |
| Same doc, different users, <5 min apart | Caching works across users — same cache key, same hit. Just keep the prefix byte-identical. |
| Document contains user PII | Splice user-specific data into the user message, not the cached system block. Keeps the cache key stable and the user data out of long-lived cache. |