Data Extraction
Tool Use as a forced-JSON channel. Define an input_schema, set tool_choice to that tool, and Claude returns a structured object — no markdown fences, no commentary, no parsing tricks.
Use input_schema as a JSON Schema constraint and tool_choice to force Claude to fill it. Pull the structured object out of the tool_use block and you're done.
Why this beats "respond with JSON"
Asking the model to "respond in JSON" works most of the time. Most. Production needs all of the time. Tool Use gives you:
- Schema-typed fields. The model knows a field is an integer, an enum, or a nested object before it generates.
- No markdown fences or prose. The output is a JSON object inside a tool_use block, not a string you have to slice.
- Required-field enforcement. The model is much more likely to populate fields you mark required than to obey a sentence in the prompt.
- Streaming-friendly. You can stream input_json_delta events and parse partial JSON for live UIs.
Anatomy of an extraction request
{
"model": "claude-sonnet-4-6",
"max_tokens": 1024,
"tools": [
{
"name": "save_invoice",
"description": "Save the extracted invoice to the database.",
"input_schema": { ... your schema ... }
}
],
"tool_choice": {"type": "tool", "name": "save_invoice"},
"messages": [
{"role": "user", "content": "Extract structured data from:\n\n" + raw_text}
]
}
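The key field is tool_choice: {"type": "tool", "name": "save_invoice"}. That forces Claude to call exactly that tool: no other tool and no plain-text response. A complete response comes back with stop_reason: "tool_use"; if generation hits max_tokens first, stop_reason is "max_tokens" and the tool input may be truncated, so size max_tokens for your largest expected record.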
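As a minimal sketch of reading that response, assume you have the raw JSON body of the Messages API response as a Python dict. The helper name tool_input and the hand-written sample response are illustrative and not part of any SDK. The max_tokens guard is an assumption about a failure mode worth checking.

```python
def tool_input(response: dict, tool_name: str) -> dict:
    """Return the input object of the first tool_use block for tool_name."""
    # A generation cut off at max_tokens can leave the tool input incomplete.
    if response.get("stop_reason") == "max_tokens":
        raise RuntimeError("response was cut off at max_tokens")
    for block in response.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise RuntimeError(f"no tool_use block for {tool_name!r}")


# Hand-written stand-in for an API response, for illustration only.
sample_response = {
    "stop_reason": "tool_use",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_example",
            "name": "save_invoice",
            "input": {"invoice_number": "INV-2026-0042", "status": "paid"},
        }
    ],
}

invoice = tool_input(sample_response, "save_invoice")
print(invoice["invoice_number"])
```

The official SDKs expose the same fields as attributes (block.type, block.input), as the full example further down shows.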
Schema field types you'll actually use
Strings
"name": {
"type": "string",
"description": "Customer's full name as written on the invoice."
}
For dates and IDs, use format hints — Claude reads them:
"invoice_date": {"type": "string", "format": "date"},
"customer_email": {"type": "string", "format": "email"}
Integers and numbers
"line_count": {"type": "integer", "minimum": 1, "maximum": 1000},
"total_usd": {"type": "number", "minimum": 0}
Enums
Closed sets are the highest-leverage typing. Don't ask for "category as a string"; give the four valid values and let the model pick:
"status": {
"type": "string",
"enum": ["draft", "sent", "paid", "overdue"]
}
Booleans and nullables
"is_recurring": {"type": "boolean"},
"discount_pct": {"type": ["number", "null"], "minimum": 0, "maximum": 100}
Nested objects
"customer": {
"type": "object",
"properties": {
"name": {"type": "string"},
"email": {"type": "string", "format": "email"},
"address": {
"type": "object",
"properties": {
"street": {"type": "string"},
"city": {"type": "string"},
"country_iso2": {"type": "string", "minLength": 2, "maxLength": 2}
},
"required": ["street", "city", "country_iso2"]
}
},
"required": ["name", "email"]
}
Arrays of objects
"line_items": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"properties": {
"sku": {"type": "string"},
"quantity": {"type": "integer", "minimum": 1},
"unit_price_usd": {"type": "number", "minimum": 0}
},
"required": ["sku", "quantity", "unit_price_usd"]
}
}
Full working example: extract invoice fields
"""
Extract a structured invoice from raw OCR text.
Requires: pip install anthropic jsonschema
"""
import os, json
from anthropic import Anthropic
from jsonschema import Draft202012Validator, ValidationError
client = Anthropic(
base_url="https://buzzai.cc",
api_key=os.environ["BUZZ_API_KEY"],
)
INVOICE_SCHEMA = {
"type": "object",
"properties": {
"invoice_number": {"type": "string"},
"invoice_date": {"type": "string", "format": "date"},
"status": {
"type": "string",
"enum": ["draft", "sent", "paid", "overdue"],
},
"is_recurring": {"type": "boolean"},
"customer": {
"type": "object",
"properties": {
"name": {"type": "string"},
"email": {"type": "string", "format": "email"},
"country_iso2": {
"type": "string", "minLength": 2, "maxLength": 2,
},
},
"required": ["name", "email", "country_iso2"],
},
"line_items": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"properties": {
"sku": {"type": "string"},
"quantity": {"type": "integer", "minimum": 1},
"unit_price_usd": {"type": "number", "minimum": 0},
},
"required": ["sku", "quantity", "unit_price_usd"],
},
},
"total_usd": {"type": "number", "minimum": 0},
},
"required": [
"invoice_number", "invoice_date", "status",
"customer", "line_items", "total_usd",
],
}
EXTRACTOR_TOOL = {
"name": "save_invoice",
"description": "Save the extracted invoice to the database. Call this exactly once with the full extracted record.",
"input_schema": INVOICE_SCHEMA,
}
VALIDATOR = Draft202012Validator(INVOICE_SCHEMA)
def extract(raw_text: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[EXTRACTOR_TOOL],
        tool_choice={"type": "tool", "name": "save_invoice"},
        messages=[
            {
                "role": "user",
                "content": (
                    "Extract every field from the invoice text below. "
                    "If a field is not present, omit it (do not guess). "
                    "Use ISO 8601 for dates and ISO-3166 alpha-2 for country.\n\n"
                    f"<INVOICE>\n{raw_text}\n</INVOICE>"
                ),
            }
        ],
    )
    tool_block = next((b for b in resp.content if b.type == "tool_use"), None)
    if tool_block is None:
        raise RuntimeError("model did not call the tool")
    data = tool_block.input
    # Validate against the schema. Surface all errors at once.
    errors = sorted(VALIDATOR.iter_errors(data), key=lambda e: e.path)
    if errors:
        msg = "\n".join(f" - {list(e.path)}: {e.message}" for e in errors)
        raise ValidationError(f"Schema violations:\n{msg}")
    return data

if __name__ == "__main__":
    sample = """
    INVOICE #INV-2026-0042
    Date: 2026-05-20
    Bill to: Acme Robotics, hello@acme.example, US
    Status: paid (auto-debit, monthly recurring)
    SKU Qty Unit Price
    HW-WIDGET 3 19.99
    SW-LICENSE 1 99.00
    Total: $158.97
    """
    print(json.dumps(extract(sample), indent=2))
// Extract a structured invoice from raw text.
// Requires: npm i @anthropic-ai/sdk ajv ajv-formats
import Anthropic from "@anthropic-ai/sdk";
import Ajv from "ajv";
import addFormats from "ajv-formats";
const client = new Anthropic({
baseURL: "https://buzzai.cc",
apiKey: process.env.BUZZ_API_KEY,
});
const INVOICE_SCHEMA = {
type: "object",
properties: {
invoice_number: { type: "string" },
invoice_date: { type: "string", format: "date" },
status: { type: "string", enum: ["draft", "sent", "paid", "overdue"] },
is_recurring: { type: "boolean" },
customer: {
type: "object",
properties: {
name: { type: "string" },
email: { type: "string", format: "email" },
country_iso2: { type: "string", minLength: 2, maxLength: 2 },
},
required: ["name", "email", "country_iso2"],
},
line_items: {
type: "array",
minItems: 1,
items: {
type: "object",
properties: {
sku: { type: "string" },
quantity: { type: "integer", minimum: 1 },
unit_price_usd: { type: "number", minimum: 0 },
},
required: ["sku", "quantity", "unit_price_usd"],
},
},
total_usd: { type: "number", minimum: 0 },
},
required: [
"invoice_number", "invoice_date", "status",
"customer", "line_items", "total_usd",
],
};
const EXTRACTOR_TOOL = {
name: "save_invoice",
description:
"Save the extracted invoice to the database. " +
"Call this exactly once with the full extracted record.",
input_schema: INVOICE_SCHEMA,
};
const ajv = new Ajv({ allErrors: true, strict: false });
addFormats(ajv);
const validate = ajv.compile(INVOICE_SCHEMA);
export async function extract(rawText) {
const resp = await client.messages.create({
model: "claude-sonnet-4-6",
max_tokens: 2048,
tools: [EXTRACTOR_TOOL],
tool_choice: { type: "tool", name: "save_invoice" },
messages: [
{
role: "user",
content:
"Extract every field from the invoice text below. " +
"If a field is not present, omit it (do not guess). " +
"Use ISO 8601 for dates and ISO-3166 alpha-2 for country.\n\n" +
`<INVOICE>\n${rawText}\n</INVOICE>`,
},
],
});
const toolBlock = resp.content.find((b) => b.type === "tool_use");
if (!toolBlock) throw new Error("model did not call the tool");
const data = toolBlock.input;
if (!validate(data)) {
const msg = validate.errors
.map((e) => ` - ${e.instancePath || "/"}: ${e.message}`)
.join("\n");
throw new Error(`Schema violations:\n${msg}`);
}
return data;
}
const sample = `
INVOICE #INV-2026-0042
Date: 2026-05-20
Bill to: Acme Robotics, hello@acme.example, US
Status: paid (auto-debit, monthly recurring)
SKU Qty Unit Price
HW-WIDGET 3 19.99
SW-LICENSE 1 99.00
Total: $158.97
`;
console.log(JSON.stringify(await extract(sample), null, 2));
Sample output
{
"invoice_number": "INV-2026-0042",
"invoice_date": "2026-05-20",
"status": "paid",
"is_recurring": true,
"customer": {
"name": "Acme Robotics",
"email": "hello@acme.example",
"country_iso2": "US"
},
"line_items": [
{"sku": "HW-WIDGET", "quantity": 3, "unit_price_usd": 19.99},
{"sku": "SW-LICENSE", "quantity": 1, "unit_price_usd": 99.00}
],
"total_usd": 158.97
}
Validating the model's output
The model usually obeys the schema. Usually. Validate before you trust the data:
- Python: jsonschema with Draft202012Validator.
- Node: ajv with ajv-formats.
- If you already use a typed model layer (Pydantic, Zod), generate the JSON Schema from it and validate at the same boundary.
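Schema validation only checks shape and types. A cross-field check can catch records that are well-formed but internally inconsistent. The sketch below is one possible check, not part of the recipe's code: the function name check_total is hypothetical, and it assumes the stated total is the plain sum of the line items with no tax, shipping, or discount.

```python
from decimal import Decimal


def check_total(invoice: dict, tolerance: str = "0.01") -> list[str]:
    """Return human-readable problems; an empty list means the totals agree."""
    # Convert through str() so floats such as 19.99 become exact decimals.
    computed = sum(
        (
            Decimal(str(item["quantity"])) * Decimal(str(item["unit_price_usd"]))
            for item in invoice["line_items"]
        ),
        Decimal("0"),
    )
    stated = Decimal(str(invoice["total_usd"]))
    if abs(computed - stated) > Decimal(tolerance):
        return [f"total_usd is {stated} but line items sum to {computed}"]
    return []


# The record from the sample output above.
record = {
    "line_items": [
        {"sku": "HW-WIDGET", "quantity": 3, "unit_price_usd": 19.99},
        {"sku": "SW-LICENSE", "quantity": 1, "unit_price_usd": 99.00},
    ],
    "total_usd": 158.97,
}
print(check_total(record))
```

Problems found this way can be fed back to the model in the same manner as schema violations.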
What to do on validation failure
Three useful patterns, in order of effort:
- Retry once with the errors as feedback. Append the model's previous tool_use turn, then a user message whose tool_result block (with is_error set to true) carries the validation errors and "Call save_invoice again, fixing these issues." Works for ~90% of one-off failures.
- Drop the bad record into a quarantine table. For batch jobs where you can't block on per-record retries.
- Escalate the model. If Haiku fails validation, retry with Sonnet. If Sonnet fails, retry with Opus. Cost rises but so does compliance.
def extract_with_retry(raw_text, max_retries=2):
    messages = [{"role": "user", "content": f"Extract...\n\n{raw_text}"}]
    for attempt in range(max_retries + 1):
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=2048,
            tools=[EXTRACTOR_TOOL],
            tool_choice={"type": "tool", "name": "save_invoice"},
            messages=messages,
        )
        tool_block = next(b for b in resp.content if b.type == "tool_use")
        errors = list(VALIDATOR.iter_errors(tool_block.input))
        if not errors:
            return tool_block.input
        # Feed errors back and try again
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": "Schema violations:\n" + "\n".join(
                f"- {list(e.path)}: {e.message}" for e in errors
            ) + "\n\nCall save_invoice again with corrected values.",
            "is_error": True,
        }]})
    raise ValueError("max retries exhausted")
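The escalation pattern could be sketched along the same lines. This is a minimal illustration, not the recipe's own code: attempt is a hypothetical callable standing in for "call the API with this model, validate, and return the record or None", and the tier order is taken from the model table in this recipe.

```python
from typing import Callable, Optional, Sequence

# Cheapest tier first; model IDs as used elsewhere in this recipe.
TIERS = ("claude-haiku-4-5-20251001", "claude-sonnet-4-6", "claude-opus-4-7")


def extract_with_escalation(
    raw_text: str,
    attempt: Callable[[str, str], Optional[dict]],
    tiers: Sequence[str] = TIERS,
) -> tuple[str, dict]:
    """Try each model in order; return (model, record) for the first success.

    attempt(model, raw_text) is a hypothetical callable that returns a
    validated record, or None when the output failed validation.
    """
    for model in tiers:
        record = attempt(model, raw_text)
        if record is not None:
            return model, record
    raise ValueError("all model tiers failed validation")


def fake_attempt(model: str, raw_text: str) -> Optional[dict]:
    # Stand-in for a real API call plus validation: only Sonnet "succeeds".
    if model == "claude-sonnet-4-6":
        return {"invoice_number": "INV-2026-0042"}
    return None


print(extract_with_escalation("INVOICE #INV-2026-0042", fake_attempt))
```

Returning the model name alongside the record makes it easy to log how often each tier is needed.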
Streaming partial JSON
For interactive UIs, stream the extraction. Each input_json_delta event carries a partial_json string fragment; concatenate them and parse incrementally with a tolerant JSON parser:
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=[EXTRACTOR_TOOL],
    tool_choice={"type": "tool", "name": "save_invoice"},
    messages=[{"role": "user", "content": prompt}],
) as stream:
    buf = ""
    for event in stream:
        if (event.type == "content_block_delta" and
                event.delta.type == "input_json_delta"):
            buf += event.delta.partial_json
            # Render whatever you can parse so far in your UI
    final = json.loads(buf)
Use a parser that tolerates truncated input (partial-json, or a small "complete to nearest closing brace" helper of your own) so the UI keeps updating before the final closing brace arrives.
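One way to write such a "complete to nearest closing brace" helper with only the standard library is sketched below. The names parse_partial and _closers are hypothetical. The approach is best-effort: it drops trailing characters until the remainder can be closed and parsed, so a number that is still arriving may briefly show with fewer digits. It is also quadratic in the buffer length, which is acceptable for small records but not for large payloads.

```python
import json
from typing import Any, Optional


def _closers(text: str) -> str:
    """Return the suffix needed to close any open string, array, or object."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    return ('"' if in_string else "") + "".join(reversed(stack))


def parse_partial(buf: str) -> Optional[Any]:
    """Parse the longest prefix of buf that can be closed into valid JSON."""
    for end in range(len(buf), 0, -1):
        prefix = buf[:end]
        try:
            return json.loads(prefix + _closers(prefix))
        except json.JSONDecodeError:
            continue  # dangling key, comma, or half-written token; back off
    return None


print(parse_partial('{"invoice_number": "INV-2026-0042", "sta'))
```

Call parse_partial(buf) after each delta and render the result whenever it is not None; keep json.loads(buf) for the final, complete value.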
Multi-record extraction
Two ways to extract many records from one document:
Single tool, array field
Define one extractor with an array as the top-level shape:
"input_schema": {
"type": "object",
"properties": {
"records": {"type": "array", "items": { ... record schema ... }}
},
"required": ["records"]
}
Simplest. Best when records share the same schema and the document has a known upper bound (say, a 5-page invoice with at most 50 line items).
Loop with tool_choice "any"
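For variable-shape extraction (different record types, unknown count), set tool_choice: {"type": "any"} and let the model emit multiple tool_use blocks. After each response, append tool_results acknowledging each save and call the API again. Note that "any" forces a tool call on every turn, so the model cannot finish with end_turn while it is set; either define a tool the model calls to signal it is done, or switch to tool_choice: {"type": "auto"} on follow-up turns and loop until stop_reason is end_turn.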
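The acknowledgement step might look like the following sketch. It assumes the response content is available as a list of plain dicts, as in the raw JSON body. The helper name acknowledge_saves, the save_receipt tool, and the sample IDs are hypothetical.

```python
def acknowledge_saves(content: list[dict]) -> tuple[list[dict], dict]:
    """Collect every tool_use input and build the follow-up user message."""
    records = []
    results = []
    for block in content:
        if block.get("type") != "tool_use":
            continue  # ignore text blocks
        records.append({"tool": block["name"], "data": block["input"]})
        # One tool_result per tool_use, matched by id.
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": "saved",
        })
    return records, {"role": "user", "content": results}


# Hand-written stand-in for one assistant turn with two tool calls.
assistant_content = [
    {"type": "tool_use", "id": "toolu_a", "name": "save_invoice",
     "input": {"invoice_number": "INV-2026-0042"}},
    {"type": "tool_use", "id": "toolu_b", "name": "save_receipt",
     "input": {"receipt_id": "R-1"}},
]

records, follow_up = acknowledge_saves(assistant_content)
print(len(records), follow_up["content"][0]["tool_use_id"])
```

Append the assistant turn and then follow_up to messages before the next request, so every tool_use is paired with a tool_result.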
Pick a model
| Model | Fit |
|---|---|
| claude-haiku-4-5-20251001 | Default for high-volume extraction with simple schemas. Fastest, cheapest. Validate every record. |
| claude-sonnet-4-6 | Complex schemas, deeply nested objects, ambiguous source text. Use as the retry tier above Haiku. |
| claude-opus-4-7 | Highest-stakes extraction (legal, medical) where missing or wrong fields are unacceptable. Worth enabling thinking. |
See also
- Tool Use concept
- POST /v1/messages reference
- Recipe: Agent Loop — for iterative validate-and-retry flows
- JSON Schema specification