Data Extraction

Tool Use as a forced-JSON channel. Define an input_schema, set tool_choice to that tool, and Claude returns a structured object — no markdown fences, no commentary, no parsing tricks.

POST https://buzzai.cc/v1/messages

The trick. You're not actually going to "call" the tool. You're using input_schema as a JSON Schema constraint and tool_choice to force Claude to fill it. Pull the structured object out of the tool_use block and you're done.

Why this beats "respond with JSON"

Asking the model to "respond in JSON" works most of the time. Most. Production needs all of the time. Tool Use gives you:

Schema-typed fields. The model knows a field is an integer, an enum, or a nested object before it generates.
No markdown fences or prose. The output is a JSON object inside a tool_use block, not a string you have to slice.
Required-field enforcement. The model is much more likely to populate fields you mark required than to obey a sentence in the prompt.
Streaming-friendly. You can stream input_json_delta events and parse partial JSON for live UIs.

Anatomy of an extraction request

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "save_invoice",
      "description": "Save the extracted invoice to the database.",
      "input_schema": { ... your schema ... }
    }
  ],
  "tool_choice": {"type": "tool", "name": "save_invoice"},
  "messages": [
    {"role": "user", "content": "Extract structured data from:\n\n" + raw_text}
  ]
}

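The key field is tool_choice: {"type": "tool", "name": "save_invoice"}. It forces Claude to call exactly that tool: no other tool, and no plain-text reply. A complete response comes back with stop_reason: "tool_use"; if generation hits max_tokens first, stop_reason is "max_tokens" and the tool input may be truncated.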
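
It can still help to check stop_reason before reading the block, since a response cut off by max_tokens may carry an incomplete tool input. The sketch below shows one way to do that, assuming the response has already been decoded into a plain dict. `require_complete_tool_call` is a hypothetical helper name, and the response literal is a hand-written stand-in, not real API output.

```python
def require_complete_tool_call(response: dict, tool_name: str) -> dict:
    """Return the tool input, or raise if the response is truncated or off-script."""
    stop_reason = response.get("stop_reason")
    if stop_reason == "max_tokens":
        raise RuntimeError("output hit max_tokens; the tool input may be incomplete")
    if stop_reason != "tool_use":
        raise RuntimeError(f"unexpected stop_reason: {stop_reason!r}")
    for block in response.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise RuntimeError(f"no tool_use block for {tool_name!r}")


# Hand-written stand-in for a decoded response body (illustrative values only).
ok_response = {
    "stop_reason": "tool_use",
    "content": [{
        "type": "tool_use",
        "id": "toolu_example",
        "name": "save_invoice",
        "input": {"invoice_number": "INV-2026-0042"},
    }],
}
print(require_complete_tool_call(ok_response, "save_invoice"))
```

Raising on truncation keeps a half-written record from reaching the validator, where it would show up as a confusing list of missing fields.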

Schema field types you'll actually use

Strings

"name": {
  "type": "string",
  "description": "Customer's full name as written on the invoice."
}

For dates and emails, add format hints. Claude reads them, though they guide the model rather than guarantee the output:

"invoice_date": {"type": "string", "format": "date"},
"customer_email": {"type": "string", "format": "email"}

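Format keywords are annotations in recent JSON Schema drafts, so a validator may not enforce them unless asked to. As a rough stdlib-only sketch, the two formats above could be re-checked after extraction like this. `is_iso_date` and `looks_like_email` are hypothetical helpers, and the email check is deliberately loose (shape only, not deliverability).

```python
import re
from datetime import date


def is_iso_date(value: str) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def looks_like_email(value: str) -> bool:
    """Loose shape check: one '@', non-empty local part, dotted domain."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


print(is_iso_date("2026-05-20"))               # True
print(is_iso_date("2026-02-30"))               # False: no such day
print(looks_like_email("hello@acme.example"))  # True
```
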
Integers and numbers

"line_count": {"type": "integer", "minimum": 1, "maximum": 1000},
"total_usd": {"type": "number", "minimum": 0}

Enums

Closed sets are the highest-leverage typing. Don't ask for "category as a string"; list the valid values and let the model pick:

"status": {
  "type": "string",
  "enum": ["draft", "sent", "paid", "overdue"]
}

Booleans and nullables

"is_recurring": {"type": "boolean"},
"discount_pct": {"type": ["number", "null"], "minimum": 0, "maximum": 100}

Nested objects

"customer": {
  "type": "object",
  "properties": {
    "name":  {"type": "string"},
    "email": {"type": "string", "format": "email"},
    "address": {
      "type": "object",
      "properties": {
        "street": {"type": "string"},
        "city":   {"type": "string"},
        "country_iso2": {"type": "string", "minLength": 2, "maxLength": 2}
      },
      "required": ["street", "city", "country_iso2"]
    }
  },
  "required": ["name", "email"]
}

Arrays of objects

"line_items": {
  "type": "array",
  "minItems": 1,
  "items": {
    "type": "object",
    "properties": {
      "sku":      {"type": "string"},
      "quantity": {"type": "integer", "minimum": 1},
      "unit_price_usd": {"type": "number", "minimum": 0}
    },
    "required": ["sku", "quantity", "unit_price_usd"]
  }
}

Full working example: extract invoice fields

"""
Extract a structured invoice from raw OCR text.
Requires: pip install anthropic jsonschema
"""
import os, json
from anthropic import Anthropic
from jsonschema import Draft202012Validator, ValidationError

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key=os.environ["BUZZ_API_KEY"],
)

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date":   {"type": "string", "format": "date"},
        "status": {
            "type": "string",
            "enum": ["draft", "sent", "paid", "overdue"],
        },
        "is_recurring": {"type": "boolean"},
        "customer": {
            "type": "object",
            "properties": {
                "name":  {"type": "string"},
                "email": {"type": "string", "format": "email"},
                "country_iso2": {
                    "type": "string", "minLength": 2, "maxLength": 2,
                },
            },
            "required": ["name", "email", "country_iso2"],
        },
        "line_items": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "properties": {
                    "sku":      {"type": "string"},
                    "quantity": {"type": "integer", "minimum": 1},
                    "unit_price_usd": {"type": "number", "minimum": 0},
                },
                "required": ["sku", "quantity", "unit_price_usd"],
            },
        },
        "total_usd": {"type": "number", "minimum": 0},
    },
    "required": [
        "invoice_number", "invoice_date", "status",
        "customer", "line_items", "total_usd",
    ],
}

EXTRACTOR_TOOL = {
    "name": "save_invoice",
    "description": "Save the extracted invoice to the database. Call this exactly once with the full extracted record.",
    "input_schema": INVOICE_SCHEMA,
}

VALIDATOR = Draft202012Validator(INVOICE_SCHEMA)


def extract(raw_text: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[EXTRACTOR_TOOL],
        tool_choice={"type": "tool", "name": "save_invoice"},
        messages=[
            {
                "role": "user",
                "content": (
                    "Extract every field from the invoice text below. "
                    "If a field is not present, omit it (do not guess). "
                    "Use ISO 8601 for dates and ISO-3166 alpha-2 for country.\n\n"
                    f"<INVOICE>\n{raw_text}\n</INVOICE>"
                ),
            }
        ],
    )

    tool_block = next((b for b in resp.content if b.type == "tool_use"), None)
    if tool_block is None:
        raise RuntimeError("model did not call the tool")

    data = tool_block.input

    # Validate against the schema. Surface all errors at once.
    errors = sorted(VALIDATOR.iter_errors(data), key=lambda e: e.path)
    if errors:
        msg = "\n".join(f"  - {list(e.path)}: {e.message}" for e in errors)
        raise ValidationError(f"Schema violations:\n{msg}")

    return data


if __name__ == "__main__":
    sample = """
    INVOICE #INV-2026-0042
    Date: 2026-05-20
    Bill to: Acme Robotics, hello@acme.example, US
    Status: paid (auto-debit, monthly recurring)

    SKU         Qty   Unit Price
    HW-WIDGET    3    19.99
    SW-LICENSE   1    99.00

    Total: $158.97
    """
    print(json.dumps(extract(sample), indent=2))

// Extract a structured invoice from raw text.
// Requires: npm i @anthropic-ai/sdk ajv ajv-formats
import Anthropic from "@anthropic-ai/sdk";
import Ajv from "ajv";
import addFormats from "ajv-formats";

const client = new Anthropic({
  baseURL: "https://buzzai.cc",
  apiKey: process.env.BUZZ_API_KEY,
});

const INVOICE_SCHEMA = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    invoice_date: { type: "string", format: "date" },
    status: { type: "string", enum: ["draft", "sent", "paid", "overdue"] },
    is_recurring: { type: "boolean" },
    customer: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string", format: "email" },
        country_iso2: { type: "string", minLength: 2, maxLength: 2 },
      },
      required: ["name", "email", "country_iso2"],
    },
    line_items: {
      type: "array",
      minItems: 1,
      items: {
        type: "object",
        properties: {
          sku: { type: "string" },
          quantity: { type: "integer", minimum: 1 },
          unit_price_usd: { type: "number", minimum: 0 },
        },
        required: ["sku", "quantity", "unit_price_usd"],
      },
    },
    total_usd: { type: "number", minimum: 0 },
  },
  required: [
    "invoice_number", "invoice_date", "status",
    "customer", "line_items", "total_usd",
  ],
};

const EXTRACTOR_TOOL = {
  name: "save_invoice",
  description:
    "Save the extracted invoice to the database. " +
    "Call this exactly once with the full extracted record.",
  input_schema: INVOICE_SCHEMA,
};

const ajv = new Ajv({ allErrors: true, strict: false });
addFormats(ajv);
const validate = ajv.compile(INVOICE_SCHEMA);

export async function extract(rawText) {
  const resp = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    tools: [EXTRACTOR_TOOL],
    tool_choice: { type: "tool", name: "save_invoice" },
    messages: [
      {
        role: "user",
        content:
          "Extract every field from the invoice text below. " +
          "If a field is not present, omit it (do not guess). " +
          "Use ISO 8601 for dates and ISO-3166 alpha-2 for country.\n\n" +
          `<INVOICE>\n${rawText}\n</INVOICE>`,
      },
    ],
  });

  const toolBlock = resp.content.find((b) => b.type === "tool_use");
  if (!toolBlock) throw new Error("model did not call the tool");

  const data = toolBlock.input;
  if (!validate(data)) {
    const msg = validate.errors
      .map((e) => `  - ${e.instancePath || "/"}: ${e.message}`)
      .join("\n");
    throw new Error(`Schema violations:\n${msg}`);
  }
  return data;
}

const sample = `
INVOICE #INV-2026-0042
Date: 2026-05-20
Bill to: Acme Robotics, hello@acme.example, US
Status: paid (auto-debit, monthly recurring)

SKU         Qty   Unit Price
HW-WIDGET    3    19.99
SW-LICENSE   1    99.00

Total: $158.97
`;
console.log(JSON.stringify(await extract(sample), null, 2));

Sample output

{
  "invoice_number": "INV-2026-0042",
  "invoice_date": "2026-05-20",
  "status": "paid",
  "is_recurring": true,
  "customer": {
    "name": "Acme Robotics",
    "email": "hello@acme.example",
    "country_iso2": "US"
  },
  "line_items": [
    {"sku": "HW-WIDGET",  "quantity": 3, "unit_price_usd": 19.99},
    {"sku": "SW-LICENSE", "quantity": 1, "unit_price_usd": 99.00}
  ],
  "total_usd": 158.97
}
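
Schema validation cannot express cross-field rules such as "the total equals the sum of the line items". A minimal sketch of that check against the sample output above, using Decimal to sidestep float rounding. `totals_match` is a hypothetical helper and the one-cent tolerance is an assumption; it also ignores tax, shipping, and discounts, which real invoices often include.

```python
from decimal import Decimal


def totals_match(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Compare total_usd with the sum of quantity * unit_price_usd."""
    line_sum = sum(
        (Decimal(str(item["quantity"])) * Decimal(str(item["unit_price_usd"]))
         for item in invoice["line_items"]),
        Decimal("0"),
    )
    return abs(line_sum - Decimal(str(invoice["total_usd"]))) <= tolerance


# The relevant fields from the sample output.
sample_output = {
    "line_items": [
        {"sku": "HW-WIDGET", "quantity": 3, "unit_price_usd": 19.99},
        {"sku": "SW-LICENSE", "quantity": 1, "unit_price_usd": 99.00},
    ],
    "total_usd": 158.97,
}
print(totals_match(sample_output))  # True: 3 * 19.99 + 99.00 = 158.97
```

A mismatch here is a useful signal that the model misread a quantity or price even though every field passed the schema.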

Validating the model's output

The model usually obeys the schema. Usually. Validate before you trust the data:

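Python: jsonschema with Draft202012Validator. Pass a format_checker if you want format keywords such as date enforced; by default they are not checked.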
Node: ajv with ajv-formats.
If you already use a typed model layer (Pydantic, Zod), generate the JSON Schema from it and validate at the same boundary.

What to do on validation failure

Three useful patterns, in order of effort:

Retry once with the errors as feedback. Append the model's previous tool_use turn, then a user message whose tool_result block carries the validation errors and "Call save_invoice again, fixing these issues." Works for ~90% of one-off failures.
Drop the bad record into a quarantine table. For batch jobs where you can't block on per-record retries.
Escalate the model. If Haiku fails validation, retry with Sonnet. If Sonnet fails, retry with Opus. Cost rises but so does compliance.

def extract_with_retry(raw_text, max_retries=2):
    """Re-ask with the validation errors as feedback until the record validates."""
    messages = [{"role": "user", "content": f"Extract...\n\n{raw_text}"}]
    for attempt in range(max_retries + 1):
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=2048,
            tools=[EXTRACTOR_TOOL],
            tool_choice={"type": "tool", "name": "save_invoice"},
            messages=messages,
        )
        # Default to None so a missing tool call raises a clear error
        # instead of a bare StopIteration.
        tool_block = next((b for b in resp.content if b.type == "tool_use"), None)
        if tool_block is None:
            raise RuntimeError("model did not call the tool")
        errors = list(VALIDATOR.iter_errors(tool_block.input))
        if not errors:
            return tool_block.input
        # Feed errors back and try again: replay the assistant turn, then
        # answer its tool_use with an error tool_result.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": "Schema violations:\n" + "\n".join(
                f"- {list(e.path)}: {e.message}" for e in errors
            ) + "\n\nCall save_invoice again with corrected values.",
            "is_error": True,
        }]})
    raise ValueError("max retries exhausted")
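
The escalation pattern from the list above can be sketched with the model call injected as a function, so the control flow can be exercised without network access. `extract_once` and `validate` are hypothetical callables you would supply (for example, thin wrappers around client.messages.create and VALIDATOR.iter_errors). The tier order is an assumption based on the model names used in this recipe.

```python
from typing import Callable, Sequence

# Cheapest first; model names as used elsewhere in this recipe.
MODEL_TIERS = ("claude-haiku-4-5-20251001", "claude-sonnet-4-6", "claude-opus-4-7")


def extract_with_escalation(
    raw_text: str,
    extract_once: Callable[[str, str], dict],
    validate: Callable[[dict], list],
    tiers: Sequence[str] = MODEL_TIERS,
) -> tuple:
    """Try each model in order; return (model, record) for the first valid record."""
    last_errors: list = []
    for model in tiers:
        record = extract_once(model, raw_text)
        last_errors = validate(record)
        if not last_errors:
            return model, record
    raise ValueError(f"all tiers failed validation: {last_errors}")


# Stand-in callables for illustration: the cheapest tier "forgets" a field.
def fake_extract(model: str, raw_text: str) -> dict:
    return {} if "haiku" in model else {"invoice_number": "INV-1"}


def fake_validate(record: dict) -> list:
    return [] if "invoice_number" in record else ["invoice_number is required"]


print(extract_with_escalation("raw invoice text", fake_extract, fake_validate))
```

Returning the model name alongside the record makes it easy to log how often each tier is needed, which is the number that tells you whether the cheap tier is pulling its weight.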

Streaming partial JSON

For interactive UIs, stream the extraction. Each input_json_delta event carries a partial_json string fragment; concatenate them and parse incrementally with a tolerant JSON parser:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=[EXTRACTOR_TOOL],
    tool_choice={"type": "tool", "name": "save_invoice"},
    messages=[{"role": "user", "content": prompt}],
) as stream:
    buf = ""
    for event in stream:
        if (event.type == "content_block_delta" and
            event.delta.type == "input_json_delta"):
            buf += event.delta.partial_json
            # Render whatever you can parse so far in your UI
    final = json.loads(buf)

Use a parser that tolerates truncated input (for example partial-json, or a small "complete to the nearest closing brace" helper) to keep the UI updating before the final closing brace arrives.
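
One possible shape for such a helper, as a stdlib-only sketch: track open strings and containers, append whatever closers are owed, and attempt a normal parse. `close_partial_json` and `parse_partial` are hypothetical names. A value that is still mid-token (a half-written string or number) will appear truncated, so treat intermediate results as provisional.

```python
import json
from typing import Any, Optional


def close_partial_json(buf: str) -> str:
    """Append the quote and brackets needed to balance a truncated JSON prefix."""
    stack = []          # closers still owed, innermost last
    in_string = False
    escaped = False
    for ch in buf:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            stack.append("}")
        elif ch == "[":
            stack.append("]")
        elif ch in "}]" and stack:
            stack.pop()
    tail = '"' if in_string else ""
    return buf + tail + "".join(reversed(stack))


def parse_partial(buf: str) -> Optional[Any]:
    """Best-effort parse; None when the prefix cannot be completed yet."""
    try:
        return json.loads(close_partial_json(buf))
    except json.JSONDecodeError:
        return None


# Simulated partial_json fragments (illustrative, not real stream output).
buf = ""
for chunk in ['{"invoice_number": "INV-20', '26-0042", "status": "pa', 'id"}']:
    buf += chunk
    print(parse_partial(buf))
```

When `parse_partial` returns None (for instance right after a trailing comma or a bare key), keep showing the last successful parse rather than clearing the UI.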

Multi-record extraction

Two ways to extract many records from one document:

Single tool, array field

Define one extractor with an array as the top-level shape:

"input_schema": {
  "type": "object",
  "properties": {
    "records": {"type": "array", "items": { ... record schema ... }}
  },
  "required": ["records"]
}

Simplest. Best when records share the same schema and the document has a known upper bound (say, a 5-page invoice with at most 50 line items).
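
A minimal sketch of wrapping an existing record schema this way. `wrap_as_batch` and the tool name `save_records` are hypothetical, and the maxItems bound is an assumption added to reflect the "known upper bound" advice above.

```python
def wrap_as_batch(record_schema: dict, max_records: int = 50) -> dict:
    """Build a tool definition whose input is {"records": [record, ...]}."""
    return {
        "name": "save_records",
        "description": "Save every extracted record. Call exactly once.",
        "input_schema": {
            "type": "object",
            "properties": {
                "records": {
                    "type": "array",
                    "items": record_schema,
                    "maxItems": max_records,
                }
            },
            "required": ["records"],
        },
    }


# A small record schema for illustration.
line_item_schema = {
    "type": "object",
    "properties": {"sku": {"type": "string"}, "quantity": {"type": "integer"}},
    "required": ["sku", "quantity"],
}
tool = wrap_as_batch(line_item_schema)
print(tool["input_schema"]["properties"]["records"]["maxItems"])  # 50
```

On the way out, the records live at `tool_block.input["records"]`; validate each item individually so one bad record does not sink the batch.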

Loop with tool_choice "any"

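For variable-shape extraction (different record types, unknown count), set tool_choice: {"type": "any"} and let the model emit multiple tool_use blocks. Run the loop, appending a tool_result that acknowledges each save, until the model emits end_turn. Because "any" forces a tool call on every turn, switch to {"type": "auto"} on follow-up turns so the model is able to stop.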
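
The loop could look roughly like the sketch below, with the API call injected as `call_model` (a hypothetical callable taking the messages and a tool_choice and returning a dict-shaped response). It starts with "any" and switches to "auto" after the first turn, since a forced tool call on every turn would leave the model no way to finish. The `max_turns` cap and the `save_receipt` tool name in the stand-in are assumptions for illustration.

```python
from typing import Callable


def collect_records(
    call_model: Callable[[list, dict], dict],
    prompt: str,
    max_turns: int = 20,
) -> list:
    """Run the tool loop and return every tool_use input the model emitted."""
    messages = [{"role": "user", "content": prompt}]
    tool_choice = {"type": "any"}
    records = []
    for _ in range(max_turns):
        response = call_model(messages, tool_choice)
        calls = [b for b in response["content"] if b["type"] == "tool_use"]
        records.extend({"tool": b["name"], "input": b["input"]} for b in calls)
        if response["stop_reason"] != "tool_use" or not calls:
            return records
        # Acknowledge every call so the next turn is a valid continuation.
        messages.append({"role": "assistant", "content": response["content"]})
        messages.append({"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": b["id"], "content": "saved"}
            for b in calls
        ]})
        tool_choice = {"type": "auto"}
    raise RuntimeError("max_turns reached before the model stopped")


# Stand-in for the API: two tool calls on the first turn, then end_turn.
def fake_call_model(messages: list, tool_choice: dict) -> dict:
    if len(messages) == 1:
        return {"stop_reason": "tool_use", "content": [
            {"type": "tool_use", "id": "t1", "name": "save_invoice", "input": {"n": 1}},
            {"type": "tool_use", "id": "t2", "name": "save_receipt", "input": {"n": 2}},
        ]}
    return {"stop_reason": "end_turn", "content": [{"type": "text", "text": "done"}]}


print(collect_records(fake_call_model, "Extract all records"))
```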

Pick a model

Model	Fit
`claude-haiku-4-5-20251001`	Default for high-volume extraction with simple schemas. Fastest, cheapest. Validate every record.
`claude-sonnet-4-6`	Complex schemas, deeply nested objects, ambiguous source text. Use as the retry tier above Haiku.
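`claude-opus-4-7`	Highest-stakes extraction (legal, medical) where missing or wrong fields are unacceptable. Worth enabling `thinking`; note that the Anthropic API accepts extended thinking only with tool_choice "auto" or "none", so the forced tool_choice has to be relaxed when it is on.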
