BUZZ AI Gateway

Data Extraction

Tool Use as a forced-JSON channel. Define an input_schema, set tool_choice to that tool, and Claude returns a structured object — no markdown fences, no commentary, no parsing tricks.

POST https://buzzai.cc/v1/messages
The trick. You're not actually going to "call" the tool. You're using input_schema as a JSON Schema constraint and tool_choice to force Claude to fill it. Pull the structured object out of the tool_use block and you're done.

Why this beats "respond with JSON"

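Asking the model to "respond in JSON" works most of the time. Most. Production needs all of the time. Tool Use gives you a structured object whose fields are typed by your schema, with no markdown fences or commentary to strip.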

Anatomy of an extraction request

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "save_invoice",
      "description": "Save the extracted invoice to the database.",
      "input_schema": { ... your schema ... }
    }
  ],
  "tool_choice": {"type": "tool", "name": "save_invoice"},
  "messages": [
    {"role": "user", "content": "Extract structured data from:\n\n" + raw_text}
  ]
}

The key field is tool_choice: {"type": "tool", "name": "save_invoice"}. That forces Claude to call exactly that tool: no other tool and no plain-text response. The response comes back with stop_reason: "tool_use", unless generation hits max_tokens first, in which case the tool input may be cut off.
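A minimal sketch of pulling the object out of a response, assuming the response has already been decoded into a plain dict in the Messages API shape. The function name tool_input_from_response and the example payload are illustrative, not part of any SDK.

```python
from typing import Any


def tool_input_from_response(response: dict[str, Any], tool_name: str) -> dict[str, Any]:
    """Return the forced tool's input object from a decoded response dict."""
    # Anything other than "tool_use" (for example "max_tokens") means the
    # tool input may be missing or incomplete, so refuse to use it.
    stop_reason = response.get("stop_reason")
    if stop_reason != "tool_use":
        raise RuntimeError(f"unexpected stop_reason: {stop_reason!r}")
    for block in response.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == tool_name:
            return block["input"]
    raise RuntimeError(f"no tool_use block for {tool_name!r}")


# Hand-written payload for illustration only (not real API output).
example = {
    "stop_reason": "tool_use",
    "content": [
        {
            "type": "tool_use",
            "id": "toolu_example",
            "name": "save_invoice",
            "input": {"invoice_number": "INV-2026-0042"},
        }
    ],
}
print(tool_input_from_response(example, "save_invoice"))
```

Checking stop_reason before reading the block keeps a truncated generation from being mistaken for a complete record.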

Schema field types you'll actually use

Strings

"name": {
  "type": "string",
  "description": "Customer's full name as written on the invoice."
}

For dates, emails, and similar well-known shapes, use format hints; Claude reads them:

"invoice_date": {"type": "string", "format": "date"},
"customer_email": {"type": "string", "format": "email"}
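A format hint steers the model, but many validators treat format as an annotation unless format checking is switched on, so it can be worth checking important formats yourself. Below is a small standard-library sketch for the date case; the helper name is_iso_date is illustrative.

```python
import re
from datetime import date

# Strict YYYY-MM-DD shape; the calendar check happens separately below.
_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value: str) -> bool:
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not _DATE_SHAPE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        # Right shape but not a real date, e.g. February 30th.
        return False
    return True


print(is_iso_date("2026-05-20"))
print(is_iso_date("2026-02-30"))
```

The regular expression comes first because date.fromisoformat accepts additional ISO 8601 spellings on newer Python versions.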

Integers and numbers

"line_count": {"type": "integer", "minimum": 1, "maximum": 1000},
"total_usd": {"type": "number", "minimum": 0}

Enums

Closed sets are the highest-leverage typing. Don't ask for "category as a string"; give the valid values and let the model pick:

"status": {
  "type": "string",
  "enum": ["draft", "sent", "paid", "overdue"]
}

Booleans and nullables

"is_recurring": {"type": "boolean"},
"discount_pct": {"type": ["number", "null"], "minimum": 0, "maximum": 100}

Nested objects

"customer": {
  "type": "object",
  "properties": {
    "name":  {"type": "string"},
    "email": {"type": "string", "format": "email"},
    "address": {
      "type": "object",
      "properties": {
        "street": {"type": "string"},
        "city":   {"type": "string"},
        "country_iso2": {"type": "string", "minLength": 2, "maxLength": 2}
      },
      "required": ["street", "city", "country_iso2"]
    }
  },
  "required": ["name", "email"]
}

Arrays of objects

"line_items": {
  "type": "array",
  "minItems": 1,
  "items": {
    "type": "object",
    "properties": {
      "sku":      {"type": "string"},
      "quantity": {"type": "integer", "minimum": 1},
      "unit_price_usd": {"type": "number", "minimum": 0}
    },
    "required": ["sku", "quantity", "unit_price_usd"]
  }
}

Full working example: extract invoice fields

"""
Extract a structured invoice from raw OCR text.
Requires: pip install anthropic jsonschema
"""
import os, json
from anthropic import Anthropic
from jsonschema import Draft202012Validator, ValidationError

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key=os.environ["BUZZ_API_KEY"],
)

INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "invoice_date":   {"type": "string", "format": "date"},
        "status": {
            "type": "string",
            "enum": ["draft", "sent", "paid", "overdue"],
        },
        "is_recurring": {"type": "boolean"},
        "customer": {
            "type": "object",
            "properties": {
                "name":  {"type": "string"},
                "email": {"type": "string", "format": "email"},
                "country_iso2": {
                    "type": "string", "minLength": 2, "maxLength": 2,
                },
            },
            "required": ["name", "email", "country_iso2"],
        },
        "line_items": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "properties": {
                    "sku":      {"type": "string"},
                    "quantity": {"type": "integer", "minimum": 1},
                    "unit_price_usd": {"type": "number", "minimum": 0},
                },
                "required": ["sku", "quantity", "unit_price_usd"],
            },
        },
        "total_usd": {"type": "number", "minimum": 0},
    },
    "required": [
        "invoice_number", "invoice_date", "status",
        "customer", "line_items", "total_usd",
    ],
}

EXTRACTOR_TOOL = {
    "name": "save_invoice",
    "description": "Save the extracted invoice to the database. Call this exactly once with the full extracted record.",
    "input_schema": INVOICE_SCHEMA,
}

VALIDATOR = Draft202012Validator(INVOICE_SCHEMA)


def extract(raw_text: str) -> dict:
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[EXTRACTOR_TOOL],
        tool_choice={"type": "tool", "name": "save_invoice"},
        messages=[
            {
                "role": "user",
                "content": (
                    "Extract every field from the invoice text below. "
                    "If a field is not present, omit it (do not guess). "
                    "Use ISO 8601 for dates and ISO-3166 alpha-2 for country.\n\n"
                    f"<INVOICE>\n{raw_text}\n</INVOICE>"
                ),
            }
        ],
    )

    tool_block = next((b for b in resp.content if b.type == "tool_use"), None)
    if tool_block is None:
        raise RuntimeError("model did not call the tool")

    data = tool_block.input

    # Validate against the schema. Surface all errors at once.
    errors = sorted(VALIDATOR.iter_errors(data), key=lambda e: e.path)
    if errors:
        msg = "\n".join(f"  - {list(e.path)}: {e.message}" for e in errors)
        raise ValidationError(f"Schema violations:\n{msg}")

    return data


if __name__ == "__main__":
    sample = """
    INVOICE #INV-2026-0042
    Date: 2026-05-20
    Bill to: Acme Robotics, hello@acme.example, US
    Status: paid (auto-debit, monthly recurring)

    SKU         Qty   Unit Price
    HW-WIDGET    3    19.99
    SW-LICENSE   1    99.00

    Total: $158.97
    """
    print(json.dumps(extract(sample), indent=2))

The same example in Node.js:

// Extract a structured invoice from raw text.
// Requires: npm i @anthropic-ai/sdk ajv ajv-formats
import Anthropic from "@anthropic-ai/sdk";
import Ajv from "ajv";
import addFormats from "ajv-formats";

const client = new Anthropic({
  baseURL: "https://buzzai.cc",
  apiKey: process.env.BUZZ_API_KEY,
});

const INVOICE_SCHEMA = {
  type: "object",
  properties: {
    invoice_number: { type: "string" },
    invoice_date: { type: "string", format: "date" },
    status: { type: "string", enum: ["draft", "sent", "paid", "overdue"] },
    is_recurring: { type: "boolean" },
    customer: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string", format: "email" },
        country_iso2: { type: "string", minLength: 2, maxLength: 2 },
      },
      required: ["name", "email", "country_iso2"],
    },
    line_items: {
      type: "array",
      minItems: 1,
      items: {
        type: "object",
        properties: {
          sku: { type: "string" },
          quantity: { type: "integer", minimum: 1 },
          unit_price_usd: { type: "number", minimum: 0 },
        },
        required: ["sku", "quantity", "unit_price_usd"],
      },
    },
    total_usd: { type: "number", minimum: 0 },
  },
  required: [
    "invoice_number", "invoice_date", "status",
    "customer", "line_items", "total_usd",
  ],
};

const EXTRACTOR_TOOL = {
  name: "save_invoice",
  description:
    "Save the extracted invoice to the database. " +
    "Call this exactly once with the full extracted record.",
  input_schema: INVOICE_SCHEMA,
};

const ajv = new Ajv({ allErrors: true, strict: false });
addFormats(ajv);
const validate = ajv.compile(INVOICE_SCHEMA);

export async function extract(rawText) {
  const resp = await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    tools: [EXTRACTOR_TOOL],
    tool_choice: { type: "tool", name: "save_invoice" },
    messages: [
      {
        role: "user",
        content:
          "Extract every field from the invoice text below. " +
          "If a field is not present, omit it (do not guess). " +
          "Use ISO 8601 for dates and ISO-3166 alpha-2 for country.\n\n" +
          `<INVOICE>\n${rawText}\n</INVOICE>`,
      },
    ],
  });

  const toolBlock = resp.content.find((b) => b.type === "tool_use");
  if (!toolBlock) throw new Error("model did not call the tool");

  const data = toolBlock.input;
  if (!validate(data)) {
    const msg = validate.errors
      .map((e) => `  - ${e.instancePath || "/"}: ${e.message}`)
      .join("\n");
    throw new Error(`Schema violations:\n${msg}`);
  }
  return data;
}

const sample = `
INVOICE #INV-2026-0042
Date: 2026-05-20
Bill to: Acme Robotics, hello@acme.example, US
Status: paid (auto-debit, monthly recurring)

SKU         Qty   Unit Price
HW-WIDGET    3    19.99
SW-LICENSE   1    99.00

Total: $158.97
`;
console.log(JSON.stringify(await extract(sample), null, 2));

Sample output

{
  "invoice_number": "INV-2026-0042",
  "invoice_date": "2026-05-20",
  "status": "paid",
  "is_recurring": true,
  "customer": {
    "name": "Acme Robotics",
    "email": "hello@acme.example",
    "country_iso2": "US"
  },
  "line_items": [
    {"sku": "HW-WIDGET",  "quantity": 3, "unit_price_usd": 19.99},
    {"sku": "SW-LICENSE", "quantity": 1, "unit_price_usd": 99.00}
  ],
  "total_usd": 158.97
}
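JSON Schema cannot express cross-field rules such as "the total equals the sum of the line items", so a post-extraction check is a cheap extra guard. The sketch below assumes the invoice carries no tax, shipping, or discount lines, which holds for the sample (3 × 19.99 + 1 × 99.00 = 158.97) but not for invoices in general. The function names are illustrative.

```python
from decimal import Decimal


def line_items_total(invoice: dict) -> Decimal:
    """Sum quantity * unit price over all line items, in exact decimal math."""
    total = Decimal("0")
    for item in invoice["line_items"]:
        # Go through str() so 19.99 becomes Decimal("19.99"), not its float expansion.
        total += Decimal(str(item["unit_price_usd"])) * item["quantity"]
    return total


def total_matches(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """True if total_usd agrees with the line items within the tolerance."""
    stated = Decimal(str(invoice["total_usd"]))
    return abs(line_items_total(invoice) - stated) <= tolerance


sample_output = {
    "line_items": [
        {"sku": "HW-WIDGET", "quantity": 3, "unit_price_usd": 19.99},
        {"sku": "SW-LICENSE", "quantity": 1, "unit_price_usd": 99.00},
    ],
    "total_usd": 158.97,
}
print(line_items_total(sample_output), total_matches(sample_output))
```

A mismatch does not always mean the model misread something (the source invoice may simply include charges outside the line items), so treating it as a flag for review is usually safer than rejecting the record outright.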

Validating the model's output

The model usually obeys the schema. Usually. Validate before you trust the data, as the extract function in the full example does with jsonschema (Python) or Ajv (Node.js).

What to do on validation failure

Three useful patterns, in order of effort:

  1. Retry once with the errors as feedback. Append the model's previous tool_use, then a user message containing the validation errors and "Call save_invoice again, fixing these issues." Works for ~90% of one-off failures.
  2. Drop the bad record into a quarantine table. For batch jobs where you can't block on per-record retries.
  3. Escalate the model. If Haiku fails validation, retry with Sonnet. If Sonnet fails, retry with Opus. Cost rises but so does compliance.

The first pattern in Python:

def extract_with_retry(raw_text, max_retries=2):
    # Reuses client, EXTRACTOR_TOOL, and VALIDATOR from the full example above.
    messages = [{"role": "user", "content": f"Extract...\n\n{raw_text}"}]
    for attempt in range(max_retries + 1):
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=2048,
            tools=[EXTRACTOR_TOOL],
            tool_choice={"type": "tool", "name": "save_invoice"},
            messages=messages,
        )
        # Use a default so a missing block raises a clear error, not StopIteration.
        tool_block = next((b for b in resp.content if b.type == "tool_use"), None)
        if tool_block is None:
            raise RuntimeError("model did not call the tool")
        errors = list(VALIDATOR.iter_errors(tool_block.input))
        if not errors:
            return tool_block.input
        # Feed errors back and try again: replay the assistant turn, then
        # answer its tool_use with an error tool_result listing the violations.
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": tool_block.id,
            "content": "Schema violations:\n" + "\n".join(
                f"- {list(e.path)}: {e.message}" for e in errors
            ) + "\n\nCall save_invoice again with corrected values.",
            "is_error": True,
        }]})
    raise ValueError("max retries exhausted")
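Patterns 2 and 3 from the list above can be combined: walk up the model tiers, and quarantine the record only when every tier fails. The sketch below is one possible shape, not a prescribed implementation. It takes the per-model extraction step as a callable (try_model, a hypothetical name) that returns the extracted data plus a list of validation error strings, so the control flow can be exercised without network access. A plain list stands in for the quarantine table.

```python
from typing import Callable, Optional

# Cheapest tier first; these are the model IDs named in this guide.
MODEL_LADDER = [
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
    "claude-opus-4-7",
]

# (model, raw_text) -> (data or None, list of validation error messages)
TryModel = Callable[[str, str], tuple[Optional[dict], list[str]]]


def extract_with_escalation(
    raw_text: str, try_model: TryModel, quarantine: list
) -> Optional[dict]:
    """Try each tier in order; quarantine the record if all of them fail."""
    last_errors: list[str] = []
    for model in MODEL_LADDER:
        data, errors = try_model(model, raw_text)
        if not errors:
            return data
        last_errors = errors
    # Every tier failed validation: park the record for later review.
    quarantine.append({"raw_text": raw_text, "errors": last_errors})
    return None


# Stand-in for a real call: pretend the cheapest tier misses a field.
def fake_try_model(model: str, raw_text: str):
    if model == "claude-haiku-4-5-20251001":
        return None, ["'total_usd' is a required property"]
    return {"extracted_by": model}, []


parked: list = []
print(extract_with_escalation("INVOICE ...", fake_try_model, parked), parked)
```

In a batch job, try_model would wrap the create call and the validator from the full example, and the quarantine list would be replaced by a write to whatever store holds failed records.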

Streaming partial JSON

For interactive UIs, stream the extraction. Each input_json_delta event carries a partial_json string fragment; concatenate them and parse incrementally with a tolerant JSON parser:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=[EXTRACTOR_TOOL],
    tool_choice={"type": "tool", "name": "save_invoice"},
    messages=[{"role": "user", "content": prompt}],
) as stream:
    buf = ""
    for event in stream:
        if (event.type == "content_block_delta" and
            event.delta.type == "input_json_delta"):
            buf += event.delta.partial_json
            # Render whatever you can parse so far in your UI
    final = json.loads(buf)

Use a streaming-tolerant parser (partial-json, or a small "complete to nearest closing brace" helper) to keep the UI updating before the final closing brace arrives.
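A minimal sketch of such a "complete to nearest closing brace" helper, using only the standard library. It closes any open string and brackets, and if the result still does not parse (a dangling key, a half-written literal), it backs off one character at a time until it does. The names are illustrative, and the back-off makes it quadratic in the worst case, which is usually acceptable for UI previews of small objects.

```python
import json
from typing import Any


def _closing_suffix(text: str) -> str:
    """Return the characters needed to close open strings and brackets."""
    stack = []
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return ('"' if in_string else "") + "".join(reversed(stack))


def parse_partial_json(buf: str) -> Any:
    """Parse the longest prefix of buf that can be completed into valid JSON.

    Returns None when nothing parseable has arrived yet.
    """
    for cut in range(len(buf), 0, -1):
        head = buf[:cut]
        try:
            return json.loads(head + _closing_suffix(head))
        except json.JSONDecodeError:
            continue
    return None


print(parse_partial_json('{"invoice_number": "INV-20'))
print(parse_partial_json('{"status": "paid", "is_rec'))
```

Inside the streaming loop, calling parse_partial_json(buf) after each delta gives a best-effort snapshot to render. Values in the snapshot can still be incomplete (a string or number may grow on the next delta), so only the final parsed object should be validated and stored.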

Multi-record extraction

Two ways to extract many records from one document:

Single tool, array field

Define one extractor with an array as the top-level shape:

"input_schema": {
  "type": "object",
  "properties": {
    "records": {"type": "array", "items": { ... record schema ... }}
  },
  "required": ["records"]
}

Simplest. Best when records share the same schema and the document has a known upper bound (say, a 5-page invoice with at most 50 line items).
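A small sketch of building that wrapper programmatically, so one record schema can be reused for both single-record and multi-record tools. The helper name records_schema is illustrative, and the maxItems bound is an assumption added here to encode the "known upper bound" in the schema itself.

```python
def records_schema(record_schema: dict, max_items: int) -> dict:
    """Wrap a per-record schema in an object with a bounded 'records' array."""
    return {
        "type": "object",
        "properties": {
            "records": {
                "type": "array",
                "items": record_schema,
                "maxItems": max_items,  # assumption: the document's known upper bound
            }
        },
        "required": ["records"],
    }


line_item = {
    "type": "object",
    "properties": {"sku": {"type": "string"}},
    "required": ["sku"],
}
print(records_schema(line_item, 50))
```

With many records in a single call, max_tokens needs to be large enough for the whole array, or the tool input will be truncated.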

Loop with tool_choice "any"

For variable-shape extraction (different record types, unknown count), set tool_choice: {"type": "any"} and let the model emit multiple tool_use blocks. Run a loop: append a tool_result acknowledging each save, then call again. Because "any" forces a tool call on every turn, switch follow-up turns to tool_choice: {"type": "auto"} so the model can finish with end_turn.
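A sketch of the bookkeeping for one turn of that loop, assuming the assistant content has been decoded into plain dicts: collect every tool_use block as a record and build the user message that acknowledges each one. The function name and the save_receipt tool are hypothetical.

```python
def acknowledge_tool_calls(content: list[dict]) -> tuple[list[dict], dict]:
    """Collect records from tool_use blocks and build the tool_result reply."""
    records = []
    results = []
    for block in content:
        if block.get("type") != "tool_use":
            continue  # skip text blocks and anything else
        records.append({"tool": block["name"], "data": block["input"]})
        # Each tool_use must be answered by a tool_result with the same id.
        results.append(
            {"type": "tool_result", "tool_use_id": block["id"], "content": "saved"}
        )
    return records, {"role": "user", "content": results}


# Hand-written assistant content for illustration (not real API output).
assistant_content = [
    {"type": "tool_use", "id": "toolu_a", "name": "save_invoice",
     "input": {"invoice_number": "INV-1"}},
    {"type": "tool_use", "id": "toolu_b", "name": "save_receipt",
     "input": {"receipt_id": "R-9"}},
]
records, reply = acknowledge_tool_calls(assistant_content)
print(records)
print(reply)
```

The caller appends the assistant turn and then reply to the message list before the next request, and stops once a response contains no tool_use blocks.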

Pick a model

Model                        Fit
claude-haiku-4-5-20251001    Default for high-volume extraction with simple schemas. Fastest, cheapest. Validate every record.
claude-sonnet-4-6            Complex schemas, deeply nested objects, ambiguous source text. Use as the retry tier above Haiku.
claude-opus-4-7              Highest-stakes extraction (legal, medical) where missing or wrong fields are unacceptable. Worth enabling thinking.
