
Tool Use With Claude Through a Gateway: Streaming, Errors, and Cost Patterns

Tool use is the core of every Claude agent. It is also where things quietly break when traffic flows through a gateway. This is a practical guide to the round trip, the streaming deltas, the retry boundaries, and the actual cost of one agent loop on BUZZ.

By BUZZ AI Gateway engineering · Updated 2026-05-22 · ~13 min read

Why tool use is the part that breaks

Plain text completion through an AI gateway is forgiving. The request goes up, tokens come back down, and even a sloppy proxy that buffers responses or rewrites JSON will usually still produce something usable. Tool use is different. It is a bidirectional protocol embedded in the same HTTP stream, with strict ordering between tool_use blocks the model produces and tool_result blocks the client must return. Any layer in the middle that touches the request body, normalizes JSON whitespace, drops unknown fields, or buffers the SSE stream until completion will eventually corrupt an agent loop in a way that is hard to reproduce.

That is why BUZZ treats tool-use traffic as a transparent forwarding case. The bytes you send are the bytes that arrive upstream. The bytes upstream returns are the bytes you receive. This article documents what that contract means in practice, where the failure modes hide, and what an agent loop costs once you account for the overhead Anthropic adds when tools are present.

What the tool-use round trip actually looks like

The Anthropic Messages API expresses tool use as content blocks rather than a side channel. A normal assistant turn returns a list of content blocks, each tagged with a type of text, tool_use, or thinking. When the model decides to call a tool, the response sets stop_reason to tool_use and its content array includes a tool_use block, as in this response body:

{
  "id": "msg_01Abc",
  "type": "message",
  "role": "assistant",
  "model": "claude-sonnet-4-6",
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "text",
      "text": "I'll look that up for you."
    },
    {
      "type": "tool_use",
      "id": "toolu_01XyZ",
      "name": "get_weather",
      "input": {"city": "Tokyo", "units": "metric"}
    }
  ],
  "usage": {"input_tokens": 412, "output_tokens": 87}
}

The client is then expected to execute the tool, capture the result, and start the next request with the assistant turn replayed verbatim and a new user turn that carries a matching tool_result block:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "tools": [...],
  "messages": [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "assistant", "content": [
      {"type": "text", "text": "I'll look that up for you."},
      {"type": "tool_use", "id": "toolu_01XyZ", "name": "get_weather",
       "input": {"city": "Tokyo", "units": "metric"}}
    ]},
    {"role": "user", "content": [
      {"type": "tool_result", "tool_use_id": "toolu_01XyZ",
       "content": "18C, partly cloudy"}
    ]}
  ]
}

Three properties of this protocol matter for gateways. First, the tool_use_id in the result must match the id the model emitted, byte for byte. Second, the assistant turn must be replayed exactly, with all blocks in the original order, otherwise the model loses its place and may either repeat the call or hallucinate that the tool already ran. Third, the conversation grows on every turn because the entire history is replayed, which is why prompt caching becomes load-bearing in long loops.
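The first two properties can be checked mechanically before a request leaves your process. The sketch below is a hypothetical pre-flight helper (check_tool_pairing is not part of any SDK) that works on plain-dict messages like the ones shown above. It assumes each assistant turn with tool_use blocks is followed by a user turn whose tool_result ids match in the same order, which is the convention this article follows.

```python
def check_tool_pairing(messages):
    """Return a list of human-readable pairing problems (empty means OK)."""
    problems = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or isinstance(msg["content"], str):
            continue
        # Ids the model emitted in this assistant turn, in their original order.
        expected = [b["id"] for b in msg["content"] if b.get("type") == "tool_use"]
        if not expected:
            continue
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        if nxt is None or nxt["role"] != "user" or isinstance(nxt["content"], str):
            problems.append(f"message {i}: tool_use without a following tool_result turn")
            continue
        # Ids the client is about to send back; comparison is exact, byte for byte.
        returned = [b["tool_use_id"] for b in nxt["content"]
                    if b.get("type") == "tool_result"]
        if returned != expected:
            problems.append(
                f"message {i}: expected tool_result ids {expected}, got {returned}")
    return problems


history = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "I'll look that up for you."},
        {"type": "tool_use", "id": "toolu_01XyZ", "name": "get_weather",
         "input": {"city": "Tokyo", "units": "metric"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01XyZ",
         "content": "18C, partly cloudy"},
    ]},
]

print(check_tool_pairing(history))
```

Running a check like this in tests or behind a debug flag tends to catch id mangling (for example, a layer that lowercases strings) long before it shows up as a confused agent.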

Through a gateway: what passes through unchanged

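BUZZ forwards Claude tool-use traffic without altering any field that participates in the protocol. In particular, the tools array, tool_choice, and any tool_use or tool_result blocks in the messages history pass through unchanged, as do cache_control markers and the individual SSE chunks of a streamed response.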

The practical implication is that any agent harness that already runs against api.anthropic.com works against BUZZ by changing one variable. With the official Python SDK that is Anthropic(base_url="https://buzzai.cc", api_key=...). If you prefer the OpenAI-compatible surface for Claude, the base URL is https://buzzai.cc/v1 and tool use is forwarded through the OpenAI tool-call shape. For Claude Code itself, the one-line installer at https://buzzai.cc/sh/claudecode.sh points the CLI at the gateway without touching project files.

What BUZZ does not do. The gateway never injects extra tools, never strips a tool, never modifies input_schema, never rewrites tool_use_id, and never coalesces SSE chunks. If a tool call fails through BUZZ, the same call will fail the same way directly against Anthropic.

Streaming plus tool use: where it usually breaks

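Streaming responses are emitted as Server-Sent Events. With tool use enabled, the same stream interleaves text deltas and tool-input deltas, and the client must reconstruct each block from its events. The minimum event shapes you have to handle are content_block_start, which opens a block and, for tool_use, carries its id and name; content_block_delta, which carries either a text_delta or an input_json_delta; and content_block_stop, which closes the block.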

The trap is that input_json_delta chunks are not individually valid JSON. You only get a parseable object after concatenating every delta in a block and parsing the full string at content_block_stop. Naive proxies that try to "validate" JSON on the way through corrupt this stream. Naive clients that call json.loads on each chunk fail on the first partial fragment. Here is a streaming handler that does it correctly with the Anthropic Python SDK and BUZZ as the base URL:

import json
import os
from anthropic import Anthropic

client = Anthropic(
    base_url="https://buzzai.cc",
    api_key=os.environ["BUZZ_API_KEY"],
)

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}]

# Accumulator: index -> {"type": ..., "name": ..., "id": ..., "input_buf": "", "text_buf": ""}
blocks = {}

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Weather in Tokyo, metric please."}],
) as stream:
    for event in stream:
        t = event.type

        if t == "content_block_start":
            cb = event.content_block
            blocks[event.index] = {
                "type": cb.type,
                "name": getattr(cb, "name", None),
                "id": getattr(cb, "id", None),
                "input_buf": "",
                "text_buf": "",
            }

        elif t == "content_block_delta":
            d = event.delta
            block = blocks[event.index]
            if d.type == "text_delta":
                block["text_buf"] += d.text
                print(d.text, end="", flush=True)
            elif d.type == "input_json_delta":
                block["input_buf"] += d.partial_json

        elif t == "content_block_stop":
            block = blocks[event.index]
            if block["type"] == "tool_use":
                # Now and only now is the JSON complete.
                block["input"] = json.loads(block["input_buf"] or "{}")
                print(f"\n[tool_use] {block['name']}({block['input']})")

    final = stream.get_final_message()
    print(f"\nstop_reason={final.stop_reason}, usage={final.usage}")

Three things to note. The input_buf is only parsed at content_block_stop, never before. The id captured at content_block_start is what you must echo back as tool_use_id when you submit the result. And the SDK's get_final_message() gives you a fully assembled message object that you can append to messages as the assistant turn for the next iteration without rebuilding it manually.
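To see why per-chunk parsing fails without touching the network, the sketch below replays a hand-written event sequence through a minimal accumulator. The events are hypothetical plain dicts shaped like the content_block_* events described above, and the chunk boundaries are invented for illustration; real streams split the JSON at arbitrary points.

```python
import json

# Hypothetical events for one tool_use block; chunk boundaries are made up.
events = [
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "tool_use", "id": "toolu_01XyZ", "name": "get_weather"}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": '{"city": "To'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": 'kyo", "units"'}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "input_json_delta", "partial_json": ': "metric"}'}},
    {"type": "content_block_stop", "index": 1},
]


def chunk_parses(chunk):
    """True if a single chunk is valid JSON on its own."""
    try:
        json.loads(chunk)
        return True
    except json.JSONDecodeError:
        return False


def assemble_tool_inputs(events):
    """Concatenate input_json_delta chunks per block, parse only at block stop."""
    buffers, inputs = {}, {}
    for ev in events:
        if ev["type"] == "content_block_start" and ev["content_block"]["type"] == "tool_use":
            buffers[ev["index"]] = ""
        elif ev["type"] == "content_block_delta" and ev["delta"]["type"] == "input_json_delta":
            buffers[ev["index"]] += ev["delta"]["partial_json"]
        elif ev["type"] == "content_block_stop" and ev["index"] in buffers:
            # A tool with no arguments may stream an empty buffer.
            inputs[ev["index"]] = json.loads(buffers.pop(ev["index"]) or "{}")
    return inputs


chunks = [ev["delta"]["partial_json"] for ev in events
          if ev["type"] == "content_block_delta"]
print([chunk_parses(c) for c in chunks])
print(assemble_tool_inputs(events))
```

None of the three fragments parses alone, while the concatenated buffer yields the full arguments object.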

Cost patterns: what one agent loop actually costs

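Tool use adds two consistent overheads on top of your prompt and the model's response: the tool-use system prompt that Claude prepends whenever tools are set (346 input tokens per request on Claude 4.x), and the tool definitions themselves, whose JSON schema text typically runs 50 to 200 tokens per tool. Both are billed as input tokens on every request that includes tools, even if the model never calls one.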

To make this concrete, here is a representative single-turn call that triggers one tool: a 600-token user prompt, three tool definitions averaging 120 tokens each, the 346-token tool-use overhead, and a model response containing a 40-token text preamble plus a tool_use block whose JSON input is 30 tokens. The accounting on /api/pricing for Sonnet looks like this:

Component                    Tokens   Bill side
User prompt                  600      input
Tool-use system prompt       346      input
3 tool definitions           360      input
Subtotal input (turn 1)      1306     input
Assistant text + tool_use    70       output

Now the second turn. The client executes the tool, gets back, say, a 250-token result, and re-sends the entire history plus the tool_result block. The same 346-token overhead and 360-token definitions are charged again because they are part of every request that has tools set. On top of those 706 tokens, the request carries the original 600-token user prompt, the 70-token assistant turn replayed verbatim, and the 250-token tool_result, for an input subtotal of 706 + 600 + 70 + 250 = 1626 tokens. If the model now answers in 180 output tokens, the full two-turn loop has cost 1306 + 70 + 1626 + 180 = 3182 tokens at Sonnet rates: 2932 billed as input and 250 as output.
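The same accounting can be written as a few lines of arithmetic, which makes it easy to plug in your own numbers. This is a sketch using the token counts from the example above; it assumes the replayed assistant turn costs the same 70 tokens as input that it cost as output, and it ignores any small wrapper overhead for the tool_result block.

```python
TOOL_OVERHEAD = 346        # tool-use system prompt, per request
TOOL_DEFS = 3 * 120        # three tool definitions
USER_PROMPT = 600
ASSISTANT_TURN = 40 + 30   # text preamble + tool_use JSON input
TOOL_RESULT = 250
FINAL_ANSWER = 180


def two_turn_loop():
    """Token accounting for one tool call followed by a final answer."""
    turn1_in = USER_PROMPT + TOOL_OVERHEAD + TOOL_DEFS
    turn1_out = ASSISTANT_TURN
    # Turn 2 replays everything from turn 1 plus the assistant turn and the result.
    turn2_in = turn1_in + ASSISTANT_TURN + TOOL_RESULT
    turn2_out = FINAL_ANSWER
    return {
        "turn1_in": turn1_in,
        "turn2_in": turn2_in,
        "input_total": turn1_in + turn2_in,
        "output_total": turn1_out + turn2_out,
        "total": turn1_in + turn1_out + turn2_in + turn2_out,
    }


print(two_turn_loop())
```

Input and output are totalled separately because they are billed at different per-token rates.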

Two practical levers reduce that. First, add a cache_control breakpoint at the end of tools; on the second turn the 346 + 360 = 706 tokens of overhead read from cache at 0.1x the base input rate instead of full price. Second, mark the system prompt as cacheable too if you have one. Anthropic enforces a per-model minimum cacheable prefix length, so confirm that the cached prefix is long enough to qualify. For long agent loops with five to ten iterations, those two breakpoints typically cut the input bill in half. The exact multipliers and the live per-token rates are published on https://buzzai.cc/api/pricing; the gateway forwards cache_control markers unchanged, so caching works the same as it does directly against Anthropic.
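As a rough sketch of the first lever, the helper below marks the last tool definition with a cache_control breakpoint of type ephemeral and estimates the second-turn input in full-price-equivalent tokens. The helper names are hypothetical. The estimate follows this article's accounting (706 cached tokens read at 0.1x, 920 uncached tokens at full price), ignores any cache-write surcharge on the first request, and assumes the prefix is long enough to be cached at all.

```python
import copy


def with_tools_cache_breakpoint(tools):
    """Return a copy of tools with a cache breakpoint on the last definition."""
    marked = copy.deepcopy(tools)  # leave the caller's list untouched
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


def input_equivalent(cached_prefix, uncached, read_multiplier=0.1):
    """Input cost in full-price-equivalent tokens when the prefix is a cache hit."""
    return uncached + cached_prefix * read_multiplier


tools = [
    {"name": "get_weather", "description": "Current weather for a city.",
     "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}}},
    {"name": "search_docs", "description": "Search internal documentation.",
     "input_schema": {"type": "object", "properties": {"q": {"type": "string"}}}},
]

cached_tools = with_tools_cache_breakpoint(tools)
print(cached_tools[-1]["cache_control"])

# Turn 2 from the example: 706 cached tokens, 600 + 70 + 250 = 920 uncached.
print(round(input_equivalent(706, 920), 1))
```

Pass the marked list as tools= on every request in the loop; the breakpoint only helps if the tool definitions stay byte-identical between calls.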

Error handling and retry boundaries

An agent loop has two retry boundaries that do not move together. The first is the API call to the gateway. The second is the local execution of the tool itself. Conflating them is the most common source of duplicate side effects, such as sending a notification twice or charging a card twice when something half-failed.

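Treat each boundary independently. Retry the API call with exponential backoff on 502, 503, 504, and 529. Do not automatically retry the local tool execution; if it fails on your side, return the failure as a tool_result with is_error: true and let the model decide what to do next.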

The discipline that ties this together: a tool execution failure is data the model needs, not an exception to escalate. Surfacing it as a structured tool_result keeps the conversation in a valid state. Throwing away the assistant turn and silently retrying is what produces duplicate charges and confused agents.
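One way to keep the two boundaries apart is to give each its own small wrapper. The sketch below is illustrative only: ApiStatusError is a hypothetical stand-in for whatever exception your HTTP client or SDK raises, and send is any zero-argument callable that performs the request. The official SDK also has its own max_retries setting, so in practice you may only need the second function.

```python
import time

RETRYABLE_STATUS = {502, 503, 504, 529}


class ApiStatusError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed API call."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def call_with_backoff(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Boundary 1: retry the API call on retryable statuses, doubling the delay."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ApiStatusError as err:
            is_last = attempt == max_attempts - 1
            if err.status_code not in RETRYABLE_STATUS or is_last:
                raise
            sleep(base_delay * 2 ** attempt)


def run_tool_once(tool_use_id, fn, args):
    """Boundary 2: execute the tool exactly once; a failure becomes data."""
    try:
        content = str(fn(**args))
        return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
    except Exception as exc:
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "is_error": True, "content": f"{type(exc).__name__}: {exc}"}
```

Because run_tool_once never retries, a tool with side effects runs at most once per tool_use block, and the model sees the failure instead of a silent second attempt.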

A realistic agent loop

Here is a complete agent loop, around 65 lines, that you can paste into a script. It connects to BUZZ, registers two tools, runs the conversation until the model stops calling tools, and enforces both a turn cap and a token cap. The body is deliberately compact so you can read it end to end.

import json
import os
from anthropic import Anthropic

client = Anthropic(base_url="https://buzzai.cc", api_key=os.environ["BUZZ_API_KEY"])

TOOLS = [
    {"name": "get_weather", "description": "Current weather for a city.",
     "input_schema": {"type": "object", "required": ["city"],
        "properties": {"city": {"type": "string"}}}},
    {"name": "search_docs", "description": "Search internal documentation.",
     "input_schema": {"type": "object", "required": ["q"],
        "properties": {"q": {"type": "string"}, "limit": {"type": "integer"}}}},
]

def execute_tool(name, args):
    try:
        if name == "get_weather":
            return f"{args['city']}: 18C, partly cloudy"
        if name == "search_docs":
            return f"Top result for {args['q']!r}: see internal-runbook#42"
        return {"is_error": True, "content": f"unknown tool {name}"}
    except Exception as e:
        return {"is_error": True, "content": f"{type(e).__name__}: {e}"}

def run_agent(user_prompt, max_turns=15, token_budget=200_000):
    messages = [{"role": "user", "content": user_prompt}]
    total_tokens = 0
    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=TOOLS,
            messages=messages,
        )
        total_tokens += resp.usage.input_tokens + resp.usage.output_tokens
        if total_tokens > token_budget:
            raise RuntimeError(f"token budget exceeded: {total_tokens}")

        # Replay the assistant turn verbatim.
        messages.append({"role": "assistant", "content": resp.content})

        if resp.stop_reason != "tool_use":
            return resp, messages

        # Build a single user turn with one tool_result per tool_use block.
        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            output = execute_tool(block.name, block.input)
            if isinstance(output, dict) and output.get("is_error"):
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "is_error": True, "content": output["content"]})
            else:
                results.append({"type": "tool_result", "tool_use_id": block.id,
                                "content": str(output)})
        messages.append({"role": "user", "content": results})
    raise RuntimeError(f"agent did not converge within {max_turns} turns")

if __name__ == "__main__":
    final, history = run_agent("What's the weather in Tokyo, and find docs on rate limiting.")
    text = next((b.text for b in final.content if b.type == "text"), "")
    print(text)

A few details worth pointing out. The assistant turn is appended as resp.content directly, because the SDK objects round-trip cleanly back into the next request. Each tool_use block in a single assistant turn produces exactly one tool_result block in the following user turn, in the same order, all bundled into one user message. execute_tool never raises out of the loop; failures become structured is_error results so the conversation stays valid. The two caps, max_turns and token_budget, exist to bound runaway loops where the model keeps calling tools or where one tool returns a 50k-token blob that the model keeps re-reading.

The script already points at the gateway through base_url, so it runs with no other change once BUZZ_API_KEY is set. You can obtain the key from your account on https://buzzai.cc/, and the available model identifiers are listed on https://buzzai.cc/models.

FAQ

Does Claude tool use work the same way through a gateway?

Yes, when the gateway is transparent. BUZZ forwards the request body unchanged, including the tools array, tool_choice, and any tool_use or tool_result blocks in the messages history. SDK code that already runs against api.anthropic.com works by switching base_url to https://buzzai.cc.

Can I stream a response that contains tool calls?

Yes. Tool calls appear inside the same SSE stream as text. Each call is a content_block_start with tool_use, followed by input_json_delta events that incrementally build the JSON arguments, then content_block_stop. The gateway forwards every chunk without buffering.

How much does the tool definition itself cost in tokens?

Two costs. Claude prepends a tool-use system prompt that adds 346 input tokens per request on Claude 4.x. Each tool definition adds its own JSON schema text, typically 50 to 200 tokens. Both are billed as input on every request that includes tools, even if the model never calls a tool.

What happens if the model returns malformed JSON in tool input?

Catch the parse error and return a tool_result with is_error: true and a short message describing the failure. The model usually self-corrects on the next turn rather than retrying the same broken call.
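A minimal sketch of that pattern, for the case where you assemble the input string yourself from a stream: parse_tool_input is a hypothetical helper, and the error wording is only a suggestion.

```python
import json


def parse_tool_input(tool_use_id, raw_json):
    """Return (args, None) on success or (None, error_tool_result) on failure."""
    def error(message):
        return None, {"type": "tool_result", "tool_use_id": tool_use_id,
                      "is_error": True, "content": message}

    try:
        args = json.loads(raw_json or "{}")
    except json.JSONDecodeError as exc:
        return error(f"invalid JSON in tool input: {exc.msg}")
    if not isinstance(args, dict):
        return error("tool input must be a JSON object")
    return args, None


args, err = parse_tool_input("toolu_01XyZ", '{"city": "Tokyo"}')
print(args, err)

args, err = parse_tool_input("toolu_01XyZ", '{"city": ')
print(args, err["is_error"])
```

When err is not None, append it to the next user turn in place of a normal result and skip the tool execution for that block.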

Should I retry on a 5xx during a tool call?

Retry the API call with exponential backoff on 502, 503, 504, and 529. Do not retry the local tool execution. If the failure happened on your side, surface it as tool_result with is_error: true. Mixing the two retry boundaries is the most common source of duplicate side effects.

Does the gateway add latency to tool use?

BUZZ adds a small fixed forwarding overhead, typically a few tens of milliseconds, and streams chunks through as they arrive. Because tool-use loops are dominated by model think time and tool execution time, that overhead is not visible in end-to-end traces.

Are tool inputs and outputs stored by the gateway?

No. BUZZ operates with zero data retention. Request bodies, including tool arguments, and response bodies, including tool_use blocks, are not written to disk, databases, or logs. Only billing metadata is retained.

Can I mix tool use with prompt caching?

Yes, and for any non-trivial tool list you should. Place a cache_control breakpoint at the end of tools. Subsequent calls within the cache TTL pay the cache read rate (0.1x base input) for the entire tool definitions block, instead of full price for those 500 to 2000 tokens on every turn.

How do I cap an agent loop so it cannot run forever?

Two limits. A turn count cap, typically 10 to 25 iterations, that breaks the loop if the model keeps requesting tools. An aggregate token budget, checked from usage on each response, that aborts when the conversation crosses a threshold like 200000 tokens. The token cap protects against runaway cost when a tool returns very large outputs.

Which Claude models support tool use through BUZZ?

Every current Claude model on https://buzzai.cc/models, including Opus 4.8, Sonnet 4.6, and Haiku 4.5. Tool calling is part of the Messages API contract and is forwarded transparently.

Conclusion

Tool use is where agent systems earn or lose their reliability. The protocol is small but unforgiving: tool_use_id must round-trip exactly, the assistant turn must be replayed verbatim, the SSE stream must not be buffered, and the JSON in input_json_delta must be assembled before parsing. A transparent gateway does not get in the way of any of that. BUZZ forwards the bytes, preserves the streaming contract, retains nothing about the content, and prices tool-use overhead the same way Anthropic does, on top of a discounted base rate.

If you are starting from scratch, the agent loop above is a complete reference. If you already have an agent harness, switching to the gateway is a one-variable change: set base_url="https://buzzai.cc" in Python, or the equivalent in TypeScript. For Claude Code, use the installer at https://buzzai.cc/sh/claudecode.sh. Live per-token rates are on /api/pricing and the active model list is on /models.

Published: 2026-05-22