Cutting Claude Code Costs Without Losing Capability: Routing Through a Gateway
Claude Code is the most capable agentic coding CLI most engineers have ever used, and it bills like it. A single base URL change keeps every feature working and lowers the per-token rate at the same time.
Claude Code earns its reputation. It reads your repo, edits files, runs commands, calls MCP servers, and chains tool calls inside one conversation until the task is actually finished. The cost of that capability is paid in tokens. A focused afternoon of agentic work can put a serious dent in an Anthropic balance, and a team running it on a daily basis quickly notices.
The good news is that the cost lever is small. Claude Code reads ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN from the environment. Pointing it at the BUZZ AI Gateway takes one line in a terminal, leaves the CLI behavior identical, and changes the billing destination to a per-token rate that is below the rates published on the official pricing page. Nothing about the code you send, the prompts the agent assembles, or the streamed responses changes. Only the meter changes.
This piece walks through why Claude Code burns through credits the way it does, the two real strategies for cutting that bill, the sixty-second setup, and what stays the same versus what changes when the gateway is in front.
Why Claude Code burns through API credits
The per-token price does not depend on which client sends the request, but Claude Code has a usage shape that multiplies token counts quickly. It is worth understanding the loop before optimizing it.
Each turn in an agentic session is a full request. The CLI takes the system prompt, the tool definitions, the entire running conversation, the file contents the agent has loaded, and any tool results, and posts the whole thing to the Messages API. The model replies with a mix of assistant text and tool use blocks. Claude Code executes the tools locally, appends the results to the conversation, and posts again. That cycle is the agentic loop.
What this means in practice:
- Input tokens climb fast. Every previous tool result, every file the agent read, and every prior message stays in the context window for the rest of the session.
- Output tokens are smaller per turn but the turns are frequent. A single user instruction can fan out to ten or twenty model calls behind the scenes.
- Reasoning models spend extra output tokens on thinking blocks that you do not see in the final reply but that you are billed for.
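The growth pattern in the list above can be illustrated with a small simulation. This is a minimal sketch, not a model of Claude Code's real internals: the function name, the prefix size, and the per-turn token figures are all assumptions chosen only to show the shape.

```python
def simulate_session(turns, base_tokens, added_per_turn, output_per_turn):
    """Return (total_input_tokens, total_output_tokens) for one session.

    Every request resends the whole conversation, so the input for a turn is
    the fixed prefix plus everything appended during the previous turns.
    """
    total_input = 0
    total_output = 0
    context = base_tokens  # system prompt + tool definitions (assumed size)
    for _ in range(turns):
        total_input += context           # the full context is posted again
        total_output += output_per_turn
        # The reply and the tool results it triggered join the conversation.
        context += output_per_turn + added_per_turn
    return total_input, total_output


# Illustrative numbers only: 20 model calls behind one instruction.
total_in, total_out = simulate_session(
    turns=20, base_tokens=15_000, added_per_turn=3_000, output_per_turn=500
)
print(total_in, total_out)
```

Under these assumptions the output total grows linearly with the number of turns, while the input total grows roughly with the square of it, which is why input tokens tend to dominate the bill.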
A back-of-the-envelope estimate makes the shape clear. Suppose a focused day of Claude Code work runs about fifty agent turns, with an average of fifty thousand input tokens per turn (a moderately full context) and five thousand output tokens per turn. At an Opus-class rate of five dollars per million input tokens and twenty-five dollars per million output tokens, the math is:
cost_per_turn = (50,000 * $5 + 5,000 * $25) / 1,000,000
= ($250,000 + $125,000) / 1,000,000
= $0.375
cost_per_day = 50 turns * $0.375
= $18.75
That is a single engineer on a single day, with average turn sizes that are easy to exceed once the agent loads several files into context. Scale to a team, or to one engineer who runs Claude Code in the background while solving harder problems, and the monthly bill is no longer a rounding error.
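The same arithmetic as a small helper, so the inputs can be swapped for your own numbers. The turn counts and rates are the ones from the estimate above; the team size and working days at the end are hypothetical values added purely for illustration.

```python
def turn_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one request; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Figures from the estimate above.
per_turn = turn_cost(50_000, 5_000, input_rate=5, output_rate=25)
per_day = 50 * per_turn
print(f"per turn: ${per_turn:.3f}")  # per turn: $0.375
print(f"per day:  ${per_day:.2f}")   # per day:  $18.75

# Hypothetical scaling: 5 engineers, 21 working days a month.
per_month_team = per_day * 5 * 21
print(f"team per month: ${per_month_team:,.2f}")  # team per month: $1,968.75
```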
Two things compound this further. First, prompt caching helps a lot, but only while the prefix of the request stays stable, and Claude Code rewrites parts of the conversation as the agent edits files. Second, Opus is billed at a higher per-token rate than Sonnet on both input and output (the exact multiple depends on the model generation, so check the current pricing page), and any session pinned to Opus inherits a steeper curve.
Two strategies for cutting cost
There are two clean ways to bring this number down. They are not exclusive. Most teams end up using both.
1. Tier the model to the task
Not every Claude Code turn deserves Opus. The strongest reasoning model is the right choice when the task is genuinely hard: ambiguous refactors across many files, tricky concurrency bugs, novel architecture decisions. The rest of the time, Sonnet handles agentic coding well, and Haiku is enough for simple file edits, regex work, or batch reformatting.
The strategy:
- Default to Sonnet for daily development.
- Reserve Opus for the cases where Sonnet has actually failed or where the cost of the wrong answer is high.
- Drop to Haiku for cleanup tasks where you already know the shape of the change.
The downside of model tiering by itself is that it changes the way you work. You have to remember to switch tiers, and you sometimes pay the cost of a Sonnet attempt that fails before falling back to Opus. It is a real lever but a partial one.
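The cost of a failed Sonnet attempt can be put into numbers. The sketch below is a rough expected-value calculation, not a measurement: the Opus rate is the one used in the earlier estimate, while the Sonnet rate (three and fifteen dollars per million tokens) and the failure probability are assumptions for illustration.

```python
def turn_cost(input_tokens, output_tokens, input_rate, output_rate):
    """Dollar cost of one request; rates are dollars per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


def expected_cost_sonnet_first(sonnet_cost, opus_cost, p_sonnet_fails):
    """Always pay for Sonnet; pay for Opus as well when Sonnet fails."""
    return sonnet_cost + p_sonnet_fails * opus_cost


def break_even_failure_rate(sonnet_cost, opus_cost):
    """Failure rate above which going straight to Opus is cheaper."""
    return 1 - sonnet_cost / opus_cost


opus = turn_cost(50_000, 5_000, input_rate=5, output_rate=25)
sonnet = turn_cost(50_000, 5_000, input_rate=3, output_rate=15)  # assumed rate

print(f"Opus only:           ${opus:.3f}")
print(f"Sonnet first, p=0.2: ${expected_cost_sonnet_first(sonnet, opus, 0.2):.3f}")
print(f"Break-even p:        {break_even_failure_rate(sonnet, opus):.2f}")
```

With these assumed rates, trying Sonnet first stays cheaper on average until Sonnet fails on about 40% of tasks. The calculation ignores the time lost to a failed attempt, which is often the larger cost.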
2. Route through a gateway
The other lever is to change the meter, not the model. BUZZ AI is a gateway in front of the Anthropic API. The Claude Code CLI sends the same requests, the gateway forwards them to Anthropic, and the response streams back. Per-token rates on BUZZ are below the rates published on the official pricing page; the live numbers for every model live on the pricing endpoint and the models page.
The shape of the saving is simple: every token Claude Code would have sent to Anthropic still goes to Anthropic, but the bill goes to BUZZ at a lower rate. There is no behavior change to adapt to, no decision to remember at the keyboard, and the change is reversible in seconds.
The strongest version of the strategy uses both levers at once. Tier the model where it makes sense, route through the gateway for everything.
Setting up Claude Code with BUZZ in 60 seconds
One command does the whole setup:
curl -fsSL https://buzzai.cc/sh/claudecode.sh | bash
The install script runs entirely in your home directory and never asks for sudo unless your system package manager requires it. Here is what it does, end to end:
- Detects your shell environment. macOS, Linux, WSL, Alpine, and Git Bash on Windows are all handled. The script picks the right shell config (.zshrc, .bashrc, .bash_profile, or .profile) for the persistent settings.
- Installs the Claude Code CLI if missing. It prefers Anthropic's native installer (the same one Anthropic publishes at claude.ai/install.sh), falls back to platform package managers (Homebrew, apt, dnf, apk, winget), and finally falls back to npm if nothing else works. If Claude Code is already installed via a different method, it offers to upgrade to the native build.
- Configures the gateway endpoint. It writes ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN to your shell config and validates the key against the gateway's /v1/models endpoint before saving. If validation fails, it tells you why instead of silently moving on.
- Installs the balance status line plugin. A small script (Node, Python, or pure Bash, whichever runtime you have) runs as Claude Code's statusLine command. It reads your remaining balance from the gateway every minute and renders it next to the model name in the CLI. The result is a live cost meter without leaving the terminal.
- Sets the onboarding flag. It marks ~/.claude.json as onboarded so the CLI does not prompt you to log in via the official browser flow.
After the script finishes, open a new terminal so the new environment variables load, then run claude in any project. The status line shows the active model and the remaining USD balance. That is the verification: if the balance renders, the gateway is being called.
For an unattended install, the same script takes the key and base URL as flags:
curl -fsSL https://buzzai.cc/sh/claudecode.sh | bash -s -- \
--yes --api-key=YOUR_BUZZ_KEY --base-url=https://buzzai.cc
The script also accepts --install-only, --configure-only, --statusline-only, and --uninstall for partial runs.
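If the status line does not appear, a quick check is whether the new terminal actually picked up the two variables. The helper below is a minimal sketch under that assumption; the function name and the returned keys are made up for this example, and it deliberately reports only whether the token is set, never its value.

```python
import os

OFFICIAL_HOST = "api.anthropic.com"


def gateway_env_status(env):
    """Summarize whether the two gateway variables are present in env."""
    base_url = env.get("ANTHROPIC_BASE_URL", "")
    token = env.get("ANTHROPIC_AUTH_TOKEN", "")
    return {
        "base_url_set": bool(base_url),
        "token_set": bool(token),  # presence only; the token is never printed
        "points_at_gateway": bool(base_url) and OFFICIAL_HOST not in base_url,
    }


# Inspect the current shell environment.
print(gateway_env_status(dict(os.environ)))
```

If either flag comes back False, the shell config written by the script has probably not been sourced yet in that terminal.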
What stays the same, what changes
This is the part that matters. A relay that quietly rewrites prompts or downgrades models is not actually saving you money; it is changing what you are paying for. The only gateway worth using is one that is fully transparent. Here is the line-by-line comparison:
| Aspect | Direct to Anthropic | Through BUZZ Gateway |
|---|---|---|
| Model behavior | Anthropic Messages API | Identical. Forwarded as-is. |
| Extended thinking | Supported | Supported. thinking param and blocks pass through. |
| Tool use | Supported, streamed | Supported, streamed byte-for-byte. |
| Prompt caching | Supported | Supported. Cache reads billed at cached rate. |
| MCP servers | Local to Claude Code | Unchanged. Gateway only handles the model call. |
| SDK compatibility | Anthropic SDK | Anthropic SDK and OpenAI SDK both work. |
| Model selection | You pick | You pick. No silent swaps. Opus stays Opus. |
| Per-token price | Official rates | Below official rates (see pricing) |
| Billing destination | Anthropic | BUZZ |
| Auth header | x-api-key: sk-ant-... | Same Anthropic-style header, BUZZ-issued key. |
| Request retention | Per Anthropic policy | Zero. Only billing metadata stored. |
The summary line is short: behavior identical, billing different. If you select Opus in Claude Code, Opus is what runs. The gateway forwards the request body without rewriting the model field, the system prompt, or the tool definitions. There is no shadow router that quietly downgrades hard requests to Sonnet to save money. The only safe gateway is one where the request you send is the request the upstream model receives.
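One cheap sanity check on the "no silent swaps" claim is to compare the model you asked for with the model field that the Messages API returns in each response body. The sketch below assumes the response has already been parsed into a dict; the function name and the model IDs are placeholders, not real identifiers.

```python
def model_was_honored(requested_model, response_body):
    """True if the response reports the requested model.

    An alias may resolve to a longer, dated ID, so a match on the requested
    name followed by a hyphenated suffix is also accepted.
    """
    returned = response_body.get("model", "")
    return returned == requested_model or returned.startswith(requested_model + "-")


# Placeholder IDs for illustration only.
print(model_was_honored("claude-opus-example",
                        {"model": "claude-opus-example-20250101"}))
print(model_was_honored("claude-opus-example",
                        {"model": "claude-sonnet-example"}))
```

This only checks what the response reports. A relay that rewrites requests could also rewrite that field, so treat it as a first-pass check rather than proof.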
Pricing tier strategy in Claude Code
Once cost is no longer a sharp constraint, the more interesting question is which model to ask for in the first place. A useful rough split:
Opus tier — when the answer matters more than the speed
- Architectural decisions and design reviews where wrong is expensive.
- Cross-file refactors that touch type systems, public APIs, or threading.
- Hairy debugging where the failure mode is non-obvious.
- Initial planning passes where you want the agent to think rather than rush.
Sonnet tier — the daily driver
- Feature work in a known codebase.
- Test writing, especially when scaffolding from existing tests.
- Routine bug fixes with a clear repro.
- Code review on a focused diff.
Haiku tier — when the shape is already known
- Bulk regex-style edits and renames where no judgement is required.
- Boilerplate generation from a clearly specified template.
- Quick lookups and one-shot questions where speed beats depth.
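For teams that want the split written down rather than remembered, the three lists above can be captured as a lookup table. This is only a sketch: the task labels are invented for the example, and the tier strings are plain names, not a claim about any particular CLI flag.

```python
TIER_BY_TASK = {
    # Opus tier: the answer matters more than the speed.
    "architecture": "opus",
    "cross_file_refactor": "opus",
    "hard_debugging": "opus",
    "planning": "opus",
    # Sonnet tier: the daily driver.
    "feature_work": "sonnet",
    "tests": "sonnet",
    "routine_bugfix": "sonnet",
    "code_review": "sonnet",
    # Haiku tier: the shape is already known.
    "bulk_edit": "haiku",
    "boilerplate": "haiku",
    "quick_lookup": "haiku",
}


def pick_tier(task_kind, default="sonnet"):
    """Return the suggested tier, falling back to the daily driver."""
    return TIER_BY_TASK.get(task_kind, default)


print(pick_tier("planning"), pick_tier("bulk_edit"), pick_tier("something_else"))
```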
Combine this with prompt caching. The system prompt and tool definitions Claude Code sends are stable, so once they are marked as cacheable the cached input rate applies on subsequent turns. On long sessions, that compounds. The gateway bills cache reads at the cached rate, so the saving is real, not theoretical.
Common questions
Can I keep using my official Anthropic key for non-Claude-Code work?
Yes. The gateway is opt-in per terminal. Claude Code reads ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN from the shell environment, so the BUZZ values only apply where they are exported. Other tools that use the official Anthropic SDK with their own ANTHROPIC_API_KEY continue to call api.anthropic.com untouched. If you want fine-grained control, set the gateway variables only in the project shell where you run Claude Code.
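One way to keep the gateway scoped to a single process is to build the environment explicitly and hand it only to the Claude Code child process. The sketch below assumes a launcher script of your own; the function name and the BUZZ_API_KEY variable in the commented usage are hypothetical.

```python
import os


def claude_env(base_env, base_url, token):
    """Return a copy of base_env with the two gateway variables set.

    The original mapping is left untouched, so other tools started from the
    same shell keep whatever credentials they already use.
    """
    env = dict(base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = token
    return env


env = claude_env(os.environ, "https://buzzai.cc", "YOUR_BUZZ_KEY")
print(env["ANTHROPIC_BASE_URL"])

# Usage sketch (not executed here):
#   import subprocess
#   subprocess.run(["claude"], env=claude_env(
#       os.environ, "https://buzzai.cc", os.environ["BUZZ_API_KEY"]))
```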
Does extended thinking still work through the gateway?
Yes. The thinking parameter, the reasoning budget, and the streamed thinking blocks all pass through unchanged. Anything Claude Code can ask for, including the visible reasoning summaries, comes back the same way it would directly from Anthropic.
Are tool calls still streaming?
Yes. The gateway streams server-sent events as they arrive from the upstream. Tool use blocks, partial JSON arguments, and content chunks are forwarded byte for byte, so Claude Code sees the same incremental output and the same first-token latency profile it would over the official endpoint.
How do I switch back to the official Anthropic endpoint?
Two ways. For a one-shot override, run unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN in the current shell, then set ANTHROPIC_API_KEY to your official key. For a permanent switch, run bash claudecode.sh --uninstall; the script removes the gateway environment lines from your shell config and uninstalls the status line plugin. The Claude Code CLI itself is left in place.
Is there a usage dashboard?
Yes. The BUZZ dashboard shows per-token usage broken down by model, time window, and individual API key. The status line plugin installed by the install script also shows live remaining balance directly inside the Claude Code CLI, color-coded as it gets low. No separate web tab needed for the daily check.
What about the source code I send to Claude Code?
BUZZ runs a zero-retention policy. Request and response bodies are not stored. Only billing metadata is recorded: model name, input and output token counts, cache read and write counts, status code, and timestamp. The bytes flow through and disappear. If your team has a hard rule that source must not sit on third-party storage, the gateway model is compatible with that rule.
Will Claude Code MCP servers still work?
Yes. MCP servers run as local processes inside Claude Code. The CLI talks to them over stdio or a local socket and forwards their tool definitions to the model in the same Messages API request it always used. The gateway only sees the model call. Switching the base URL does not change anything about how Claude Code launches or speaks to MCP servers.
Does prompt caching still reduce cost?
Yes. The cache_control markers in the request body and the cache_read_input_tokens and cache_creation_input_tokens usage fields flow through unchanged. The gateway bills cache reads at the lower cached rate, so long sessions with stable system prompts and large repository context still benefit from cache hits.
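To see what cache hits are worth, the usage object on a response can be turned into a dollar figure. This is a sketch under stated assumptions: cache reads are priced at 0.1 times the base input rate and cache writes at 1.25 times, which matches Anthropic's published multipliers at the time of writing but may differ on the gateway, and the token counts are invented for the example.

```python
CACHE_READ_MULTIPLIER = 0.10   # assumption: cache reads at 10% of input rate
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: cache writes at 125% of input rate


def usage_cost(usage, input_rate, output_rate):
    """Dollar cost of one response; rates are dollars per million tokens."""
    uncached = usage.get("input_tokens") or 0
    cache_read = usage.get("cache_read_input_tokens") or 0
    cache_write = usage.get("cache_creation_input_tokens") or 0
    output = usage.get("output_tokens") or 0
    billable_input = (
        uncached
        + cache_read * CACHE_READ_MULTIPLIER
        + cache_write * CACHE_WRITE_MULTIPLIER
    )
    return (billable_input * input_rate + output * output_rate) / 1_000_000


# A 50,000-token input where most of the prefix is served from cache.
usage = {
    "input_tokens": 2_000,
    "cache_read_input_tokens": 45_000,
    "cache_creation_input_tokens": 3_000,
    "output_tokens": 5_000,
}
print(f"${usage_cost(usage, input_rate=5, output_rate=25):.3f}")
```

Under these assumptions the turn costs about $0.176, compared with $0.375 for the same 50,000 input tokens with no cache hits.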
Can I pin a specific model, or will the gateway swap it?
Whatever model you select in Claude Code is the model that runs. There is no silent downgrade and no shadow routing. If the request specifies Opus, the gateway forwards it to Opus. If it specifies Sonnet, Sonnet runs. The gateway also does not inject prompts, trim context, or rewrite tool definitions. This is the difference between a transparent gateway and a cheaper relay; the latter is not worth the saving.
What happens during an Anthropic outage?
The gateway is a thin layer in front of Anthropic. If Anthropic is down, requests fail with the upstream status code surfaced cleanly. The gateway does not pretend to answer with a different model; the failure is honest and Claude Code's normal retry logic applies.
Conclusion
Claude Code is the kind of tool that pays for itself when it works, and that ratio gets a lot better when the per-token cost goes down without anything else changing. The gateway model gives you that lever without forcing you to reorganize your workflow, retrain your team on a new tool, or accept a quietly different agent.
The recipe is small enough to keep in your head:
- Run the install script once per machine.
- Tier the model to the task. Sonnet by default, Opus when the answer is hard, Haiku when the work is mechanical.
- Keep the system prompt and tool definitions stable so prompt caching can amortize the prefix over long sessions.
- Watch the live balance in the status line and adjust as you go.
Everything Claude Code can do directly with Anthropic, it can do through BUZZ AI: extended thinking, streaming tool use, prompt caching, MCP, and both the Anthropic and OpenAI SDK shapes. The only thing that changes is the rate on the meter and where the bill arrives.
Start here: https://buzzai.cc/sh/claudecode.sh. Sixty seconds, one base URL, every feature intact.