One API Key for Claude, GPT, Gemini, and Grok: A Multi-Model Gateway in Practice
Four model families. One key. One base_url. The honest engineering tradeoffs of running production workloads through a unified gateway.
Most teams that ship with LLMs end up using more than one. You start with Claude for long reasoning, add GPT-5 for general chat, bring in Gemini when you need a million tokens of context, and reach for Grok when the task touches recent events. Each addition feels harmless. Then one Tuesday you stop to count the surface area of your stack and realize you are maintaining four SDKs, four billing accounts, four credential rotations, four sets of rate-limit dashboards, and four flavors of "why did this 401 at 3 AM."
This piece is a hands-on look at the alternative: routing every model family through a single gateway. We will keep it concrete with Python, show streaming and tool use, talk about cost, and be honest about the cases where a gateway is the wrong answer.
1. The actual cost of N model providers
Engineers underestimate the recurring cost of multi-provider integrations because the sticker price of each individual API is small. The cost lives somewhere else.
Cognitive load. Each provider has its own SDK shape. Anthropic uses messages.create with system as a top-level parameter and content blocks of typed objects. OpenAI uses chat.completions.create with system as the first message in a flat list. Gemini exposes generateContent with contents and tools nested differently. Tool-use schemas, finish reasons, streaming event types, and even the meaning of temperature are not identical. Switching between them mid-task forces you to swap mental models, not just code.
Operational overhead. Every provider needs its own account creation, its own KYC or payment-method dance, its own organization and project hierarchy, and its own set of API keys with their own rotation schedules. Multiply by the number of environments (dev, staging, prod) and the number of services in your monorepo, and you are looking at a small but persistent ops surface that nobody owns.
Procurement friction. Cards get blocked. Anti-fraud systems flag international charges. Spending limits trip silently. Different providers issue invoices on different cadences in different currencies. Reconciling LLM spend at the end of the quarter becomes a small finance project.
Compliance and audit. When a security review asks "what data flows to which third party," the answer needs to enumerate every provider you have integrated, not just the one you use most. Each is a separate vendor risk row.
None of this is a reason to pick a worse model. It is a reason to centralize the boring parts of access so the model choice can stay free.
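To make the cognitive-load point concrete, here is a minimal sketch of the translation you end up carrying in your head when moving between the OpenAI chat shape and the Anthropic Messages shape. The helper name to_anthropic_payload and the fallback max_tokens value of 1024 are assumptions made for illustration; the sketch only handles plain string content and ignores tools, images, and streaming.

```python
from typing import Any, Dict, List


def to_anthropic_payload(openai_payload: Dict[str, Any]) -> Dict[str, Any]:
    """Convert an OpenAI-style chat payload to the Anthropic Messages shape.

    Hypothetical helper: plain string content only, no tools or images.
    """
    system_parts: List[str] = []
    messages: List[Dict[str, str]] = []
    for msg in openai_payload["messages"]:
        if msg["role"] == "system":
            # Anthropic takes the system prompt as a top-level parameter.
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    payload: Dict[str, Any] = {
        "model": openai_payload["model"],
        # The Messages API requires max_tokens; 1024 is an arbitrary fallback.
        "max_tokens": openai_payload.get("max_tokens", 1024),
        "messages": messages,
    }
    if system_parts:
        payload["system"] = "\n\n".join(system_parts)
    return payload


openai_style = {
    "model": "claude-opus-4-8",
    "max_tokens": 400,
    "messages": [
        {"role": "system", "content": "You are a precise technical assistant."},
        {"role": "user", "content": "Outline a fault-tolerant job queue."},
    ],
}
anthropic_style = to_anthropic_payload(openai_style)
print(anthropic_style)
```

Even this small case moves one field and drops one list entry. Tool calls, stop reasons, and streaming events diverge further, which is where the real maintenance cost accumulates.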
2. What a unified gateway gives you
BUZZ AI Gateway exposes Anthropic Claude, OpenAI GPT, Google Gemini, and xAI Grok behind one HTTPS endpoint and one API key. The contract is intentionally narrow:
- Transparent forwarding. Request and response bodies pass through unchanged. The gateway does not modify your system prompt, inject instructions, silently substitute a cheaper model, or buffer streamed responses. The model you asked for is the model that runs.
- Zero data retention. Prompts and completions are never written to disk, database, or logs. Only billing metadata, such as token counts and model identifiers, is persisted. If a payload never lands, it cannot leak.
- SDK compatibility. The Anthropic and OpenAI SDKs work as-is. You change base_url, and your existing retries, streaming, and tool-use code keep working.
- Unified billing. One ledger, one invoice, one prepaid balance across all four families. Per-model rates are listed at buzzai.cc/api/pricing and run significantly below first-party rates.
- Free model switching. Going from claude-opus-4-8 to gpt-5.4 to a Gemini long-context model is a parameter change, not a refactor.
Concretely, you point your client at one of two endpoints depending on which SDK you want to keep using:
| SDK | base_url |
|---|---|
| Anthropic Python / TypeScript | https://buzzai.cc |
| OpenAI Python / TypeScript | https://buzzai.cc/v1 |
That is the entire integration. Everything below is what you can do once those two lines are in place.
3. Hands-on: the same code, four model families
Let's start with the OpenAI SDK, because most teams already have it installed. The same client object will call GPT, Claude, Gemini, and Grok. The only thing that changes between requests is the value of model.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BUZZ_API_KEY"],
    base_url="https://buzzai.cc/v1",
)

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a precise technical assistant."},
            {"role": "user", "content": question},
        ],
        max_tokens=400,
    )
    return resp.choices[0].message.content

print(ask("claude-opus-4-8", "Outline a fault-tolerant job queue."))
print(ask("gpt-5.4", "Outline a fault-tolerant job queue."))
print(ask("gemini-2.5-pro", "Outline a fault-tolerant job queue."))
print(ask("grok-4", "Outline a fault-tolerant job queue."))
Four providers, four answers, one client object, one bill. There is no per-provider initialization, no second SDK, no second key in your secrets manager.
3.1 Streaming
Streaming is the test that often exposes shallow proxies, because a buffered "stream" defeats the point. BUZZ forwards SSE chunks as they arrive from the upstream provider, so token-by-token UI works exactly as it does on the first-party API.
stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Explain CRDTs in three paragraphs."}],
    stream=True,
)

for event in stream:
    # Some chunks (for example a trailing usage chunk) carry no choices.
    if not event.choices:
        continue
    delta = event.choices[0].delta.content or ""
    print(delta, end="", flush=True)
The same pattern works against claude-sonnet-4-6, gemini-2.5-pro, or any of the Grok models. If you prefer the native Anthropic streaming protocol, point the Anthropic SDK at https://buzzai.cc instead and use messages.stream as you normally would.
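If you want to see what the SDK loop above is consuming, the sketch below parses OpenAI-style chat completion SSE lines by hand using only the standard library. The function name iter_sse_deltas is hypothetical, and the sample lines are hand-written to match the documented chunk shape; they were not captured from the gateway.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # for example, a usage-only chunk
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


# Hand-written sample, for illustration only.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "CRDTs "}}]}',
    'data: {"choices": [{"delta": {"content": "converge."}}]}',
    'data: {"choices": []}',
    "data: [DONE]",
]
streamed_text = "".join(iter_sse_deltas(sample_lines))
print(streamed_text)
```

A parser like this is also a cheap way to check for buffering: if the lines arrive in one burst at the end instead of incrementally, something between you and the model is holding the stream.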
3.2 Tool use / function calling
Tool use is where leaky proxies usually break. BUZZ does not normalize across providers, on purpose. If you call Claude through the Anthropic SDK, you get Anthropic-shaped tool blocks. If you call GPT through the OpenAI SDK, you get OpenAI-shaped function calls. Each upstream's spec is preserved exactly.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-5.1-codex",
    messages=[{"role": "user", "content": "Should I bring an umbrella in Tokyo?"}],
    tools=tools,
)

message = resp.choices[0].message
if message.tool_calls:
    # The model chose to call a tool; arguments arrive as a JSON string.
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    # The model answered directly without calling the tool.
    print(message.content)
For an Anthropic-style tool call against Claude, switch to the Anthropic SDK with base_url="https://buzzai.cc" and use the standard tools=[{"name": "...", "input_schema": {...}}] shape. The gateway forwards both formats untouched.
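Because the two tool formats are not normalized, a team that calls both SDKs usually keeps one tool definition and derives the other. The sketch below converts an OpenAI-style function tool into the Anthropic name / description / input_schema shape. The helper name openai_tool_to_anthropic is hypothetical, and the empty-object fallback schema is an assumption for tools declared without parameters.

```python
from typing import Any, Dict


def openai_tool_to_anthropic(tool: Dict[str, Any]) -> Dict[str, Any]:
    """Map an OpenAI function tool to the Anthropic tool shape."""
    if tool.get("type") != "function":
        raise ValueError("only OpenAI tools of type 'function' are supported")
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Both formats carry a JSON Schema object; only the key name differs.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
anthropic_tool = openai_tool_to_anthropic(openai_tool)
print(anthropic_tool)
```

The schema body is identical in both shapes, so the conversion is mechanical. The parts that differ at runtime, how the model's call is returned and how you send the tool result back, still follow each SDK's own conventions.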
3.3 Claude Code in one shell command
If you live in Claude Code, the fastest way to point it at the gateway is the install script:
curl -fsSL https://buzzai.cc/sh/claudecode.sh | sh
It writes the right environment variables and base URL so claude talks to the gateway with your BUZZ key. Source: buzzai.cc/sh/claudecode.sh.
4. Choosing the right model per task
The point of carrying multiple families is not novelty. It is fitness for purpose. Treat the model name as a runtime parameter and route per task. A defensible default mapping looks something like this. The exact picks belong in your config file, not in your application code.
| Task shape | Strong default | Why |
|---|---|---|
| Long reasoning, agentic loops, hard refactors | claude-opus-4-8 / claude-opus-4-6 | Opus-tier models stay coherent over long horizons and many tool turns. |
| Balanced everyday chat, RAG, drafting | claude-sonnet-4-6 / gpt-5.4 | Sonnet- and GPT-5-class models give a strong quality-to-cost ratio. |
| High-volume classification, extraction, fanout | claude-haiku-4-5 / gpt-5.4-mini | Smaller models keep latency and unit cost low when you call them millions of times. |
| Code generation, refactor, repo-scale edits | gpt-5.1-codex, gpt-5.2-codex, gpt-5.3-codex, gpt-5.1-codex-max | The Codex line is tuned for programming tasks and tool-driven edits. |
| Very long context, multi-document synthesis | Gemini long-context models | Gemini's window is the practical choice when you need to feed in a whole book or a large repo. |
| Tasks that benefit from recent web context | Grok | Grok is positioned around freshness; useful when recency matters for the answer. |
The full live list of supported model IDs lives at buzzai.cc/models. Treat the table above as a starting heuristic, not a benchmark claim. Run your own evals on your own data before locking in a default.
One way to put this into practice is a route(task_kind) function that maps task kinds (chat, draft, classify, refactor, summarize-long) to model IDs. Keep the mapping in config. When a new model ships, you change one row, not every call site.
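A minimal sketch of that idea, assuming the mapping lives in a JSON document. The config is inlined as a string here so the example runs on its own; in practice it would come from a file or a config service. The names ROUTES_JSON, load_routes, and route are illustrative, and the model picks are placeholders, not recommendations.

```python
import json
from typing import Dict

# Inlined for the example; normally loaded from a config file.
ROUTES_JSON = """
{
  "chat": "claude-sonnet-4-6",
  "draft": "gpt-5.4",
  "classify": "claude-haiku-4-5",
  "refactor": "gpt-5.1-codex",
  "summarize-long": "gemini-2.5-pro"
}
"""


def load_routes(text: str) -> Dict[str, str]:
    """Parse and validate a task-kind to model-ID mapping."""
    routes = json.loads(text)
    if not isinstance(routes, dict):
        raise ValueError("routes config must be a JSON object")
    for kind, model in routes.items():
        if not isinstance(model, str) or not model:
            raise ValueError(f"route {kind!r} must map to a non-empty model ID")
    return routes


def route(task_kind: str, routes: Dict[str, str]) -> str:
    """Return the configured model ID, failing loudly on unknown task kinds."""
    try:
        return routes[task_kind]
    except KeyError:
        raise ValueError(f"no model configured for task kind {task_kind!r}") from None


routes = load_routes(ROUTES_JSON)
print(route("classify", routes))
```

Failing loudly on an unknown task kind is a deliberate choice in this sketch: a silent fallback to some default model would hide config mistakes and make spend harder to attribute.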
5. When NOT to use a multi-model gateway
A gateway is a tool. It is not the right tool for every team.
- You only ever use one provider. If your roadmap is "GPT-5 forever" or "Claude only," the second hop is dead weight. Go direct.
- You need the absolute lowest single-hop latency. Any proxy adds a network hop. For most workloads it is dominated by token-generation time, but if you are chasing every millisecond of TTFT, direct is direct.
- You already have a negotiated enterprise contract. If your company has committed-use credits, custom data-handling addenda, or volume discounts with a provider, that contract may already beat any gateway price and may also constrain where data can go.
- You require a region or compliance boundary the gateway does not offer. If your regulator says "data must terminate in jurisdiction X with vendor Y," respect that. A gateway cannot rewrite a regulatory perimeter.
- You depend on a provider-specific feature that has not been exposed via the gateway yet (a brand-new beta endpoint, for example). Direct integration may be temporarily simpler.
For the common case of "I am building a product, I want to compare and route across families, and I do not want to be in the API-key management business," a gateway is the right call. For the cases above, going direct is the honest answer and we will tell you that.
6. What you do not have to give up
Switching to a gateway is sometimes pitched as a tradeoff between convenience and trust. With BUZZ specifically, the trust questions have narrow answers.
- Will my prompts be used to train models? No. They are not stored, so they cannot be.
- Will my system prompt be modified? No. The forwarder does not rewrite the body.
- Will the gateway pick a different model than the one I asked for? No silent substitution. claude-opus-4-8 means claude-opus-4-8.
- Will streaming be buffered? No. SSE chunks are forwarded as they arrive.
- Will my tool-call shape be normalized? No. You get the upstream provider's exact format.
The general principle is "do less." A gateway that does less is a gateway that breaks fewer things and surprises you fewer times.
7. A worked example: routing inside one application
Here is a small but realistic pattern. The application has five task classes: a fast classifier in front of an inbox, a balanced drafter for replies, a deep reviewer for hard reasoning, a reader for very long documents, and a handler for recency-sensitive questions. Five jobs, four model families, one client.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BUZZ_API_KEY"],
    base_url="https://buzzai.cc/v1",
)

ROUTES = {
    "classify": "claude-haiku-4-5",  # cheap, fast, high volume
    "draft": "gpt-5.4",              # balanced general chat
    "review": "claude-opus-4-8",     # deep reasoning
    "long_doc": "gemini-2.5-pro",    # very long context
    "fresh": "grok-4",               # recency-sensitive answers
}

def run(task: str, prompt: str, max_tokens: int = 800) -> str:
    model = ROUTES[task]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content
Two properties worth calling out. First, adding a sixth task class is a single dictionary entry, not a new SDK install. Second, when next quarter's evals say claude-sonnet-4-6 beats gpt-5.4 on the drafter, that is a one-line change in ROUTES with no application surgery.
The same idea works in TypeScript with the OpenAI Node SDK, in Go with any of the OpenAI-compatible clients, or directly over fetch if you would rather not pull in a dependency at all. The wire format is the contract; the client is just convenience.
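As a dependency-free illustration of "the wire format is the contract," the sketch below builds the HTTP request with Python's standard library. It assumes the gateway accepts the same path and bearer-token header that the OpenAI SDK sends when pointed at the /v1 base URL. The function name build_chat_request and the key value are placeholders. The request is constructed but not sent, so the example runs offline.

```python
import json
import urllib.request


def build_chat_request(
    api_key: str,
    model: str,
    prompt: str,
    base_url: str = "https://buzzai.cc/v1",
    max_tokens: int = 400,
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder key; read the real one from the environment or a secret manager.
req = build_chat_request("sk-example", "gpt-5.4", "Explain CRDTs briefly.")
print(req.get_method(), req.full_url)

# To actually send it (network call, so left commented out):
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes the payload easy to unit-test and easy to port to another language's HTTP client.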
8. A minimal production checklist
- Put BUZZ_API_KEY in a secret manager, not in the repo.
- Pin base_url to https://buzzai.cc/v1 (OpenAI SDK) or https://buzzai.cc (Anthropic SDK) in a single config file.
- Keep model as a config value, not a constant. Route per task class.
- Set per-call timeouts and bounded retries on the client side; the gateway does not retry for you.
- Track token usage from the response body. Aggregate by model and by task class. This is the data you will want when deciding whether a smaller model would do.
- For high-volume jobs, prefer streaming so you can cut off generations early when a structured stop condition is met.
- Re-evaluate model defaults on a quarterly cadence. The frontier moves; your config should too.
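For the "bounded retries" item in the checklist, here is one possible shape: a small wrapper with exponential backoff and jitter. The function name with_retries, the default exception types, and the delay values are all assumptions chosen for illustration. Which errors are safe to retry depends on your client library and on whether the call is idempotent.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying a bounded number of times on selected errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # budget exhausted: surface the last error
            delay = base_delay * (2 ** attempt)
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


# Demo with a fake call that fails twice, and a no-op sleep so it runs instantly.
outcomes = iter([TimeoutError("slow"), ConnectionError("reset"), "ok"])


def fake_call() -> str:
    item = next(outcomes)
    if isinstance(item, BaseException):
        raise item
    return item


demo_result = with_retries(fake_call, attempts=3, sleep=lambda seconds: None)
print(demo_result)
```

The official OpenAI Python client also accepts timeout and max_retries arguments on the constructor, which may be all you need. A wrapper like this is mainly useful when you want one retry policy across several clients.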
9. FAQ
Is the gateway a real proxy, or does it transform my prompts?
It is a transparent forwarder. Request and response bodies pass through unchanged. No system-prompt rewriting, no instruction injection, no silent model substitution, no buffering of streamed responses.
Do you store my prompts or completions?
No. BUZZ runs with zero data retention on payloads. Prompts and completions are not written to disk, database, or logs. The only thing kept is billing metadata: token counts, model IDs, and the operational fields needed to invoice correctly.
Can I keep using the official Anthropic and OpenAI SDKs?
Yes. Change base_url. For the Anthropic SDK, point to https://buzzai.cc. For the OpenAI SDK, point to https://buzzai.cc/v1. Existing retry, streaming, and tool-use code keeps working.
How does pricing compare to first-party rates?
Prices on BUZZ run significantly below the first-party rate cards across the supported families. Live per-model rates are published at buzzai.cc/api/pricing and update with the upstream models.
How do I switch from Claude to GPT or Gemini in my app?
Change the value you pass as model. The base URL, key, headers, and request shape stay the same when you stay within one SDK. Most teams keep the model ID in config so swapping is a deploy-free change.
Does streaming work end-to-end?
Yes. The gateway forwards SSE events directly from the upstream provider. Token-by-token UIs behave the same as they would on the first-party API.
What about tool use and function calling?
Tool use is forwarded in the upstream provider's native shape. Anthropic-style tool blocks and OpenAI-style function calls are passed through without re-encoding.
What latency overhead does the gateway add?
The added latency is one network hop plus the gateway's forwarding work. For typical generation workloads this is a small fraction of total response time, which is dominated by token generation on the upstream model. If you are running a latency-sensitive single-token classifier, benchmark before committing.
How do I monitor usage across models?
The dashboard at buzzai.cc shows per-model token consumption and spend. Each response also returns native usage fields, so you can ship usage events to your own observability stack and aggregate across models in your own database.
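A sketch of the do-it-yourself side of that answer: an in-memory aggregator keyed by task class and model. It accepts either OpenAI-style usage keys (prompt_tokens, completion_tokens) or Anthropic-style keys (input_tokens, output_tokens), since each SDK reports usage in its native shape. The names new_totals and record_usage are hypothetical, and the token counts in the demo are made up.

```python
from collections import defaultdict
from typing import Dict, Mapping, Tuple

UsageTotals = Dict[Tuple[str, str], Dict[str, int]]


def new_totals() -> UsageTotals:
    """Create an empty aggregate keyed by (task, model)."""
    return defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0, "calls": 0})


def record_usage(
    totals: UsageTotals, task: str, model: str, usage: Mapping[str, int]
) -> None:
    """Add one response's usage to the running totals."""
    # OpenAI-style keys first, then Anthropic-style keys as a fallback.
    input_tokens = usage.get("prompt_tokens", usage.get("input_tokens", 0))
    output_tokens = usage.get("completion_tokens", usage.get("output_tokens", 0))
    row = totals[(task, model)]
    row["input_tokens"] += input_tokens
    row["output_tokens"] += output_tokens
    row["calls"] += 1


totals = new_totals()
# Made-up numbers, for illustration only.
record_usage(totals, "classify", "claude-haiku-4-5", {"input_tokens": 120, "output_tokens": 8})
record_usage(totals, "classify", "claude-haiku-4-5", {"input_tokens": 95, "output_tokens": 6})
record_usage(totals, "draft", "gpt-5.4", {"prompt_tokens": 300, "completion_tokens": 210})

for (task, model), row in sorted(totals.items()):
    print(task, model, row)
```

With the OpenAI Python client, the usage object on a response can typically be turned into a plain mapping before being passed in. In production you would flush these rows to your metrics store instead of holding them in memory.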
What happens if an upstream provider has an incident?
The gateway surfaces the upstream error transparently. We do not silently retry against a different model, because that would change the contract you wrote your code against. If you want cross-family failover, that is a policy decision and belongs in your application's routing layer.
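If you do decide that cross-family failover is the right policy, one possible shape for it in the application layer is sketched below. The function complete_with_failover, the AllModelsFailed exception, and the injected call_model callable are all hypothetical. The demo uses a fake caller so it runs without network access.

```python
from typing import Callable, List, Sequence, Tuple, Type


class AllModelsFailed(RuntimeError):
    """Raised when every model in the failover chain has errored."""


def complete_with_failover(
    prompt: str,
    models: Sequence[str],
    call_model: Callable[[str, str], str],
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError),
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, text) from the first success."""
    if not models:
        raise ValueError("models must not be empty")
    failures: List[str] = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except retry_on as exc:
            # Record and move on; other exception types propagate immediately.
            failures.append(f"{model}: {exc!r}")
    raise AllModelsFailed("; ".join(failures))


def fake_call_model(model: str, prompt: str) -> str:
    """Stand-in for a real client call: the first model is 'down'."""
    if model == "claude-opus-4-8":
        raise ConnectionError("simulated upstream incident")
    return f"[{model}] answer to: {prompt}"


used, text = complete_with_failover(
    "Summarize the incident.", ["claude-opus-4-8", "gpt-5.4"], fake_call_model
)
print(used, text)
```

Returning the model that actually answered matters: different families can produce differently shaped or differently behaved output, so downstream code and your usage metrics should know which one ran. With the OpenAI Python client, retry_on would more likely be its own connection and timeout error classes than the built-in ones used here.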
10. Conclusion
A multi-model gateway is not magic. It is a small, boring piece of infrastructure that takes the part of LLM integration that should be commoditized (access, billing, and key management) and makes it commoditized. Everything else (the model choice, the prompt design, the eval suite) stays in your hands where it belongs.
If you are already running two or more model families in production, the question is no longer whether to centralize access. It is whether you want to keep paying the recurring cost of N integrations every quarter. If you are still on one, treat the gateway as optionality: it costs you nothing to be ready to add the next family on the day it ships.
Start at buzzai.cc, check live rates at buzzai.cc/api/pricing, and see the full model list at buzzai.cc/models.