
Zero-Retention LLM Gateways: Why Enterprises Need Forwarders That Forget

Most LLM relays are quietly storing your prompts. For consumer toys that is annoying. For an enterprise shipping production traffic, it is a regulatory landmine, an attack surface, and a slow leak of your competitive moat. Here is what zero retention actually means, and how to tell whether your gateway is bluffing.

By BUZZ AI Gateway · 10 min read · Updated May 22, 2026

If you put any kind of routing layer between your application and an LLM provider, you have just introduced a new place where your customer data can live. That layer can be a homemade reverse proxy, an open-source router, or a managed AI gateway. From the perspective of GDPR, SOC 2, HIPAA, or your own incident-response playbook, the question is the same: does the middlebox keep what passes through it?

Most do. They keep prompts in request logs for debugging. They keep completions in caches to reduce upstream cost. They keep both in analytics warehouses to build dashboards. They keep snapshots in error trackers when something throws. Some of this is innocent engineering convenience. All of it is, from a compliance posture, the same thing: a vendor now holds copies of your customers' inputs.

A zero-retention gateway is the opposite design. The middlebox forwards bytes upstream, forwards bytes back, and forgets. The only thing that survives the request is the meter reading. This article walks through what that actually requires, why retention is dangerous in production, and how to verify a vendor's claim before you route real traffic through them.

What "zero retention" should actually mean

The phrase has been worn smooth. Marketing pages say "we don't retain your data" while the underlying system writes prompts to a 30-day log bucket and calls that "operational." A useful definition has to be sharper.

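Zero retention, as a technical guarantee, should mean that the request body and the response body are never written to any persistent medium. They exist in process memory for the lifetime of the upstream call, and they are released the moment the response completes. Specifically, a zero-retention gateway does not write bodies to request logs, does not cache completions for reuse, does not copy bodies into analytics warehouses, does not attach them to error-tracker snapshots, and does not forward them to third-party analytics or moderation pipelines.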

If a vendor's retention page is silent on any of these, treat that silence as a "yes." Vendors that genuinely avoid these practices tend to say so explicitly, because it is exactly what enterprise buyers ask about.

It is also worth being explicit about what zero retention is not. It is not the same as a "no training" policy. A no-training policy says: we will not feed your data into our models. Zero retention says: we will not store your data, full stop. The first is weaker. It still permits storage, human review, internal analytics, and breach exposure. The second eliminates an entire category of risk by removing the data from the threat surface in the first place.

Why retention is dangerous in production

"We log prompts for debugging" sounds harmless until you sit down with your privacy counsel. There are at least four ways that retention turns into a liability the moment it leaves your perimeter.

1. Regulatory exposure (GDPR, SOC 2, HIPAA, CCPA)

Most enterprise privacy frameworks are built around a principle of data minimization: collect and keep only what you need, for as long as you need it. When a third-party gateway stores prompts, those prompts are now processed by a sub-processor your customers may not have consented to. Under GDPR Article 28, you need a data processing agreement with that sub-processor and a defensible accounting of what they hold. Under SOC 2, the gateway becomes part of your privacy and confidentiality boundary, and the auditor will want to know how prompt data is protected, retained, and disposed. Under HIPAA, if any prompt could contain PHI, the gateway is a business associate and needs a BAA. Each of these gets dramatically simpler when the answer is "the gateway never stores the body in the first place."

2. Downstream liability for your customers' data

If your application accepts user input and passes any portion of it into an LLM call, your gateway is now a custodian of your users' content. A breach at the gateway is a breach of your customers' data, even though you did not write the storage layer that lost it. You will be the one notifying your users, not the gateway vendor. Reducing the gateway's retention surface is one of the cheapest ways to reduce your blast radius.

3. Prompt content as an attack surface

A retained prompt is a prompt that can be exfiltrated. Most LLM-driven applications include private system prompts, retrieval context, and user-supplied data inside the same request body. If a gateway stores those bodies, a single credential leak, SSRF bug, or insider event at the vendor turns into disclosure of every system prompt and every retrieval snippet that ever passed through. There is no way to "rotate" a prompt the way you rotate an API key. Once the corpus is out, it is out.

4. Competitive and IP loss

Your prompts encode your business logic. The exact phrasing of your evaluator prompt, the shape of your tool-calling protocol, the few-shot examples you spent months tuning, the order in which you inject retrieval results: these are the moat. They are also the thing a retained log makes trivial to copy. A gateway that holds your bodies is a gateway that holds your IP. Treat that the way you would treat handing your training data to a third party.

The architecture of a zero-retention gateway

Zero retention is an architectural property, not a policy bullet. Policies change. Architectures persist. The shape of a correctly built forwarder looks like this:

      client                  gateway                upstream LLM
        |                        |                         |
        |  POST /v1/messages     |                         |
        |  (request body)        |                         |
        |--------------------->  |                         |
        |                        |  open upstream conn     |
        |                        |  stream body through    |
        |                        |-----------------------> |
        |                        |                         |
        |                        |  <-- response chunks    |
        |                        |  pass-through to client |
        |  <------- chunks ----  |                         |
        |                        |                         |
        |                        |  on stream end:         |
        |                        |  parse usage metadata   |
        |                        |  write 1 billing row    |
        |                        |  drop body buffer       |
        v                        v                         v
                            [ memory only ]          [ provider keeps
                            [ no disk, no DB ]         per its own ToS ]

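A few properties of this design matter. The body is streamed through rather than buffered to disk. The only write is a single billing row, built from the usage metadata the provider returns when the stream ends. The body buffer is dropped as soon as the stream closes. And the guarantee comes from the structure itself:

Architecture, not policy. Policies can be relaxed under pressure. An engineer adding "just one line of debug logging" to chase down a bug can quietly break a policy-based guarantee. An architectural guarantee fails closed: there is no code path that writes the body and no existing sink to point it at, so retaining bodies would mean building new storage, not adding a log line.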
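The lifecycle in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not BUZZ's implementation: `forward_stream` and its callbacks are hypothetical names, the upstream is any iterable of byte chunks, and the sketch assumes each chunk is a complete event that the caller-supplied `usage_of` can inspect. The point is structural: each chunk is handed to the client and dropped, and the only write is one billing row at stream end, flagged as partial if the stream aborts.

```python
from typing import Callable, Iterable, Optional


def forward_stream(
    upstream_chunks: Iterable[bytes],
    send_to_client: Callable[[bytes], None],
    usage_of: Callable[[bytes], Optional[dict]],
    write_billing_row: Callable[[dict], None],
    user_id: str,
    model: str,
) -> None:
    """Relay upstream chunks to the client without keeping them.

    Hypothetical sketch: no chunk is appended to a list, written to a
    file, or logged. Only integer usage counters survive the loop.
    """
    usage = {"input_tokens": 0, "output_tokens": 0}
    status = "ok"
    try:
        for chunk in upstream_chunks:
            send_to_client(chunk)        # pass-through, no copy kept
            counts = usage_of(chunk)     # inspected in memory only
            if counts:
                usage.update(counts)     # keep numbers, never text
    except Exception:
        status = "partial"               # stream aborted mid-flight
        raise
    finally:
        # The single persistent write: metadata only, no body fields.
        write_billing_row({"user_id": user_id, "model": model,
                           "status": status, **usage})
```

Note that there is no variable that accumulates the body; a reviewer can confirm the property by reading one loop.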

What is retained, and why

"Zero retention of bodies" is not "zero retention of everything." A gateway has to retain enough metadata to bill, to enforce rate limits, and to give you usage reporting. The honest answer is that the following operational metadata is kept:

Field | Why it must be kept
--- | ---
model | Different models have different unit prices. The model name is needed to compute the bill.
input_tokens | The integer count of tokens in the request, parsed from the upstream usage metadata. Required to invoice.
output_tokens | Same, for the completion side. Required to invoice.
cache_read_tokens / cache_write_tokens | For providers with prompt caching, separate counters are needed because they are priced differently.
timestamp | Required for daily aggregation, rate limiting, and audit trail.
user_id | The authenticated identity that owns the API key. Required to attribute usage to an account.
http_status | Whether the upstream call succeeded, was rate-limited, or errored. Required to handle refunds and retries correctly.
latency_ms | End-to-end transport latency. Useful for SLA reporting. Not derived from body content.

Notice what is not on this list: prompts, completions, system messages, tool definitions, tool call arguments, retrieved documents, attachments, or any text content. None of that is needed to bill, and a zero-retention gateway does not collect what it does not need.
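One way to make that list enforceable in code is to give the billing record exactly these fields and nothing else. The sketch below assumes a frozen dataclass, so assigning any attribute after construction (including a stray `prompt`) raises an error, and prices a row from its integer counters alone. The model name and per-million-token prices are made-up placeholders, not real rates.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BillingRow:
    """The only record that outlives a request; there is no body field."""
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    timestamp: str        # e.g. an ISO 8601 string
    user_id: str
    http_status: int
    latency_ms: int


# Hypothetical USD prices per million tokens; placeholders only.
PRICES = {
    "example-model": {"input": Decimal("3"), "output": Decimal("15"),
                      "cache_read": Decimal("0.30"), "cache_write": Decimal("3.75")},
}


def charge(row: BillingRow) -> Decimal:
    """Price a request from counters alone; no prompt text is needed."""
    p = PRICES[row.model]
    total = (row.input_tokens * p["input"]
             + row.output_tokens * p["output"]
             + row.cache_read_tokens * p["cache_read"]
             + row.cache_write_tokens * p["cache_write"])
    return total / Decimal(1_000_000)
```

Decimal keeps the invoice arithmetic exact, which matters once many small per-request charges are summed.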

Token counts in isolation are integers. They reveal roughly how long a prompt or completion was, but they cannot be reversed into its text. Aggregate billing metrics like token counts are generally treated as operational telemetry rather than personal data. The combination with a user ID may fall under your own privacy policy, but the prompt itself, the part that carries the actual regulatory weight, is gone.

How to verify a gateway's retention claims as a customer

Marketing claims are cheap. Before you route production traffic through any gateway, run a verification checklist. Several of these steps need no cooperation from the vendor at all, and together they will tell you more than a glossy compliance page.

Read the Terms of Service line by line

Search the ToS for the words retain, store, log, cache, archive, analytics, improve, and training. Look for retention windows. "We retain request data for 30 days" is not zero retention. "We retain request data for as long as necessary to provide the service" is also not zero retention; it is a blank check. The phrase you want is closer to "request and response bodies are not written to persistent storage."
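A first pass over a vendor's terms can be scripted. The sketch below is a rough triage aid that assumes you have the ToS as plain text. It flags sentences containing the keywords listed above and any explicit retention window such as "30 days". The keyword list, the naive sentence splitter, and the regex are illustrative choices; a match means "read this sentence closely," not "this vendor retains data."

```python
import re

KEYWORDS = ("retain", "store", "log", "cache", "archive",
            "analytics", "improve", "training")
# Matches explicit windows like "30 days" or "12 months".
WINDOW = re.compile(r"\b\d+\s*(?:day|week|month|year)s?\b", re.IGNORECASE)


def flag_sentences(tos_text: str) -> list[tuple[str, list[str]]]:
    """Return (sentence, reasons) pairs worth reading closely."""
    flagged = []
    # Naive split on terminal punctuation; good enough for triage.
    for sentence in re.split(r"(?<=[.!?])\s+", tos_text):
        lowered = sentence.lower()
        # Prefix match, so "retain" also catches "retained", "log" catches "logs".
        reasons = [k for k in KEYWORDS if re.search(rf"\b{k}", lowered)]
        if WINDOW.search(sentence):
            reasons.append("retention window")
        if reasons:
            flagged.append((sentence.strip(), reasons))
    return flagged
```

Keyword matching misses paraphrases ("persistent storage" does not match "store"), so the script narrows the reading list; it does not replace reading the document.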

Ask for the data flow diagram

Any vendor serious about this will be able to send you a diagram showing where bytes go between ingress and egress. Look for forks: every arrow that branches off the main forward path is a place where retention can hide. A clean zero-retention diagram has exactly two paths, request and response, with one side branch for billing metadata.
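If the vendor does send a diagram, it can help to transcribe it into a list of arrows, each labelled with what it carries, and check the rule from the paragraph above mechanically: body bytes may travel only along the request and response path, and only metadata may branch off. The node names, the "body"/"metadata" labels, and `find_body_forks` are all hypothetical; this is a bookkeeping aid for a review, not a verification of the vendor's real system.

```python
# Each edge: (source, destination, payload); payload is "body" or "metadata".
FORWARD_PATH = {("client", "gateway"), ("gateway", "upstream"),
                ("upstream", "gateway"), ("gateway", "client")}


def find_body_forks(edges: list[tuple[str, str, str]]) -> list[tuple[str, str]]:
    """Return every edge that carries body bytes off the forward path."""
    return [(src, dst) for src, dst, payload in edges
            if payload == "body" and (src, dst) not in FORWARD_PATH]
```

A clean diagram yields an empty list; every tuple returned is a question to put to the vendor.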

Run a negative test

Send a request through the gateway containing a unique, searchable token, something like ZRTEST-7c4a9b2f-please-do-not-store-this. Then look for that token everywhere the vendor exposes data: the dashboard, account export endpoints, support chats, error tracker integrations, sample webhooks. If the token shows up anywhere except the upstream provider's own logs, the vendor is retaining bodies.
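The canary can be generated and searched for with the Python standard library. In this sketch, `make_canary` and `find_canary` are illustrative names, and collecting the vendor's exports (dashboard downloads, support transcripts, webhook samples) into a dictionary of text is left to you. A random suffix keeps the marker unique, so any hit is almost certainly your test request rather than a coincidence.

```python
import secrets


def make_canary(prefix: str = "ZRTEST") -> str:
    """Unique, grep-friendly marker to embed in a test prompt."""
    return f"{prefix}-{secrets.token_hex(4)}-please-do-not-store-this"


def find_canary(canary: str, exports: dict[str, str]) -> list[str]:
    """Names of vendor-exposed sources whose text contains the canary."""
    return [name for name, text in exports.items() if canary in text]
```

Generate a fresh canary for each test run, so a hit cannot be explained by an earlier, deliberately logged experiment.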

Capture the wire

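Run the same request twice: once directly against the upstream provider, once through the gateway. The gateway's connection to the provider is encrypted and runs on infrastructure you do not operate, so you usually cannot capture its outbound traffic yourself. If the vendor lets you point the gateway at an endpoint you control, or offers a controlled tap on its egress, you can confirm byte-for-byte that the body it forwards matches the body you sent. Short of that, compare the input token counts reported for the two runs: identical requests should produce identical counts, and an injected system prompt inflates them. Any divergence, including injected system prompts, header rewrites that touch content, or response rewriting that changes the payload, is a sign of a non-transparent forwarder.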
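If you can observe both the payload you sent and the payload that reached the upstream, parsing and comparing them field by field avoids false alarms from key order and whitespace, which do not change what the model receives. A minimal sketch (`payload_diff` is a hypothetical helper) that reports top-level fields added, removed, or changed:

```python
import json


def payload_diff(sent: bytes, forwarded: bytes) -> dict[str, list[str]]:
    """Compare two JSON request bodies by top-level field.

    Parsing first means harmless re-serialization (key order,
    whitespace) is ignored, while an injected or rewritten field,
    such as an added "system" prompt, is reported.
    """
    a, b = json.loads(sent), json.loads(forwarded)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

A non-empty "added" list, for example a "system" field you never sent, is exactly the injected-prompt case.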

Check the SOC 2 / pen-test summary

For a serious vendor, ask for the SOC 2 Type II report or a recent third-party penetration test summary. Look at the system description for the components in scope. If the description lists log aggregation systems, analytics warehouses, or message queues that touch request bodies, that is your retention surface. If it does not, that is a strong signal.

Watch for "unless required for security" carve-outs

Some vendors say they do not retain "except as required for abuse prevention." This often means a sampling of traffic is mirrored to a separate moderation system. That can be acceptable, but only if the moderation pipeline is itself zero-retention and runs in memory. Ask explicitly: where does the moderation copy go, and how long does it live?

BUZZ in this picture

BUZZ AI Gateway is built on the architecture above. One API key reaches Claude, GPT, Gemini, and Grok. Request and response bodies are streamed through; they are never written to logs, never persisted to a database, never cached for replay across users, and never forwarded to a third-party analytics or moderation pipeline. The only data that survives a request is the billing row: model name, token counts, timestamp, user ID, status, and latency.

Integration is a one-line change. Point the Anthropic SDK at https://buzzai.cc or the OpenAI SDK at https://buzzai.cc/v1. The wire protocol is identical to the upstream provider's, because BUZZ does not modify it.

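from anthropic import Anthropic
from openai import OpenAI

# Anthropic SDK: the SDK appends /v1/messages, so calls go to https://buzzai.cc/v1/messages
anthropic_client = Anthropic(
    base_url="https://buzzai.cc",
    api_key="sk-buzz-...",  # BUZZ key, not an Anthropic key
)

# OpenAI SDK: the /v1 prefix belongs in the base URL
openai_client = OpenAI(
    base_url="https://buzzai.cc/v1",
    api_key="sk-buzz-...",  # the same BUZZ key
)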

Pricing for every supported model is enumerated at buzzai.cc/api/pricing, and the live model catalog with capabilities is at buzzai.cc/models. Both endpoints are public and machine-readable, so you can wire them into your own provisioning logic without screen-scraping a marketing page.

The short version

Zero retention means the bytes that carry your customers' data never land on disk inside the gateway. It is not a slogan; it is an architectural property you can probe with a negative test, a wire capture, and the vendor's audit scope. If your relay can show you "recent prompts" in a dashboard, it has retention. If it cannot, and never will, it has the property you actually want.


Frequently asked questions

What does zero retention actually mean for an LLM gateway?

Zero retention means the request and response bodies are never written to any persistent medium. They live only in process memory while the request is in flight, and they are released as soon as the upstream model finishes responding. No log files, no databases, no analytics pipelines, no debug snapshots.

Are token counts considered personally identifiable information (PII)?

Token counts in isolation are integers without semantic content. They cannot be reversed into prompt text. Aggregate billing metrics like token counts are generally treated as operational telemetry, not PII. The combination of token count plus user ID may fall under your own privacy policy, but the prompt content itself is what carries regulatory weight.

What about debug logs? How do you investigate failures without prompts?

A zero-retention gateway logs the metadata required to diagnose transport-level issues: HTTP status codes, upstream latency, error categories, request IDs returned by the provider. It does not log prompt or completion content. If a failure mode genuinely cannot be diagnosed from metadata alone, the engineering response is to add a structured error category, not to start retaining bodies.
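A minimal sketch of that approach, assuming a hypothetical category scheme: failures are mapped to a coarse, structured category, and the log record is built only from transport metadata, so there is no parameter through which prompt or completion text could reach the log. The category names and the `request_id` argument are illustrative; real providers return their own request ID headers.

```python
import json
import logging

logger = logging.getLogger("gateway.transport")


def error_category(http_status: int) -> str:
    """Coarse, content-free classification of an upstream outcome."""
    if http_status == 429:
        return "rate_limited"
    if http_status in (401, 403):
        return "auth"
    if 400 <= http_status < 500:
        return "client_error"
    if http_status >= 500:
        return "upstream_error"
    return "ok"


def log_transport_event(http_status: int, latency_ms: int, request_id: str) -> dict:
    """Emit a metadata-only log record; the function never sees a body."""
    record = {"category": error_category(http_status), "http_status": http_status,
              "latency_ms": latency_ms, "request_id": request_id}
    logger.info(json.dumps(record))
    return record
```

When a new failure mode shows up, the fix is a new branch in `error_category`, not a new body field.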

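How does billing work if the response is streamed and the body is dropped?

Token counts arrive in the upstream response's usage metadata: the usage object of a non-streaming response, or the usage fields carried in the SSE stream's events (for example, Anthropic's message_start and final message_delta events, or OpenAI's final chunk when stream_options.include_usage is set). The gateway parses those counters as the stream completes and writes a single billing row. The body itself is never persisted. If the stream aborts mid-flight, the gateway records partial usage with a status flag so refunds and retries can be handled cleanly.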
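For the Anthropic Messages API, for example, the input token count appears in the `message_start` event and the cumulative output count in the final `message_delta` event. The sketch below pulls only those integers out of raw SSE text. It assumes each `data:` line holds one complete JSON event, handles only the Anthropic event shapes, and ignores cache counters and other providers' formats, so treat it as an illustration rather than a production parser.

```python
import json


def usage_from_sse(stream_text: str) -> dict[str, int]:
    """Extract token counters from Anthropic-style SSE, discarding content."""
    usage = {"input_tokens": 0, "output_tokens": 0}
    for line in stream_text.splitlines():
        if not line.startswith("data:"):
            continue  # skip "event:" lines and blank separators
        event = json.loads(line[len("data:"):])
        if event.get("type") == "message_start":
            start = event["message"].get("usage", {})
            usage["input_tokens"] = start.get("input_tokens", 0)
            usage["output_tokens"] = start.get("output_tokens", 0)
        elif event.get("type") == "message_delta":
            # output_tokens here is cumulative for the whole response
            delta_usage = event.get("usage", {})
            usage["output_tokens"] = delta_usage.get("output_tokens", usage["output_tokens"])
        # content_block_delta events carry text; they are parsed and dropped
    return usage
```

Only the returned dictionary of integers needs to outlive the stream.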

Does zero retention prevent caching across users?

Yes. A correct zero-retention forwarder never caches a completion and replays it to a different user. Cross-user caching reuses one user's prompt context to serve another, which is a confidentiality leak even when the cache is in memory. If a gateway advertises "cache hits across customers" as a feature, it is not zero retention.

How is zero retention different from a no-training policy?

A no-training policy says "we will not feed your data into model training." Zero retention says "we will not store your data in the first place." No-training is weaker because it permits storage, human review, internal analytics, and breach exposure. Zero retention removes the data from the threat surface entirely.

Can a zero-retention gateway support content moderation or safety filters?

Yes, but the filter must run in memory, on the same request lifecycle, and emit only a verdict (allow / block / category). The filter must not persist the inspected text. Many compliance teams treat in-memory safety classification as acceptable as long as the inspected content does not survive the request.
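A sketch of that contract, with a hypothetical keyword blocklist standing in for a real safety classifier: the function receives the text, returns only a small verdict object, and keeps no reference to the input after it returns. What matters here is the return type, not the classifier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical blocklist standing in for a real safety classifier.
BLOCKED_TERMS = {"example-banned-term": "policy_violation"}


@dataclass(frozen=True)
class Verdict:
    action: str               # "allow" or "block"
    category: Optional[str]   # why it was blocked, never the text itself


def moderate(text: str) -> Verdict:
    """Classify in memory and return only a verdict; the text is not stored."""
    lowered = text.lower()
    for term, category in BLOCKED_TERMS.items():
        if term in lowered:
            return Verdict("block", category)
    return Verdict("allow", None)
```

Because the verdict has no text field, logging or persisting it cannot leak the inspected content.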

Does zero retention help with HIPAA, GDPR, or SOC 2?

Zero retention helps satisfy data-minimization requirements under GDPR and the privacy and confidentiality criteria under SOC 2, and it reduces HIPAA scope by limiting where PHI can land. It is not a complete compliance program by itself, but it removes one of the largest classes of risk: vendor-side prompt storage. You still need contracts, access controls, and incident response on your own side.

How can I verify a gateway is actually zero retention?

Read the Terms of Service for retention clauses, ask the vendor for their data flow diagram, request a SOC 2 report or pen-test summary, and run a negative test by sending a uniquely tagged prompt and searching for it in any vendor-exposed dashboard, log, or support channel. Where you can observe what the gateway forwards, a byte-for-byte comparison with what you sent confirms the forwarder is transparent. A capture alone cannot prove the absence of storage, which is why the contractual and audit checks matter too.

Is BUZZ AI Gateway zero retention?

Yes. BUZZ is built as a transparent forwarder. Request and response bodies live only in memory for the duration of the upstream call. Only billing metadata (model name, token counts, timestamp, user ID, status, latency) is persisted. See buzzai.cc/api/pricing for per-model pricing, buzzai.cc/models for the model catalog, and buzzai.cc for integration details.

Conclusion

Retention is the default mode for software because storage is cheap and engineers like having logs to look at. For an LLM gateway sitting between an enterprise application and a model provider, that default is the wrong choice. Every prompt that gets retained is a future breach notification, a future audit finding, a future competitor copying your moat. The cost of that retention does not show up in your monthly bill; it shows up the day something goes wrong.

A zero-retention gateway flips the default. The bytes pass through, the meter counts the tokens, and the bodies are gone. The only thing that survives is the receipt. If you are evaluating an AI gateway and the vendor's answer to "where do my prompts live" is anything other than "they don't, and here is the architecture diagram that proves it," keep looking.

For teams who want a transparent, zero-retention forwarder that fronts Claude, GPT, Gemini, and Grok behind one key, see BUZZ AI Gateway, the public pricing endpoint, and the model catalog.