BetaResponse caching is currently in beta. The API and behavior may change.
0), reducing both latency and cost.
Response caching is model-agnostic and works with every model available on OpenRouter across all supported endpoints, regardless of provider. Caching operates at the OpenRouter layer before the request reaches any provider, so no provider-side support is required.
Both streaming and non-streaming requests are eligible for caching. Only successful (200 OK) responses are cached. Error responses, rate limit responses, and partial results are never cached. Responses containing tool calls are cached normally since they are part of a successful completion. For streaming requests, the cached response is replayed through the same streaming pipeline, so the client receives the same content chunks on a cache hit. The id field, created timestamp, and X-Generation-Id response header in each chunk reflect the new cache-hit generation record, not the original.
Enabling Caching
There are two ways to enable response caching:1. Per-Request via Headers
Add theX-OpenRouter-Cache header to enable caching for individual requests:
The first request results in a cache MISS. The response is stored and billed normally:
Response Headers (MISS)
Response Body (MISS)
HIT with zeroed usage and no billing. Each cache hit receives its own unique generation ID (note gen-def456 below, different from the original gen-abc123):
Response Headers (HIT)
Response Body (HIT)
2. Via Presets
You can enable caching for all requests that use a specific preset by configuring these fields in the preset:| Field | Type | Description |
|---|---|---|
cache_enabled | boolean | Enable caching for all requests using this preset |
cache_ttl_seconds | number | Default TTL for cached responses (1-86400 seconds, default 300) |
cache_enabled is set on a preset, caching is automatically applied to every request that references that preset. No X-OpenRouter-Cache header is required.
Example preset configuration:
How It Works
Two requests are considered identical when they share the same API key, model, endpoint type, streaming mode, and request body (including all parameters). When caching is enabled, OpenRouter generates a cache key from these inputs. If an identical request has been made before and the cached response has not expired, the cached response is returned immediately. Changing any of these–including the model, endpoint, or switching between streaming and non-streaming–produces a different cache key and a cache miss. Since caching operates at the OpenRouter layer before the request is forwarded, it works with every model and provider across the supported endpoint types. Cache is scoped to your API key. Different API keys, even under the same account or organization, do not share cache. Rotating your API key will result in an empty cache for the new key.Non-determinism: Cached responses are returned verbatim regardless of stochastic parameters like
temperature. If you need fresh responses, use X-OpenRouter-Cache-Clear: true or a short TTL.Cache Key Details
The cache key is derived from your API key, model, endpoint type, streaming mode, and a SHA-256 hash of the request body. Streaming and non-streaming requests are cached separately, so astream: true request will not return a cached non-streaming response and vice versa. The request body is normalized before hashing, so extra whitespace does not affect the cache key. However, the property order of the JSON body is significant:
- Different property ordering in logically identical JSON (e.g.
{"model":"x","messages":[]}vs{"messages":[],"model":"x"}) will produce different cache keys - Omitting optional fields vs. explicitly sending defaults (e.g.
temperature: 1.0) produces different keys - Attribution headers (e.g.
HTTP-Referer,X-Title) and provider-specific headers are not part of the cache key - Multimodal requests (images, audio, video, file attachments) are eligible for caching. The full request body, including base64-encoded content, is included in the hash
Precedence
Request headers and preset configuration interact as follows:- If a preset explicitly sets
cache_enabled: false, caching is disabled regardless of request headers–the header cannot override a preset opt-out X-OpenRouter-Cache: falseheader disables caching even if the preset enables itX-OpenRouter-Cache: trueenables caching when the preset does not configure caching (i.e.cache_enabledis absent)–but cannot override a preset that explicitly setscache_enabled: false(rule 1 takes precedence)X-OpenRouter-Cache-TTLheader overrides the presetcache_ttl_seconds(default: 300 seconds)- If neither header nor preset is set, caching is off
Concurrent Requests
If two identical requests arrive simultaneously before the first response is written to cache, both result in a cacheMISS and are billed independently. There is no request coalescing.
Supported Endpoints
| Endpoint | API Format |
|---|---|
/api/v1/chat/completions | OpenAI Chat Completions |
/api/v1/responses | OpenAI Responses |
/api/v1/messages | Anthropic Messages |
/api/v1/embeddings | OpenAI Embeddings |
Provider caching: Some providers offer their own prompt caching (e.g. Anthropic prompt caching, OpenAI cached context). Provider caching is separate from OpenRouter response caching and the two can be used together. OpenRouter caching operates at the request level before the call reaches the provider, while provider caching operates within the provider’s infrastructure.
Request Headers
| Header | Value | Description |
|---|---|---|
X-OpenRouter-Cache | true | Enable caching for this request |
X-OpenRouter-Cache | false | Disable caching for this request (overrides preset) |
X-OpenRouter-Cache-TTL | <seconds> | Custom TTL (1-86400 seconds, default 300) |
X-OpenRouter-Cache-Clear | true | Force a cache refresh for this request |
60abc is treated as 60); decimal values are truncated (e.g., 1.5 is treated as 1). Numeric values outside the valid range are clamped to [1, 86400].
Response Headers
| Header | Value | Description |
|---|---|---|
X-OpenRouter-Cache-Status | HIT or MISS | Whether the response was served from cache |
X-OpenRouter-Cache-Age | <seconds> | How long the response has been cached (on HIT only) |
X-OpenRouter-Cache-TTL | <seconds> | Remaining TTL on HIT; full TTL on MISS |
X-Generation-Id header is also present on every response (cached or not) and is not specific to caching. On a cache hit, the generation ID is unique to that hit–it is not reused from the original response.
TTL (Time-to-Live)
The TTL controls how long a cached response remains valid.- Default: 300 seconds (5 minutes)
- Range: 1 second to 86400 seconds (24 hours)
X-OpenRouter-Cache-TTL header, or set a default TTL in your preset configuration.
Cache Clearing
To force a fresh response for a specific request, send theX-OpenRouter-Cache-Clear: true header alongside X-OpenRouter-Cache: true (or with a preset that has cache_enabled: true). This deletes the existing cached entry for that cache key, makes a new request to the provider, and stores the new response. X-OpenRouter-Cache-Clear has no effect unless caching is enabled for the request. This does not clear all cached entries–only the one matching the current request.
The new cache entry uses the TTL from the current request’s X-OpenRouter-Cache-TTL header, the preset cache_ttl_seconds, or the default (300 seconds), following the standard precedence rules.
Billing
Cache hits are free. No tokens are consumed and all billable usage counters are reported as0. For chat completions and Responses endpoints, usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens are zeroed. For the Embeddings endpoint, usage.prompt_tokens and usage.total_tokens are zeroed (completion_tokens is not present in embeddings responses). For the Anthropic Messages endpoint, usage.input_tokens and usage.output_tokens are zeroed. You are only billed for the original request that populates the cache (a cache MISS).
Cache hits do not count toward provider rate limits since the request never reaches a provider.
Limitations
- Disabled for account-level Zero Data Retention (ZDR): Response caching is not available when account-level ZDR is enforced, since caching requires temporarily storing response data. Per-request
provider.zdrdoes not affect cache eligibility. - Concurrent identical requests: If two identical requests arrive before the first response is cached, both result in a
MISS. See Concurrent Requests. - Cache eviction: Cached responses may be evicted before TTL expiry under memory pressure. There is no limit on the number of entries you can cache, but eviction under pressure means entries are not guaranteed to survive their full TTL.
Data Retention
Cached responses are stored in edge infrastructure, retained only for the TTL duration, and automatically evicted upon expiry. Cached data is accessible only via the API key that triggered the caching–no other key, account, or organization can retrieve it. Cached data is not used for training or shared with third parties.Use Cases
Agent Workflows
When an agent workflow fails partway through, you can resume from the point of failure without re-running and re-paying for identical earlier requests. Enable caching at the start of the workflow and all prior steps return immediately from cache on retry.Unit Testing
Get repeatable responses for your test suite. After the initial run populates the cache, subsequent identical requests return the same cached response every time at zero cost. For deterministic first-run results, usetemperature: 0 or a fixed seed.