Response Caching - Cache Identical API Responses

Beta: Response caching is currently in beta. The API and behavior may change.

Response caching lets you cache responses for identical API requests. When a cached response is available, OpenRouter returns it immediately from cache with no billing (all billable usage counters are reported as 0), reducing both latency and cost.

Response caching is model-agnostic and works with every model available on OpenRouter across all supported endpoints, regardless of provider. Caching operates at the OpenRouter layer before the request reaches any provider, so no provider-side support is required.

Both streaming and non-streaming requests are eligible for caching. Only successful (200 OK) responses are cached; error responses, rate limit responses, and partial results are never cached. Responses containing tool calls are cached normally, since they are part of a successful completion.

For streaming requests, the cached response is replayed through the same streaming pipeline, so the client receives the same content chunks on a cache hit. The id field and created timestamp in each chunk, along with the X-Generation-Id response header, reflect the new cache-hit generation record, not the original.

Enabling Caching

There are two ways to enable response caching:

1. Per-Request via Headers

Add the X-OpenRouter-Cache: true header to enable caching for an individual request. The first request results in a cache MISS; the response is stored and billed normally:

Response Headers (MISS)

HTTP/2 200
X-OpenRouter-Cache-Status: MISS
X-OpenRouter-Cache-TTL: 300

Response Body (MISS)

{
  "id": "gen-abc123",
  "model": "google/gemini-2.5-flash",
  "choices": ["..."],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 120,
    "total_tokens": 135
  }
}

Sending the same request again returns a cache HIT with zeroed usage and no billing. Each cache hit receives its own unique generation ID (note gen-def456 below, different from the original gen-abc123):

Response Headers (HIT)

HTTP/2 200
X-OpenRouter-Cache-Status: HIT
X-OpenRouter-Cache-Age: 12
X-OpenRouter-Cache-TTL: 288
X-Generation-Id: gen-def456

Response Body (HIT)

{
  "id": "gen-def456",
  "created": 1746000012,
  "model": "google/gemini-2.5-flash",
  "choices": ["..."],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}
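The MISS-then-HIT exchange above can be reproduced from Python with a minimal sketch like the following, using only the standard library. The cache headers and the endpoint path come from this page; the https://openrouter.ai base URL, the Bearer authorization scheme, and the helper names cache_headers and send_cached are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed base URL; this page documents only the path.
CHAT_URL = "https://openrouter.ai/api/v1/chat/completions"


def cache_headers(api_key, ttl_seconds=None):
    """Build headers that opt a single request into response caching."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
        "X-OpenRouter-Cache": "true",
    }
    if ttl_seconds is not None:
        # Optional per-request TTL override, in seconds.
        headers["X-OpenRouter-Cache-TTL"] = str(ttl_seconds)
    return headers


def send_cached(payload, api_key, ttl_seconds=None):
    """POST the payload and return (cache_status, parsed_body)."""
    request = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=cache_headers(api_key, ttl_seconds),
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        status = response.headers.get("X-OpenRouter-Cache-Status")
        return status, json.loads(response.read())
```

Calling send_cached twice with the same payload dictionary and API key would be expected to report MISS on the first call and HIT on the second, provided the entry has not expired or been evicted in between. Serializing the same dictionary both times keeps the request bodies byte-identical.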

2. Via Presets

You can enable caching for all requests that use a specific preset by configuring these fields in the preset:

Field	Type	Description
`cache_enabled`	`boolean`	Enable caching for all requests using this preset
`cache_ttl_seconds`	`number`	Default TTL for cached responses (1-86400 seconds, default 300)

When cache_enabled is set on a preset, caching is automatically applied to every request that references that preset. No X-OpenRouter-Cache header is required. Example preset configuration:

{
  "name": "cached-tests",
  "cache_enabled": true,
  "cache_ttl_seconds": 600
}

How It Works

Two requests are considered identical when they share the same API key, model, endpoint type, streaming mode, and request body (including all parameters). When caching is enabled, OpenRouter generates a cache key from these inputs. If an identical request has been made before and the cached response has not expired, the cached response is returned immediately. Changing any of these inputs (for example, the model, the endpoint, or switching between streaming and non-streaming) produces a different cache key and a cache miss.

Since caching operates at the OpenRouter layer before the request is forwarded, it works with every model and provider across the supported endpoint types.

Cache is scoped to your API key. Different API keys, even under the same account or organization, do not share cache. Rotating your API key will result in an empty cache for the new key.

Non-determinism: Cached responses are returned verbatim regardless of stochastic parameters like temperature. If you need fresh responses, use X-OpenRouter-Cache-Clear: true or a short TTL.

Cache Key Details

The cache key is derived from your API key, model, endpoint type, streaming mode, and a SHA-256 hash of the request body. Streaming and non-streaming requests are cached separately, so a stream: true request will not return a cached non-streaming response and vice versa. The request body is normalized before hashing, so extra whitespace does not affect the cache key. Property order and the other details below do affect it:

Different property ordering in logically identical JSON (e.g. {"model":"x","messages":[]} vs {"messages":[],"model":"x"}) will produce different cache keys
Omitting optional fields vs. explicitly sending defaults (e.g. temperature: 1.0) produces different keys
Attribution headers (e.g. HTTP-Referer, X-Title) and provider-specific headers are not part of the cache key
Multimodal requests (images, audio, video, file attachments) are eligible for caching. The full request body, including base64-encoded content, is included in the hash
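The key behavior described above can be illustrated with a small sketch. The real key layout and normalization routine are not documented here, so sketch_cache_key and normalize_body are hypothetical: they assume that normalization means re-serializing the JSON compactly while preserving property order, which matches the whitespace and ordering rules listed above.

```python
import hashlib
import json


def normalize_body(raw_body):
    """Re-serialize JSON compactly; property order is kept exactly as sent."""
    parsed = json.loads(raw_body)  # Python dicts preserve insertion order
    return json.dumps(parsed, separators=(",", ":"), ensure_ascii=False)


def sketch_cache_key(api_key, model, endpoint, stream, raw_body):
    """Hypothetical key layout combining the documented inputs."""
    body_hash = hashlib.sha256(normalize_body(raw_body).encode("utf-8")).hexdigest()
    key_scope = hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:16]
    mode = "stream" if stream else "nostream"
    return ":".join([key_scope, endpoint, model, mode, body_hash])


def key_for(raw_body, stream=False):
    """Convenience wrapper with fixed placeholder values for the other inputs."""
    return sketch_cache_key("sk-example", "x", "chat", stream, raw_body)


compact = '{"model":"x","messages":[]}'
spaced = '{ "model": "x",\n  "messages": [] }'
reordered = '{"messages":[],"model":"x"}'
explicit_default = '{"model":"x","messages":[],"temperature":1.0}'

print(key_for(compact) == key_for(spaced))            # True: whitespace is ignored
print(key_for(compact) == key_for(reordered))         # False: property order matters
print(key_for(compact) == key_for(explicit_default))  # False: explicit default differs
print(key_for(compact) == key_for(compact, True))     # False: streaming mode differs
```

In practice this means a client that wants reliable hits should build request bodies the same way every time, for example by serializing one dictionary with a fixed field order.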

Precedence

Request headers and preset configuration interact as follows:

1. If a preset explicitly sets cache_enabled: false, caching is disabled regardless of request headers. The header cannot override a preset opt-out.
2. An X-OpenRouter-Cache: false header disables caching even if the preset enables it.
3. An X-OpenRouter-Cache: true header enables caching when the preset does not configure caching (i.e. cache_enabled is absent), but it cannot override a preset that explicitly sets cache_enabled: false (rule 1 takes precedence).
4. The X-OpenRouter-Cache-TTL header overrides the preset cache_ttl_seconds (default: 300 seconds).
5. If neither header nor preset is set, caching is off.
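The precedence rules above can be written out as a small decision function. This is a sketch for reasoning about configurations, not OpenRouter's implementation; caching_enabled and effective_ttl are hypothetical names, None stands for "not set", and treating any header value other than true or false as "not set" is an assumption.

```python
from typing import Optional

DEFAULT_TTL_SECONDS = 300


def caching_enabled(preset_cache_enabled: Optional[bool],
                    header_value: Optional[str]) -> bool:
    """Decide whether caching applies, given the preset field and the header."""
    header = header_value.strip().lower() if header_value is not None else None
    if preset_cache_enabled is False:
        return False  # a preset opt-out wins over any header
    if header == "false":
        return False  # the header can switch off a preset that enables caching
    if header == "true":
        return True   # the header opts in when the preset is silent or enabled
    # No usable header: caching is on only if the preset enables it.
    return preset_cache_enabled is True


def effective_ttl(preset_ttl: Optional[int], header_ttl: Optional[int]) -> int:
    """Header TTL overrides the preset TTL, which overrides the default."""
    if header_ttl is not None:
        return header_ttl
    if preset_ttl is not None:
        return preset_ttl
    return DEFAULT_TTL_SECONDS


print(caching_enabled(False, "true"))  # False
print(caching_enabled(True, "false"))  # False
print(caching_enabled(None, "true"))   # True
print(caching_enabled(None, None))     # False
print(effective_ttl(600, 60))          # 60
```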

Concurrent Requests

If two identical requests arrive simultaneously before the first response is written to cache, both result in a cache MISS and are billed independently. There is no request coalescing.

Supported Endpoints

Endpoint	API Format
`/api/v1/chat/completions`	OpenAI Chat Completions
`/api/v1/responses`	OpenAI Responses
`/api/v1/messages`	Anthropic Messages
`/api/v1/embeddings`	OpenAI Embeddings

Cache keys include an endpoint type discriminator, so requests to different endpoints with identical bodies will not collide.

Provider caching: Some providers offer their own prompt caching (e.g. Anthropic prompt caching, OpenAI cached context). Provider caching is separate from OpenRouter response caching and the two can be used together. OpenRouter caching operates at the request level before the call reaches the provider, while provider caching operates within the provider’s infrastructure.

Request Headers

Header	Value	Description
`X-OpenRouter-Cache`	`true`	Enable caching for this request
`X-OpenRouter-Cache`	`false`	Disable caching for this request (overrides preset)
`X-OpenRouter-Cache-TTL`	`<seconds>`	Custom TTL (1-86400 seconds, default 300)
`X-OpenRouter-Cache-Clear`	`true`	Force a cache refresh for this request

TTL values that cannot be parsed as an integer (i.e., do not begin with digits) are ignored and fall through to the preset or default TTL. Values beginning with digits are accepted even if they contain trailing non-numeric characters (e.g., 60abc is treated as 60); decimal values are truncated (e.g., 1.5 is treated as 1). Numeric values outside the valid range are clamped to [1, 86400].
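The parsing rules above can be sketched as follows. parse_ttl_header is a hypothetical helper that mirrors the described behavior on the client side, for example to predict which TTL a given header value will produce. Stripping surrounding whitespace and treating a leading minus sign as "does not begin with digits" are assumptions.

```python
import re
from typing import Optional

MIN_TTL_SECONDS = 1
MAX_TTL_SECONDS = 86400


def parse_ttl_header(value: str) -> Optional[int]:
    """Return the effective TTL, or None if the value should be ignored."""
    match = re.match(r"\d+", value.strip())
    if match is None:
        return None  # caller falls through to the preset or default TTL
    seconds = int(match.group())  # leading digits only: "60abc" -> 60, "1.5" -> 1
    return max(MIN_TTL_SECONDS, min(MAX_TTL_SECONDS, seconds))


print(parse_ttl_header("60abc"))   # 60
print(parse_ttl_header("1.5"))     # 1
print(parse_ttl_header("abc"))     # None
print(parse_ttl_header("0"))       # 1
print(parse_ttl_header("999999"))  # 86400
```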

Response Headers

Header	Value	Description
`X-OpenRouter-Cache-Status`	`HIT` or `MISS`	Whether the response was served from cache
`X-OpenRouter-Cache-Age`	`<seconds>`	How long the response has been cached (on `HIT` only)
`X-OpenRouter-Cache-TTL`	`<seconds>`	Remaining TTL on `HIT`; full TTL on `MISS`

The X-Generation-Id header is also present on every response (cached or not) and is not specific to caching. On a cache hit, the generation ID is unique to that hit; it is not reused from the original response.
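A client can collect these headers into one small record, as in the sketch below. CacheInfo and read_cache_info are hypothetical names. The original_ttl_seconds property assumes that age plus remaining TTL on a HIT equals the TTL the entry was stored with, which is consistent with the example earlier on this page (12 + 288 = 300) but is not stated as a guarantee.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class CacheInfo:
    status: Optional[str]
    age_seconds: Optional[int]
    ttl_seconds: Optional[int]
    generation_id: Optional[str]

    @property
    def is_hit(self) -> bool:
        return self.status == "HIT"

    @property
    def original_ttl_seconds(self) -> Optional[int]:
        """Estimate the TTL the entry was stored with (see the assumption above)."""
        if self.ttl_seconds is None:
            return None
        if self.is_hit and self.age_seconds is not None:
            return self.age_seconds + self.ttl_seconds
        if self.status == "MISS":
            return self.ttl_seconds  # full TTL is reported on a MISS
        return None


def _to_int(value: Optional[str]) -> Optional[int]:
    return int(value) if value is not None and value.isdigit() else None


def read_cache_info(headers: Mapping[str, str]) -> CacheInfo:
    """Look up the cache headers case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return CacheInfo(
        status=lowered.get("x-openrouter-cache-status"),
        age_seconds=_to_int(lowered.get("x-openrouter-cache-age")),
        ttl_seconds=_to_int(lowered.get("x-openrouter-cache-ttl")),
        generation_id=lowered.get("x-generation-id"),
    )


hit = read_cache_info({
    "X-OpenRouter-Cache-Status": "HIT",
    "X-OpenRouter-Cache-Age": "12",
    "X-OpenRouter-Cache-TTL": "288",
    "X-Generation-Id": "gen-def456",
})
print(hit.is_hit, hit.original_ttl_seconds)  # True 300
```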

TTL (Time-to-Live)

The TTL controls how long a cached response remains valid.

Default: 300 seconds (5 minutes)
Range: 1 second to 86400 seconds (24 hours)

You can customize the TTL per-request using the X-OpenRouter-Cache-TTL header, or set a default TTL in your preset configuration.

Cache Clearing

To force a fresh response for a specific request, send the X-OpenRouter-Cache-Clear: true header alongside X-OpenRouter-Cache: true (or with a preset that has cache_enabled: true). This deletes the existing cached entry for that cache key, makes a new request to the provider, and stores the new response.

X-OpenRouter-Cache-Clear has no effect unless caching is enabled for the request. It does not clear all cached entries, only the one matching the current request.

The new cache entry uses the TTL from the current request's X-OpenRouter-Cache-TTL header, the preset cache_ttl_seconds, or the default (300 seconds), following the standard precedence rules.
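The clear-and-refresh behavior can be modeled with a toy in-memory cache. This is only an illustration of the documented HIT, MISS, and clear semantics, not how OpenRouter stores entries. ToyResponseCache and its fetch method are hypothetical, the clear flag stands in for X-OpenRouter-Cache-Clear: true on a request with caching enabled, and treating an entry as expired once its age reaches the TTL is an assumption.

```python
from typing import Callable, Dict, Tuple


class ToyResponseCache:
    """In-memory model of the documented behavior (not the real service)."""

    def __init__(self) -> None:
        # cache key -> (response, stored_at, ttl_seconds)
        self._entries: Dict[str, Tuple[str, float, int]] = {}

    def fetch(self, key: str, call_provider: Callable[[], str], now: float,
              ttl_seconds: int = 300, clear: bool = False):
        if clear:
            # Delete only the entry for this key; other keys are untouched.
            self._entries.pop(key, None)
        entry = self._entries.get(key)
        if entry is not None:
            response, stored_at, ttl = entry
            age = now - stored_at
            if age < ttl:
                info = {"status": "HIT", "age": int(age), "ttl": int(ttl - age)}
                return response, info
        # MISS: call the provider and store the result with the current TTL.
        response = call_provider()
        self._entries[key] = (response, now, ttl_seconds)
        return response, {"status": "MISS", "ttl": ttl_seconds}


calls = []


def provider() -> str:
    calls.append(1)
    return f"response #{len(calls)}"


cache = ToyResponseCache()
print(cache.fetch("k", provider, now=0))
# ('response #1', {'status': 'MISS', 'ttl': 300})
print(cache.fetch("k", provider, now=12))
# ('response #1', {'status': 'HIT', 'age': 12, 'ttl': 288})
print(cache.fetch("k", provider, now=20, ttl_seconds=60, clear=True))
# ('response #2', {'status': 'MISS', 'ttl': 60})
```

The last call shows that a refresh stores the new entry with the TTL of the refreshing request (60 seconds here), not the TTL of the entry it replaced.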

Billing

Cache hits are free. No tokens are consumed and all billable usage counters are reported as 0:

For the Chat Completions and Responses endpoints, usage.prompt_tokens, usage.completion_tokens, and usage.total_tokens are zeroed.
For the Embeddings endpoint, usage.prompt_tokens and usage.total_tokens are zeroed (completion_tokens is not present in embeddings responses).
For the Anthropic Messages endpoint, usage.input_tokens and usage.output_tokens are zeroed.

You are only billed for the original request that populates the cache (a cache MISS). Cache hits do not count toward provider rate limits since the request never reaches a provider.
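When tallying token usage across a mix of cached and uncached responses, the zeroed counters mean cache hits contribute nothing. The sketch below sums the counters named above for both the OpenAI-style and Anthropic-style usage objects. billable_tokens and total_billable are hypothetical helpers, and they count tokens only; they do not compute price.

```python
from typing import Iterable, Mapping

# Input and output counters named in this section, across API formats.
# total_tokens is deliberately excluded to avoid double counting.
TOKEN_FIELDS = ("prompt_tokens", "completion_tokens", "input_tokens", "output_tokens")


def billable_tokens(usage: Mapping[str, int]) -> int:
    """Sum the input and output token counters present in a usage object."""
    return sum(int(usage.get(field, 0)) for field in TOKEN_FIELDS)


def total_billable(usages: Iterable[Mapping[str, int]]) -> int:
    """Sum billable tokens over many responses; cache hits add zero."""
    return sum(billable_tokens(usage) for usage in usages)


miss_usage = {"prompt_tokens": 15, "completion_tokens": 120, "total_tokens": 135}
hit_usage = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}

print(total_billable([miss_usage, hit_usage, hit_usage]))  # 135
```

Zeroed usage alone is a weak signal that a response came from cache; the X-OpenRouter-Cache-Status header is the documented indicator.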

Limitations

Disabled for account-level Zero Data Retention (ZDR): Response caching is not available when account-level ZDR is enforced, since caching requires temporarily storing response data. Per-request provider.zdr does not affect cache eligibility.
Concurrent identical requests: If two identical requests arrive before the first response is cached, both result in a MISS. See Concurrent Requests.
Cache eviction: Cached responses may be evicted before TTL expiry under memory pressure. There is no limit on the number of entries you can cache, but eviction under pressure means entries are not guaranteed to survive their full TTL.

Data Retention

Cached responses are stored in edge infrastructure, retained only for the TTL duration, and automatically evicted upon expiry. Cached data is accessible only via the API key that triggered the caching; no other key, account, or organization can retrieve it. Cached data is not used for training or shared with third parties.

Use Cases

Agent Workflows

When an agent workflow fails partway through, you can resume from the point of failure without re-running and re-paying for identical earlier requests. Enable caching at the start of the workflow and all prior steps return immediately from cache on retry.

Unit Testing

Get repeatable responses for your test suite. After the initial run populates the cache, subsequent identical requests return the same cached response every time at zero cost, for as long as the entry remains cached. To make first-run results more repeatable, use temperature: 0 or a fixed seed.

Repeated Identical Requests

If your application makes the same request multiple times (same model, same messages, same parameters), caching ensures only the first call hits the provider. Subsequent identical calls return immediately from cache at zero cost.

Monitoring Cache Effectiveness

Cache hit and miss status is visible in your Activity log. Each cached request appears as a separate entry with a cache indicator, and you can filter the log to show only cached or non-cached requests. Every cache hit receives its own unique generation ID, so you can track individual cached responses independently.
