cache_control property), OpenRouter uses provider sticky routing to maximize cache hits — see Provider Sticky Routing below for details.
Provider Sticky Routing
To maximize cache hit rates, OpenRouter uses provider sticky routing to route your subsequent requests to the same provider endpoint after a cached request. This works automatically with both implicit caching (e.g. OpenAI, DeepSeek, Gemini 2.5) and explicit caching (e.g. Anthropiccache_control breakpoints).
How it works:
- After a request that uses prompt caching, OpenRouter remembers which provider served your request.
- Subsequent requests for the same model are routed to the same provider, keeping your cache warm.
- Sticky routing only activates when the provider’s cache read pricing is cheaper than regular prompt pricing, ensuring you always benefit from cost savings.
- If the sticky provider becomes unavailable, OpenRouter automatically falls back to the next-best provider.
- Sticky routing is not used when you specify a manual provider order via
provider.order— in that case, your explicit ordering takes priority.
Using session_id for sticky sessions
For more explicit control over sticky routing, you can pass a session_id in your request. When a session_id is present, OpenRouter uses it directly as the sticky routing key instead of deriving one from message hashing. This is especially useful for multi-turn agentic workflows where the opening messages may change between requests but you still want to route to the same provider.
You can provide session_id in two ways:
- Request body: Include
session_idas a top-level field in your request body. If both are provided, the body value takes precedence. - Header: Set the
x-session-idHTTP header.
session_id must be at most 256 characters.
session_id is set, sticky routing activates on any successful request — even before cache usage is observed — so that subsequent requests in the same session benefit from prompt caching from the start. Without session_id, sticky routing only activates after a cache hit is detected.
When using router models like Auto Router or Pareto Router, sticky routing also pins the resolved model — not just the provider. This prevents the router from selecting a different model on each turn of a conversation. See Auto Router — Session Stickiness for details.
Inspecting cache usage
To see how much caching saved on each generation, you can:- Click the detail button on the Activity page
- Use the
/api/v1/generationAPI, documented here - Check the
prompt_tokens_detailsobject in the usage response included with every API response
cache_discount field in the response body will tell you how much the response saved on cache usage. Some providers, like Anthropic, will have a negative discount on cache writes, but a positive discount (which reduces total cost) on cache reads.
When using router models like Auto Router or Pareto Router, sticky routing also pins the resolved model — not just the provider. This prevents the router from selecting a different model on each turn of a conversation. See Auto Router — Session Stickiness for details.
Usage object fields
The usage object in API responses includes detailed cache metrics in theprompt_tokens_details field:
cached_tokens: Number of tokens read from the cache (cache hit). When this is greater than zero, you’re benefiting from cached content.cache_write_tokens: Number of tokens written to the cache. This appears on the first request when establishing a new cache entry.
OpenAI
Caching price changes:- Cache writes: no cost
- Cache reads: (depending on the model) charged at 0.25x or 0.50x the price of the original input pricing
Grok
Caching price changes:- Cache writes: no cost
- Cache reads: charged at x the price of the original input pricing
Moonshot AI
Caching price changes:- Cache writes: no cost
- Cache reads: charged at x the price of the original input pricing
Groq
Caching price changes:- Cache writes: no cost
- Cache reads: charged at x the price of the original input pricing
Alibaba Qwen
Caching price changes for explicit caching:- Cache writes: charged at x the price of the original input pricing
- Cache reads: charged at x the price of the original input pricing
cache_control: { "type": "ephemeral" } to content blocks you want to
cache, using the same syntax as Anthropic explicit caching. Cache writes use a
5-minute TTL.
Alibaba explicit caching is available on deepseek/deepseek-v3.2,
qwen/qwen3-max, qwen/qwen-plus, qwen/qwen3.6-plus,
qwen/qwen3-coder-plus, and qwen/qwen3-coder-flash. Snapshot endpoints,
including qwen/qwen3.5-plus-02-15 and qwen/qwen3.5-flash-02-23, do not
support explicit caching.
Example
Anthropic Claude
Caching price changes:- Cache writes (5-minute TTL): charged at x the price of the original input pricing
- Cache writes (1-hour TTL): charged at 2x the price of the original input pricing
- Cache reads: charged at x the price of the original input pricing
- Automatic caching: Add a single
cache_controlfield at the top level of your request. The system automatically applies the cache breakpoint to the last cacheable block and advances it forward as conversations grow. Best for multi-turn conversations. - Explicit cache breakpoints: Place
cache_controldirectly on individual content blocks for fine-grained control over exactly what gets cached. There is a limit of four explicit breakpoints. It is recommended to reserve the cache breakpoints for large bodies of text, such as character cards, CSV data, RAG data, book chapters, etc.
Automatic caching (top-level
cache_control) is only supported when requests are routed to the Anthropic provider directly. Amazon Bedrock and Google Vertex AI currently do not support top-level cache_control — when it is present, OpenRouter will only route to the Anthropic provider and exclude Bedrock and Vertex endpoints. Explicit per-block cache_control breakpoints work across all Anthropic-compatible providers including Bedrock and Vertex.Responses API support: The Responses API only supports automatic caching via top-level
cache_control. Explicit per-block cache breakpoints inside input items are not exposed through the Responses API — use the Chat Completions or Anthropic Messages API if you need fine-grained breakpoints."ttl": "1h" in the cache_control object.
Click here to read more about Anthropic prompt caching and its limitation.
Minimum token requirements
Each model has a minimum cacheable prompt length (see Anthropic’s cache limitations):- 4,096 tokens: Claude Opus 4.8, Claude Opus 4.7, Claude Opus 4.6, Claude Opus 4.5, Claude Haiku 4.5
- 2,048 tokens: Claude Haiku 3.5
- 1,024 tokens: Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4
Cache TTL Options
OpenRouter supports two cache TTL values for Anthropic:- 5 minutes (default):
"cache_control": { "type": "ephemeral" } - 1 hour:
"cache_control": { "type": "ephemeral", "ttl": "1h" }
Examples
Automatic caching (recommended for multi-turn conversations)
With automatic caching, addcache_control at the top level of the request. The system automatically caches all content up to the last cacheable block:
Explicit cache breakpoints (fine-grained control)
System message caching example (default 5-minute TTL):DeepSeek
Caching price changes:- Cache writes: charged at the same price as the original input pricing
- Cache reads: charged at x the price of the original input pricing
Google Gemini
Implicit Caching
Gemini 2.5 Pro and 2.5 Flash models now support implicit caching, providing automatic caching functionality similar to OpenAI’s automatic caching. Implicit caching works seamlessly — no manual setup or additionalcache_control breakpoints required.
Pricing Changes:
- No cache write or storage costs.
- Cached tokens are charged at x the original input token cost.
Pricing Changes for Cached Requests:
- Cache Writes: Charged at the input token cost plus 5 minutes of cache storage, calculated as follows:
- Cache Reads: Charged at × the original input token cost.
Supported Models and Limitations:
Only certain Gemini models support caching. Please consult Google’s Gemini API Pricing Documentation for the most current details. Cache Writes have a 5 minute Time-to-Live (TTL) that does not update. After 5 minutes, the cache expires and a new cache must be written. Gemini models have typically have a 4096 token minimum for cache write to occur. Cached tokens count towards the model’s maximum token usage. Gemini 2.5 Pro has a minimum of tokens, and Gemini 2.5 Flash has a minimum of tokens.How Gemini Prompt Caching works on OpenRouter:
OpenRouter simplifies Gemini cache management, abstracting away complexities:- You do not need to manually create, update, or delete caches.
- You do not need to manage cache names or TTL explicitly.
How to Enable Gemini Prompt Caching:
Gemini caching in OpenRouter requires you to insertcache_control breakpoints explicitly within message content, similar to Anthropic. We recommend using caching primarily for large content pieces (such as CSV files, lengthy character cards, retrieval augmented generation (RAG) data, or extensive textual sources).
Gemini has a single
systemInstruction field, and cached Gemini content
treats that systemInstruction as immutable. On OpenRouter, this means
cache_control inside the first system or developer message can cache
the normalized system prompt, but it cannot preserve an uncached dynamic tail
inside that same message. If you need part of your prompt to stay dynamic,
move that dynamic content into a later user message instead of appending it
after a cached block in the first system message.Examples:
System Message Caching Example
user message rather
than as uncached trailing content in the first system message.