Skip to content
Deep Dive caching cost-optimization spoke

Prompt Caching: The Optimization That Changes Routing Math

Prompt caching changed how cost-per-task math works. This spoke walks through the mechanic, when to use it, and the specific reorganization of system prompts that maximises cache hit rate.

What prompt caching does

When you make an LLM call, you typically send: system prompt + retrieved context + task-specific input + question. Without caching, every call pays full price for every token.

With caching: the stable prefix (system prompt + retrieved context that doesn't change task-to-task) is stored on the vendor's side and re-used. Subsequent calls within the cache TTL pay a fraction (Anthropic's published rate: cache reads at ~$0.075/M for Opus-4.x vs $15/M standard — 200× delta).

The cache TTL is 5 minutes for Anthropic, longer if you opt into extended caching. OpenAI's prompt-caching mechanism is similar; Google's matches.

What this means for routing

Without caching, the per-call cost is dominated by the full prompt length. A 50K-token system prompt costs the same on every call.

With caching, the per-call cost is dominated by the non-cached portion. The same 50K-token system prompt costs full price once, then near-zero for the next 5 minutes of calls.

This changes routing math (per Pillar P7). A task that you wouldn't dare run on Opus because the per-call cost is too high becomes cheap if you can keep cache hit rate above 70%.

The reorganization

Cache hit rate is determined by prefix stability. The cache reads the longest matching prefix. So:

Put stable content first.

System prompt (stable) → Firm knowledge (stable) → Role context (stable) → Per-customer context (might be cacheable per-customer) → Per-task input (not cacheable) → User question (not cacheable).

Don't interleave. If you have a stable section, then a dynamic section, then another stable section — the cache stops at the first dynamic part and you pay full price for everything after.

Use prompt prefix markers. Anthropic's API has explicit cache_control: ephemeral markers you place at boundaries between cacheable and dynamic sections. Use them.

What hit rate to target

For agents with stable system prompts + intermittent calls: >70% cache hit rate is the threshold where caching meaningfully changes economics. Below 50% the overhead (writing the cache on first call) may exceed savings.

The substrate emits per-task cache hit metrics. The operator dashboard surfaces them at the agent + task level so the team can identify caching regressions before they show up on the bill.

When NOT to cache

Three cases where caching is wrong:

  1. Cold-start agents. An agent that fires once a day has its cache expire between calls. The first call pays cache-write overhead; the cache never gets used.
  2. High-cardinality context. If every call has a unique 50K-token customer context, there's no stable prefix. Cache won't help.
  3. Privacy-isolation requirements. Some regimes require per-tenant isolation at the vendor level. Confirm with the vendor that their cache implementation maintains your isolation guarantees.

FAQ

Q: Does prompt caching cost extra?
A: Cache writes cost slightly more than standard input (Anthropic's published rate is 1.25× for cache writes). Cache reads are a fraction of standard. Net: positive for any prompt that's read >2× before expiring.

Q: How long is the cache TTL?
A: Anthropic: 5 minutes default; longer with extended caching. OpenAI: similar. Google: matched. All three are improving over time.

Q: How does this relate to the P7 routing pillar?
A: This is the specific optimization that makes the routing math work. Without caching, frontier models are too expensive to run at scale. With caching, they become viable for far more task types.


Want caching wired into your agents? Start free →

🚀 Ready to build this yourself?

Create the agent described in this article in under 2 minutes — no code required.

Try It Free → Book a Demo
Share this article
𝕏 Share 🔗 LinkedIn
Playbooks

Turn this into a working agent

Browse all playbooks →

Ready to build this agent?

Setup takes less than 2 minutes. No coding required.

Start Building Free →
Image
Copy link
X
LinkedIn
Reddit
Download