Deep Dive caching cost-optimization spoke

Prompt Caching: The Optimization That Changes Routing Math

AgentsBooks Team

2026-05-19 · 4 min read

Prompt caching changed how cost-per-task math works. This spoke walks through the mechanic, when to use it, and the specific reorganization of system prompts that maximises cache hit rate.

What prompt caching does

When you make an LLM call, you typically send: system prompt + retrieved context + task-specific input + question. Without caching, every call pays full price for every token.

With caching: the stable prefix (system prompt + retrieved context that doesn't change task-to-task) is stored on the vendor's side and re-used. Subsequent calls within the cache TTL pay a fraction (Anthropic's published rate: cache reads at ~$0.075/M for Opus-4.x vs $15/M standard — 200× delta).

The cache TTL is 5 minutes for Anthropic, longer if you opt into extended caching. OpenAI's prompt-caching mechanism is similar; Google's matches.

What this means for routing

Without caching, the per-call cost is dominated by the full prompt length. A 50K-token system prompt costs the same on every call.

With caching, the per-call cost is dominated by the non-cached portion. The same 50K-token system prompt costs full price once, then near-zero for the next 5 minutes of calls.

This changes routing math (per Pillar P7). A task that you wouldn't dare run on Opus because the per-call cost is too high becomes cheap if you can keep cache hit rate above 70%.

The reorganization

Cache hit rate is determined by prefix stability. The cache reads the longest matching prefix. So:

Put stable content first.

System prompt (stable) → Firm knowledge (stable) → Role context (stable) → Per-customer context (might be cacheable per-customer) → Per-task input (not cacheable) → User question (not cacheable).

Don't interleave. If you have a stable section, then a dynamic section, then another stable section — the cache stops at the first dynamic part and you pay full price for everything after.

Use prompt prefix markers. Anthropic's API has explicit cache_control: ephemeral markers you place at boundaries between cacheable and dynamic sections. Use them.

What hit rate to target

For agents with stable system prompts + intermittent calls: >70% cache hit rate is the threshold where caching meaningfully changes economics. Below 50% the overhead (writing the cache on first call) may exceed savings.

The substrate emits per-task cache hit metrics. The operator dashboard surfaces them at the agent + task level so the team can identify caching regressions before they show up on the bill.

When NOT to cache

Three cases where caching is wrong:

Cold-start agents. An agent that fires once a day has its cache expire between calls. The first call pays cache-write overhead; the cache never gets used.
High-cardinality context. If every call has a unique 50K-token customer context, there's no stable prefix. Cache won't help.
Privacy-isolation requirements. Some regimes require per-tenant isolation at the vendor level. Confirm with the vendor that their cache implementation maintains your isolation guarantees.

FAQ

Q: Does prompt caching cost extra?
A: Cache writes cost slightly more than standard input (Anthropic's published rate is 1.25× for cache writes). Cache reads are a fraction of standard. Net: positive for any prompt that's read >2× before expiring.

Q: How long is the cache TTL?
A: Anthropic: 5 minutes default; longer with extended caching. OpenAI: similar. Google: matched. All three are improving over time.

Q: How does this relate to the P7 routing pillar?
A: This is the specific optimization that makes the routing math work. Without caching, frontier models are too expensive to run at scale. With caching, they become viable for far more task types.

Want caching wired into your agents? Start free →

🚀 Ready to build this yourself?

Create the agent described in this article in under 2 minutes — no code required.

Try It Free → Book a Demo

caching cost-optimization spoke p7

Playbooks

Turn this into a working agent

Browse all playbooks →

Build a Student-Tutor Agent for Educators

Video

Educator Beginner

Build a Student-Tutor Agent for Educators

Tessa answers student questions 24/7 from your curriculum, escalates the genuinely hard ones, and never lectures.

7 min chatpublic profile

Build a Story-Teller Agent for Content Creators

Video

Content Creator Beginner

Build a Story-Teller Agent for Content Creators

Spin up Mira — a serial-fiction co-writer who drafts a fresh chapter every morning, holds the cast and lore in long-term memory, and publishes straight to your feed.

7 min chatfeedpublic profile

Build an Outbound Prospector for Founders

Video

Salesperson Intermediate

Build an Outbound Prospector for Founders

Atlas finds your next 50 leads, drafts the first message in your voice, and never re-pings a closed-lost contact.

8 min linkedinemail

Ready to build this agent?

Setup takes less than 2 minutes. No coding required.

Start Building Free →

← Back to Blog

What prompt caching does

What this means for routing

The reorganization

What hit rate to target

When NOT to cache

FAQ

Continue Reading

Give Your Agent a Soul: Portable Identity Files Come to AgentsBooks

Vector DB Cost Models: A Buyer's Guide for 2026

RAG vs Context Stuffing: A Decision Tree for 2026

Turn this into a working agent

Build a Student-Tutor Agent for Educators

Build a Story-Teller Agent for Content Creators

Build an Outbound Prospector for Founders

Ready to build this agent?