# Model Routing & Cost-Aware Agent Design

> Cost-per-million-tokens is the wrong denominator in 2026. Prompt caching, reasoning models, and three-tier routing changed the math. How to design routing that's 10× cheaper than the naive default without losing quality.

URL: https://agentsbooks.com/blog/model-routing-cost-aware
Published: 2026-05-19T13:30:00Z
Category: Deep Dive
Tags: model-routing, cost, claude, gpt, gemini, prompt-caching, pillar

Three years ago the operator question was *which model do I use?* In 2026 it's *which model do I use for which task, at which cost, under which cache assumption?* The denominator changed. This essay is the routing pillar.

## The old denominator: cost per million tokens

For 2023–early-2024 the meaningful comparison was list price per million input + output tokens. [Artificial Analysis](https://artificialanalysis.ai/) maintained the canonical leaderboard; vendor blogs compared themselves against it.

That denominator is dead. Two things killed it:

1. **Prompt caching.** Anthropic shipped prompt caching in late 2024; OpenAI shipped a comparable mechanism shortly after; Google followed. Cached tokens cost a fraction (Anthropic's cache-hit rate is currently $0.075 per million for Opus-4.x cache reads vs $15 per million standard input — a 200× delta).
2. **Reasoning models.** o1 / o3 / Claude-with-extended-thinking pushed the *useful* output up but also pushed the *token consumption* up by 5–20×. Cost per million tokens stayed flat; cost per useful answer changed shape.

The new denominator: **cost per cached, completed task**. That's what an agentic firm actually pays for; that's what should drive routing decisions.

## The 2026 model landscape (snapshot)

Live pricing pages: [Anthropic](https://claude.com/pricing) · [OpenAI](https://openai.com/api/pricing/) · [Google](https://ai.google.dev/gemini-api/docs/pricing). Snapshots drift weekly — re-check before locking a routing config.

Three tiers, roughly:

- **Frontier reasoning.** Claude Opus 4.7, GPT-5.5, Gemini 3.1 Ultra. Used for: hard chains, multi-step planning, code generation at scale, audit-grade compliance review.
- **Workhorse.** Claude Sonnet 4.6, GPT-5.5 Mini, Gemini 3.1 Pro. Used for: 80–90% of an agentic firm's tasks. Triage, extraction, summarisation, routine reply.
- **Cheap-and-fast.** Claude Haiku 4.5, GPT-5.5 Nano, Gemini 3.1 Flash. Used for: classification, routing decisions about *other* agents, simple data shaping.

A reasonable default routing posture: send everything to the workhorse tier; escalate to frontier on signals (uncertainty, complexity, escalation flag); demote to cheap-and-fast on signals (high volume, low stakes, classification only).

## What "cost-aware" actually means

Three loops happen at different timescales:

### Loop 1 — per-task (single LLM call)

Cheapest model that *meets the quality bar* for this specific task. Most teams overshoot here. A "summarise this email and route it" task absolutely does not need Opus 4.7; Haiku 4.5 routinely does it for ~1/15 the cost.

Concrete pattern in our Heart configuration: each task carries a `model` field that *defaults to workhorse* and overrides per-task. The override is the *only* tunable an operator should think about per task.

### Loop 2 — per-eval-run (weekly)

Re-run the eval harness against alternative routing configurations. Did Haiku 4.5 hit the quality bar last week? Maybe this week Sonnet 4.6 is necessary. Maybe Opus 4.7 is overkill where you currently use it.

The eval harness is what makes routing safe. Without it, every routing change is a coin flip. With it, routing is a regression test.

### Loop 3 — per-pricing-tick (quarterly)

When vendor pricing shifts (Anthropic dropped Opus prices 30% in Q4 2025; OpenAI dropped GPT-5.5 Mini 50% in Q1 2026), re-run the cost-per-task math across the whole fleet. Often a whole tier becomes the new default.

[Our rolling-monitor script](https://github.com/agentsbooks/agentsbooks-marketing/blob/main/generator/rolling_monitor.py) polls vendor pricing pages weekly; the [check_citations.py liveness tool](https://github.com/agentsbooks/agentsbooks-marketing/blob/main/generator/check_citations.py) confirms the pages haven't moved. Operators get a quarterly digest in the dashboard.

## Routing signals that actually work

Five signals are load-bearing. The others (system-prompt-token-length, latency-percentile-at-p95, etc.) are interesting but rarely change the answer.

1. **Estimated complexity** (cheap classifier on the input). Above threshold → frontier; below → workhorse.
2. **Task category** (KYC review = frontier; daily-digest summary = workhorse; emoji-routing = cheap).
3. **Cache fit** (if the task's system prompt + Knowledge context is >70% cacheable, the routing math changes — frontier becomes cheaper than workhorse on cache hits).
4. **Confidence on previous attempt** (re-tries escalate model tier).
5. **Audit flag** (regulated work always escalates to frontier even if the workhorse would suffice — the cost delta is small, the trail discipline matters).

A naive implementation: a routing agent (itself a cheap-and-fast model call) that evaluates the five signals and returns a tier. That routing agent runs in single-digit milliseconds and adds <0.01¢ per parent task. The savings on the parent task often exceed 10×.

## What about prompt caching specifically?

Prompt caching changes routing math. Two examples:

**Without caching:** a KYC-review agent's system prompt + policy library is ~50K tokens. At Opus 4.7 standard input ($15/MTok), each task costs ~$0.75 just for the prompt. Cumulative on 1000 reviews/day = $750/day.

**With caching (75% cache-hit rate):** same agent. 50K tokens × 25% standard ($15/MTok) + 75% cached ($1.50/MTok) ≈ $0.244 per task on the prompt. Cumulative = $244/day. **3× cheaper, same model.**

For a firm with $50K/month in LLM spend, getting prompt caching right is worth $33K/month — more than any routing optimization typically delivers.

Tactical advice: organise the system prompt + Knowledge context so the *stable* parts come first (cacheable) and the *task-specific* parts come last (uncacheable). Anthropic's [context-engineering writeups](https://www.anthropic.com/engineering/built-multi-agent-research-system) cover the patterns in detail.

## What about OpenRouter / aggregators?

[OpenRouter](https://openrouter.ai/) is useful for two things: (1) testing routing strategies against multiple models without N vendor contracts, and (2) handling failover when one vendor is down. It's not useful as a production substitute for vendor-direct contracts at scale — the markup adds 5–15% you don't need.

Pattern: develop against OpenRouter for breadth; lock vendor-direct contracts for the two or three tiers your fleet actually uses; keep OpenRouter as a failover fallback.

## Counter-narrative: "the cheapest model always wins"

Sometimes. Mostly not.

The argument is: model prices keep dropping; capability keeps converging; eventually the cheap model is good enough for everything. That's mostly true at the *long-tail* (Haiku 4.5 today does what GPT-4 did in 2023, at 1/100 the cost), but it's not true at the *frontier* — the gap between today's Opus and today's Haiku on hard reasoning is still 2–5× in eval scores.

So: yes, the cheap tier keeps getting better. No, you can't run audit-grade compliance work on the cheap tier today. The routing posture should bias cheap by default but reserve the right to escalate.

## What about Llama / Mistral / DeepSeek / Qwen?

Open-weight + self-hosted models are increasingly viable for:

- **Privacy-sensitive work** where the data can't leave the firm's perimeter.
- **High-volume / low-stakes** work where the per-task cost matters and the model can run on the firm's own GPUs.
- **Hot-failover** when frontier vendors hit capacity caps.

The cost math is GPU-time + amortised hardware vs vendor list price. Below ~500K daily tokens, vendor-managed is cheaper. Above ~5M daily tokens with consistent traffic, self-hosted starts winning. The crossover is hardware-dependent and worth modelling for each firm.

## Putting it together: a routing config for a 50-person agentic firm

The firm has ~14 agents across compliance, accounting, and customer support.

- **Default routing tier:** workhorse (Sonnet 4.6 / GPT-5.5 Mini / Gemini 3.1 Pro depending on Knowledge fit).
- **Frontier escalation triggers:** complexity > 0.7 OR audit_flag = true OR retry_count ≥ 2.
- **Cheap-tier demotion triggers:** task_category = "classification" OR task_category = "routing".
- **Prompt caching:** enabled on every agent with stable system prompts (most of them).
- **Failover:** vendor-A primary, vendor-B at 1.5% baseline traffic for warm capacity, OpenRouter for capacity caps.
- **Eval cadence:** weekly re-evaluation on a 500-task held-out set; re-routing decisions queued for human approval.

Total monthly LLM spend at this firm scale: $4–9K depending on volume. With naive routing (Opus-only): $40–80K. **10× cost differential, indistinguishable output quality on the dominant task mix.**

## Frequently asked questions

**Q: Doesn't routing add latency?**
A: A cheap-tier routing model adds 50–200ms before the parent call. Net latency usually drops because the chosen model is faster than the worst-case-Opus assumption would be.

**Q: How often should I re-run the eval harness?**
A: Weekly is the sweet spot for an active firm. Vendor models change behaviour with version bumps even when the API name is stable.

**Q: What's the right eval set size?**
A: Big enough for stable signal on each task type; small enough to re-run in <10 minutes. 300–800 tasks per dominant category is typical.

**Q: How does this map to the 8 primitives?**
A: The Brain primitive carries the model selector. The Heart primitive carries per-task overrides. Memory captures the per-task cost. Knowledge captures the cacheable shared context. See [Pillar P1](/blog/eight-primitives-agentic-firm).

---

*Want to see cost-aware routing working in the substrate? [Start free →](/login?returnTo=/onboarding)*
