Tutorial evals routing regression-testing

Eval-Driven Routing: How to Change Models Without Hoping

AgentsBooks Team

2026-05-19 · 5 min read

Most teams change model versions when the vendor releases a new one and hope nothing breaks. That posture stops working when you have 14 agents in production and a $5K/month LLM bill. Eval-driven routing replaces the hoping with a regression test.

The pattern

For each task type, maintain a held-out evaluation set: 300–800 cases with known-good outputs (typically human-labeled or generated from prior production runs that passed review). When you want to change models, run the eval set through both the current and proposed model. Compare outputs against the known-good answers. Promote the new model only if it passes.

This is the same pattern as software regression testing, applied to agent behaviour.
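As a minimal sketch of the pattern, assuming each eval case boils down to an input and one known-good decision, the comparison could look like the following. `EvalCase`, `accuracy`, and `should_promote` are hypothetical names for illustration, and the stub models stand in for real API calls. The sketch checks decision accuracy only; the fuller composite is described in the next section.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    task_input: str
    expected_decision: str  # the known-good label for this case

# Assumption: a "model" is any callable that maps a task input to a decision.
Model = Callable[[str], str]

def accuracy(model: Model, cases: list[EvalCase]) -> float:
    """Fraction of cases where the model's decision matches the known-good one."""
    correct = sum(1 for c in cases if model(c.task_input) == c.expected_decision)
    return correct / len(cases)

def should_promote(current: Model, proposed: Model, cases: list[EvalCase],
                   min_accuracy: float = 0.95) -> bool:
    """Promote only if the proposed model clears the bar and does not regress.

    The "no worse than current" condition is an assumption of this sketch.
    """
    proposed_acc = accuracy(proposed, cases)
    return proposed_acc >= min_accuracy and proposed_acc >= accuracy(current, cases)

# Stub models standing in for real API calls.
cases = [
    EvalCase("c1", "refund over limit", "escalate"),
    EvalCase("c2", "refund under limit", "approve"),
]
current_model = lambda text: "escalate" if "over" in text else "approve"
proposed_model = lambda text: "approve"

print(should_promote(current_model, proposed_model, cases))
```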

What "passes" means

For an agent task, "passes" is a composite:

Output validity. Does the agent produce a structured output that matches the schema? (typically 100% required)
Decision accuracy. Does the agent's decision match the known-good decision on each case? (target: ≥95%, varies by task criticality)
Citation density. Does the agent cite the right sources, and enough of them? (target: ≥1 outlink per substantive claim)
Confidence calibration. Is the agent's reported confidence well-calibrated against actual accuracy? (Brier score ≤0.1)
Cost per task. Within budget? (target: ≤1.1× current model's cost on the same eval set)
Latency per task. Within SLA? (varies by task)
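For the calibration criterion, the Brier score is the mean squared difference between the reported confidence and the actual outcome (1 if the decision was correct, 0 if not). A small sketch, assuming the agent reports one confidence value between 0 and 1 per case; the function name is illustrative:

```python
def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared error between reported confidence and actual correctness."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one confidence per outcome, and at least one case")
    squared_errors = [(p - float(ok)) ** 2 for p, ok in zip(confidences, correct)]
    return sum(squared_errors) / len(squared_errors)

# Three cases: two confident and right, one moderately confident and wrong.
score = brier_score([0.9, 0.8, 0.6], [True, True, False])
print(round(score, 3))
```

Lower is better. A model that always reports 0.5 scores 0.25 regardless of how accurate it is, so a ≤0.1 target requires confidence values that actually track correctness.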

A model that beats the current one on accuracy but doubles the cost may or may not pass: it misses the default 1.1× cost target, and whether that target should be relaxed depends on the task's economics.
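One way to make the composite concrete is a gate that checks every criterion against per-task thresholds and reports which ones failed. This is a sketch, not the substrate's implementation: the class and field names are hypothetical, the defaults mirror the targets listed above, and the latency default is a placeholder because the article leaves it task-specific.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalMetrics:
    validity_rate: float        # share of outputs matching the schema
    decision_accuracy: float    # share of decisions matching known-good
    citations_per_claim: float
    brier: float
    cost_usd: float             # total cost on the eval set
    p95_latency_s: float

@dataclass(frozen=True)
class Thresholds:
    min_validity: float = 1.0
    min_accuracy: float = 0.95
    min_citations_per_claim: float = 1.0
    max_brier: float = 0.1
    max_cost_ratio: float = 1.1       # relative to the current model
    max_p95_latency_s: float = 10.0   # placeholder; set per task SLA

def failed_checks(candidate: EvalMetrics, baseline: EvalMetrics,
                  t: Thresholds) -> list[str]:
    """Return the names of the criteria the candidate fails (empty means pass)."""
    checks = {
        "validity": candidate.validity_rate >= t.min_validity,
        "accuracy": candidate.decision_accuracy >= t.min_accuracy,
        "citations": candidate.citations_per_claim >= t.min_citations_per_claim,
        "calibration": candidate.brier <= t.max_brier,
        "cost": candidate.cost_usd <= t.max_cost_ratio * baseline.cost_usd,
        "latency": candidate.p95_latency_s <= t.max_p95_latency_s,
    }
    return [name for name, ok in checks.items() if not ok]

baseline = EvalMetrics(1.0, 0.95, 1.2, 0.09, 4.0, 3.0)
candidate = EvalMetrics(1.0, 0.98, 1.3, 0.07, 8.0, 3.5)  # more accurate, 2x cost

print(failed_checks(candidate, baseline, Thresholds()))
print(failed_checks(candidate, baseline, Thresholds(max_cost_ratio=2.5)))
```

Keeping the thresholds in a per-task object is the design choice that lets a high-value task accept a higher cost ratio while a high-volume task keeps the default.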

Running the eval

Three engineering choices that matter:

Parallelise. A 500-case eval set should run in <5 minutes on a workhorse model. Otherwise teams skip running it.
Cache eval inputs with the vendor's prompt caching. A repeated eval set sends the same inputs on every run, which makes it a perfect prompt-caching workload (per the caching spoke).
Diff outputs structurally, not as strings. "Was the decision the same?" is what matters. "Did the wording change?" is not.
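The structural diff in the last point can be sketched as follows, assuming agent outputs are JSON objects and that the decision lives in a field named `decision`. Both the field name and the helper names are hypothetical.

```python
import json

def decision_fields(raw_output: str, fields: tuple[str, ...] = ("decision",)) -> dict:
    """Parse a JSON output and keep only the fields that carry the decision."""
    parsed = json.loads(raw_output)
    return {name: parsed.get(name) for name in fields}

def same_decision(output_a: str, output_b: str,
                  fields: tuple[str, ...] = ("decision",)) -> bool:
    """Compare two outputs on decision fields only, ignoring wording elsewhere."""
    return decision_fields(output_a, fields) == decision_fields(output_b, fields)

old = '{"decision": "escalate", "rationale": "Amount exceeds the refund limit."}'
new = '{"rationale": "The refund is above the allowed limit.", "decision": "escalate"}'

print(old == new)                # string comparison: wording and key order differ
print(same_decision(old, new))   # structural comparison: decision is unchanged
```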

The substrate ships a python eval/run.py --task <name> --model <id> command that handles all three. Results land in a comparison dashboard.
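Independently of that command, whose internals are not shown here, the parallelisation point can be illustrated with Python's standard thread pool. Model calls are network-bound, so threads are usually enough. `run_eval_parallel` and the stub model are hypothetical, and the worker count is a placeholder to tune against the vendor's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_eval_parallel(model: Callable[[str], str], inputs: list[str],
                      max_workers: int = 16) -> list[str]:
    """Run every eval input through the model concurrently.

    Executor.map returns results in input order, so outputs line up
    with the eval cases they came from.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(model, inputs))

# Stub standing in for a real (slow, network-bound) model call.
def stub_model(task_input: str) -> str:
    return "escalate" if "over" in task_input else "approve"

outputs = run_eval_parallel(stub_model, ["refund over limit", "refund under limit"])
print(outputs)
```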

When to re-run the eval

Vendor announces a new model version. Even when the API name is stable, behaviour can shift.
Task definition changes. New system prompt, new role context.
Operator suspects regression. Customer feedback or a spike in escalation rate.
Quarterly cadence regardless. Eval data ages; running once a quarter catches drift you didn't notice.
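These four triggers are simple enough to encode as a scheduled check. A sketch under the assumption that a quarter is approximated as 90 days and that the three change signals arrive as flags from elsewhere; the function and parameter names are hypothetical:

```python
from datetime import date, timedelta

def rerun_reasons(last_run: date, today: date, *,
                  model_version_changed: bool = False,
                  task_definition_changed: bool = False,
                  regression_suspected: bool = False,
                  max_age: timedelta = timedelta(days=90)) -> list[str]:
    """Return every reason the eval should be re-run (empty list means none)."""
    reasons = []
    if model_version_changed:
        reasons.append("new model version")
    if task_definition_changed:
        reasons.append("task definition changed")
    if regression_suspected:
        reasons.append("suspected regression")
    if today - last_run > max_age:
        reasons.append("quarterly cadence")
    return reasons

print(rerun_reasons(date(2026, 1, 1), date(2026, 5, 19)))
print(rerun_reasons(date(2026, 5, 1), date(2026, 5, 19), task_definition_changed=True))
```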

What to do when the eval fails

Three responses:

Don't ship. The new model is worse. Keep current.
Ship but adjust. Tune the prompt, re-label the eval cases where the new behaviour is acceptable, re-run the eval, then ship.
Ship for some task types, not others. A new model may improve some tasks and regress others. Route by task type.

The substrate supports per-task model selection (per Pillar P7). Eval-driven routing is what makes that per-task selection safe to evolve.
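As an illustration of the third response, a routing table keyed by task type can be updated so that only the tasks whose eval passed move to the new model. This is a sketch with hypothetical task names and model identifiers, not the substrate's actual configuration format.

```python
def updated_routes(current_routes: dict[str, str], candidate: str,
                   passed_by_task: dict[str, bool]) -> dict[str, str]:
    """Move a task to the candidate model only if its eval passed.

    Tasks with no eval result keep their current model.
    """
    return {
        task: candidate if passed_by_task.get(task, False) else model
        for task, model in current_routes.items()
    }

routes = {"triage": "model-a", "summarise": "model-a", "extract": "model-a"}
eval_results = {"triage": True, "summarise": False}  # "extract" was not evaluated

new_routes = updated_routes(routes, "model-b", eval_results)
print(new_routes)
```

Defaulting unevaluated tasks to their current model keeps the rule from the pattern intact: nothing is promoted without a passing eval.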

FAQ

Q: How big should the eval set be?
A: 300–800 cases per dominant task type. Big enough for stable signal; small enough to re-run in <10 minutes.
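To put rough numbers on "stable signal", the margin of error on a measured accuracy can be estimated with the normal approximation to a binomial proportion. This is a back-of-the-envelope sketch, and the approximation is itself rough when accuracy is close to 0 or 1; the function name is illustrative.

```python
import math

def accuracy_margin(accuracy: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n_cases."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_cases)

for n in (300, 800):
    print(n, round(accuracy_margin(0.95, n), 4))
```

At a measured 95% accuracy, the margin is roughly ±2.5 percentage points with 300 cases and ±1.5 with 800, so a difference of half a point between two models is within the noise at these sizes.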

Q: Where do the labels come from?
A: Human review of prior production runs is the gold standard. For routine tasks, production cases that ran in shadow mode (per the HITL spoke) generate labels at scale: wherever the agent's decision agreed with the human's, that decision becomes the label.

Q: What about evaluating reasoning quality, not just decisions?
A: Augment the eval with rubrics: score each output against a set of structured criteria. This is slower than decision-matching but produces a richer signal. Tooling support is improving (LangSmith, Phoenix, Inspect AI).
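A rubric score can be as simple as a weighted average of per-criterion points. The sketch below is tool-agnostic and does not use the API of any of the tools named above; the criteria, weights, and 0–5 scale are hypothetical.

```python
def rubric_score(scores: dict[str, int], weights: dict[str, float],
                 max_points: int = 5) -> float:
    """Weighted average of per-criterion points, normalised to the 0..1 range."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * scores[name] / max_points for name in weights)
    return weighted / total_weight

weights = {"evidence": 2.0, "logic": 2.0, "clarity": 1.0}
scores = {"evidence": 4, "logic": 5, "clarity": 3}  # points assigned by a reviewer

print(round(rubric_score(scores, weights), 2))
```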

Q: How does this relate to the routing pillar?
A: The pillar covers what to route to. This spoke covers how to change routing safely. The eval is the regression test.
