Most teams change model versions when the vendor releases a new one and hope nothing breaks. That posture stops working when you have 14 agents in production and a $5K/month LLM bill. Eval-driven routing replaces the hoping with a regression test.
The pattern
For each task type, maintain a held-out evaluation set: 300–800 cases with known-good outputs (typically human-labeled or generated from prior production runs that passed review). When you want to change models, run the eval set through both the current and proposed model. Compare outputs against the known-good answers. Promote the new model only if it passes.
This is the same pattern as software regression testing, applied to agent behaviour.
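As a minimal sketch of the pattern, assuming each model can be wrapped as a function from task input to a decision label: the names EvalCase, accuracy and should_promote are illustrative, not part of the substrate, and the "no worse than current" rule is an assumption layered on top of the accuracy target.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    task_input: str
    expected_decision: str  # the known-good answer for this case


# Hypothetical model interface: takes the task input, returns a decision label.
Model = Callable[[str], str]


def accuracy(model: Model, cases: list[EvalCase]) -> float:
    """Fraction of cases where the model's decision matches the known-good one."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(1 for case in cases if model(case.task_input) == case.expected_decision)
    return hits / len(cases)


def should_promote(current: Model, proposed: Model, cases: list[EvalCase],
                   min_accuracy: float = 0.95) -> bool:
    """Assumed rule: the proposed model must clear the bar and not regress."""
    current_acc = accuracy(current, cases)
    proposed_acc = accuracy(proposed, cases)
    return proposed_acc >= min_accuracy and proposed_acc >= current_acc
```

Both models see exactly the same held-out cases, so any difference in the two accuracy figures comes from the model change rather than from the data.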
What "passes" means
For an agent task, "passes" is a composite:
- Output validity. Does the agent produce a structured output that matches the schema? (typically 100% required)
- Decision accuracy. Does the agent's decision match the known-good decision on each case? (target: ≥95%, varies by task criticality)
- Citation density. Does the agent cite the right sources? (target: ≥1 outlink per substantive claim)
- Confidence calibration. Is the agent's reported confidence well-calibrated against actual accuracy? (Brier score ≤0.1)
- Cost per task. Within budget? (target: ≤1.1× current model's cost on the same eval set)
- Latency per task. Within SLA? (varies by task)
A model that beats the current one on accuracy but doubles the cost may or may not pass; that depends on the task's economics.
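The composite above could be encoded roughly as follows. This is a sketch under assumptions: the EvalSummary fields, the check names and the use of an average for citation density are illustrative choices, and the latency SLA has no default because it varies by task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Aggregate metrics for one model on one eval set (field names are illustrative)."""
    schema_valid_rate: float    # fraction of outputs that match the schema
    decision_accuracy: float    # fraction matching the known-good decision
    citations_per_claim: float  # average outlinks per substantive claim
    brier: float                # calibration error, lower is better
    cost_per_task: float        # same currency for both models
    latency_s: float            # whichever latency statistic the SLA is written against


def brier_score(confidences: list[float], correct: list[bool]) -> float:
    """Mean squared gap between reported confidence and the 0/1 outcome."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need one confidence per outcome, and at least one case")
    return sum((p - float(ok)) ** 2 for p, ok in zip(confidences, correct)) / len(confidences)


def failed_checks(candidate: EvalSummary, baseline: EvalSummary, *,
                  latency_sla_s: float, min_accuracy: float = 0.95,
                  max_brier: float = 0.1, max_cost_ratio: float = 1.1) -> list[str]:
    """Return the names of the criteria the candidate misses; empty means it passes."""
    checks = {
        "output_validity": candidate.schema_valid_rate >= 1.0,
        "decision_accuracy": candidate.decision_accuracy >= min_accuracy,
        "citation_density": candidate.citations_per_claim >= 1.0,
        "confidence_calibration": candidate.brier <= max_brier,
        "cost_per_task": candidate.cost_per_task <= max_cost_ratio * baseline.cost_per_task,
        "latency_per_task": candidate.latency_s <= latency_sla_s,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed criteria, rather than a single boolean, keeps the trade-off visible: a candidate that fails only on cost_per_task can be escalated to whoever owns the task's economics instead of being rejected outright.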
Running the eval
Three engineering choices that matter:
- Parallelise. A 500-case eval set should run in <5 minutes on a workhorse model. Otherwise teams skip running it.
- Cache eval inputs in vendor caches. A repeated eval set is a perfect prompt-caching workload (per the caching spoke).
- Diff outputs structurally, not as strings. "Was the decision the same?" is what matters. "Did the wording change?" is not.
The substrate ships a python eval/run.py --task <name> --model <id> command that handles all three. Results land in a comparison dashboard.
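For readers building their own runner, here is a rough sketch of two of the three choices: parallel execution and structural diffing. It is not the implementation of eval/run.py. It assumes outputs are JSON objects and that the decision lives in the hypothetical fields named in DECISION_FIELDS. Prompt caching is vendor-specific and is left out.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

# Hypothetical: the fields that carry the decision. Wording fields are ignored.
DECISION_FIELDS = ("decision", "category")


def decision_view(raw_output: str) -> Optional[dict]:
    """Parse a JSON output and keep only decision-bearing fields; None if invalid."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    return {field: parsed.get(field) for field in DECISION_FIELDS}


def same_decision(output_a: str, output_b: str) -> bool:
    """True when both outputs are valid and agree on every decision field."""
    view_a, view_b = decision_view(output_a), decision_view(output_b)
    return view_a is not None and view_a == view_b


def run_parallel(call_model: Callable[[str], str], inputs: list[str],
                 max_workers: int = 16) -> list[str]:
    """Run all eval inputs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_model, inputs))
```

Threads suit this workload because each case spends most of its time waiting on a network call. Two outputs with different rationale text but the same decision and category compare as equal, which is the point of diffing structurally.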
When to re-run the eval
- Vendor announces a new model version. Even when the API name is stable, behaviour can shift.
- Task definition changes. New system prompt, new role context.
- Operator suspects regression. Customer feedback or a spike in escalation rate.
- Quarterly cadence regardless. Eval data ages; running once a quarter catches drift you didn't notice.
What to do when the eval fails
Three responses:
- Don't ship. The new model is worse. Keep current.
- Ship but adjust. Tune the prompt, re-label any eval cases where the new behaviour is also acceptable, re-run the eval, then ship.
- Ship for some task types, not others. A new model may improve some tasks and regress others. Route by task type.
The substrate supports per-task model selection (per Pillar P7). Eval-driven routing is what makes that per-task selection safe to evolve.
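The third response, routing by task type, might look like the sketch below. The routing table shape, task names and model ids are hypothetical; the substrate's actual configuration format is not shown here.

```python
def update_routing(routing: dict[str, str], candidate: str,
                   passed_by_task: dict[str, bool]) -> dict[str, str]:
    """Return a new routing table that adopts the candidate only where its eval passed.

    Tasks with no eval result keep their current model.
    """
    return {
        task: candidate if passed_by_task.get(task, False) else current
        for task, current in routing.items()
    }


# Hypothetical task names and model ids, for illustration only.
routing = {"triage": "model-a", "refund_review": "model-a", "summarise": "model-a"}
results = {"triage": True, "refund_review": False}
new_routing = update_routing(routing, "model-b", results)
# triage moves to model-b; refund_review failed and summarise has no result, so both stay on model-a
```

Defaulting to the current model when a task has no eval result keeps the change conservative: a task is only moved when there is evidence that the move is safe.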
FAQ
Q: How big should the eval set be?
A: 300–800 cases per dominant task type. Big enough for stable signal; small enough to re-run in <10 minutes.
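One way to sanity-check "stable signal" is the sampling error on a measured accuracy. The sketch below uses the normal approximation to the binomial and assumes the cases are independent, which real eval sets only approximate; the function name is illustrative.

```python
import math


def accuracy_margin(accuracy: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n_cases."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_cases)


for n in (300, 800):
    print(n, round(accuracy_margin(0.95, n), 3))
# prints:
# 300 0.025
# 800 0.015
```

At a measured accuracy of 95%, that is roughly ±2.5 points at 300 cases and ±1.5 points at 800, so differences between two models that are smaller than that are hard to distinguish from noise at these sizes.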
Q: Where do the labels come from?
A: Human review of prior production runs is the gold standard. For routine tasks, production cases that ran in shadow mode (per the HITL spoke), where the agent's decision can be checked against the human's, generate labels at scale.
Q: What about evaluating reasoning quality, not just decisions?
A: Augment the eval with rubrics: score each output against a structured rubric. This is slower than decision-matching but produces a richer signal. Tooling support is improving (LangSmith, Phoenix, Inspect AI).
Q: How does this relate to the routing pillar?
A: The pillar covers what to route to. This spoke covers how to change routing safely. The eval is the regression test.