1M-token context windows changed the question but didn't kill RAG. This essay is the practical decision tree.
The new question
In 2023, the default was "stuff what you can, RAG the rest". Context windows were 8K–128K; anything bigger needed retrieval.
In 2026, Claude Opus 4.7 + Gemini 3.1 + GPT-5.5 all ship 1M-token context. A mid-sized firm's entire policy library fits. So the question becomes: when is RAG still better?
The decision tree
Start at the top, follow the first matching branch:
1. Is the query subject to audit (compliance, regulatory, legal-sign-off)?
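→ RAG, always. NIST AI RMF (a voluntary framework), EU AI Act Art. 12 (record-keeping) and ISO/IEC 42001 all push toward traceability and logging, and per-answer citations are the most direct way to demonstrate it. Plain context-stuffing leaves no record of which passage supported an answer; RAG logs the retrieved chunks by construction.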
2. Is the knowledge base >100K tokens AND more than 50% of it irrelevant to a typical query?
→ RAG. Stuffing 100K+ mostly irrelevant tokens into every call is wasteful even with caching.
3. Must the knowledge be fresher than about 5 minutes (it updates often)?
→ RAG. A vector store updates as documents change; a context-stuffed prompt is whatever was dropped in at build time, and every refresh of it invalidates the cached prefix. The 5-minute threshold matches the cache TTL.
4. Is the query exploratory / one-off / synthesis-heavy?
→ Context-stuffing. RAG retrieves narrowly; for cross-document synthesis, stuffing gives the model more material to compose from.
5. Otherwise:
→ Context-stuffing with caching. If the knowledge fits in the cacheable prefix and is mostly relevant, cache it and stuff.
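The five branches can be written down as a first-match function. This is a minimal sketch, not a library API: the `Workload` fields and the returned labels are hypothetical names chosen for illustration, and branch 3 is read as "the knowledge must be fresher than the 5-minute cache TTL".

```python
from dataclasses import dataclass

CACHE_TTL_SECONDS = 5 * 60  # the 5-minute cache TTL used in branch 3


@dataclass
class Workload:
    """Hypothetical description of a query workload (field names are illustrative)."""
    audited: bool                  # compliance, regulatory, or legal sign-off
    knowledge_tokens: int          # size of the knowledge base, in tokens
    irrelevant_fraction: float     # share irrelevant to a typical query, 0..1
    max_staleness_seconds: float   # how stale the knowledge is allowed to be
    synthesis_heavy: bool          # exploratory, one-off, or cross-document


def choose_pattern(w: Workload) -> str:
    """Return the first matching branch, in the order the essay lists them."""
    if w.audited:                                                    # branch 1
        return "rag"
    if w.knowledge_tokens > 100_000 and w.irrelevant_fraction > 0.5:  # branch 2
        return "rag"
    if w.max_staleness_seconds < CACHE_TTL_SECONDS:                   # branch 3
        return "rag"
    if w.synthesis_heavy:                                             # branch 4
        return "context-stuffing"
    return "context-stuffing+cache"                                   # branch 5


# An audited workload short-circuits at branch 1, whatever else is true.
print(choose_pattern(Workload(True, 10_000, 0.0, 3600, True)))  # rag
```

Because the branches are ordered, an audited workload never reaches the cost or synthesis checks, which mirrors "follow the first matching branch".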
The math
Databricks's long-context RAG research found the cost-curve crossover roughly at 30K relevant-context tokens per query. Above that, RAG dominates on cost. Below, the cost difference is small enough that the right answer is decided by audit + freshness + synthesis needs.
A concrete example:
A KYC review agent has 80K tokens of firm policy + 50K tokens of case context.
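- Stuff everything (130K tokens), no caching: ~$1.95 per call on Opus 4.x (this implies $15 per million input tokens).
- Stuff everything with 70% cache hit on the policy: ~$1.28 per call, assuming cache hits bill at 10% of the base input rate and misses re-write the cache at 125%. The 50K tokens of uncached case context alone cost $0.75.
- RAG: retrieve top-20 policy chunks (~10K tokens) + case context (50K tokens): ~$0.90 per call, with an audit-grade citation trail.
In this case: RAG wins on audit and, at a 70% hit rate, on input cost too; cached stuffing only draws level as the hit rate approaches 100% (~$0.87 per call). The right answer depends on which constraint is binding. For a regulated firm, audit always wins.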
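The arithmetic behind the three scenarios can be checked with a small calculator. This is a sketch under assumptions the example does not spell out: a base input price of $15 per million tokens (the rate implied by the $1.95 and $0.90 figures), cache hits billed at 10% of base, and cache misses billed at 125% of base because they re-write the cache. Output tokens and retrieval costs are ignored, and all function names are illustrative.

```python
BASE_PER_MTOK = 15.00     # assumed base input price, USD per million tokens
CACHE_READ_MULT = 0.10    # assumed: a cache hit bills at 10% of base
CACHE_WRITE_MULT = 1.25   # assumed: a cache miss re-writes the prefix at 125% of base


def input_cost(tokens: int, multiplier: float = 1.0) -> float:
    """Input-token cost in USD for one call."""
    return tokens / 1_000_000 * BASE_PER_MTOK * multiplier


def stuffed_cost(policy: int, case: int) -> float:
    """Everything in the prompt, nothing cached."""
    return input_cost(policy + case)


def cached_stuffed_cost(policy: int, case: int, hit_rate: float) -> float:
    """Policy prefix cached with the given hit rate; case context never cached."""
    hits = hit_rate * input_cost(policy, CACHE_READ_MULT)
    misses = (1 - hit_rate) * input_cost(policy, CACHE_WRITE_MULT)
    return hits + misses + input_cost(case)


def rag_cost(retrieved: int, case: int) -> float:
    """Only the retrieved chunks plus the case context are sent."""
    return input_cost(retrieved + case)


POLICY, CASE, RETRIEVED = 80_000, 50_000, 10_000
print(f"{stuffed_cost(POLICY, CASE):.2f}")               # 1.95
print(f"{cached_stuffed_cost(POLICY, CASE, 0.70):.2f}")  # 1.28
print(f"{rag_cost(RETRIEVED, CASE):.2f}")                # 0.90
```

Under these assumptions the 50K tokens of uncached case context ($0.75) dominate every scenario, so caching the policy alone moves the total less than the raw token counts suggest.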
Hybrid is fine
Real production agents often use both. Pattern:
- Stable firm knowledge → cached context (always present, always cited via inline source IDs).
- Per-case dynamic context → stuffed (per-case, not cacheable).
- Long-tail policy reference (the 90% of policy not relevant to most cases) → RAG (retrieved when needed, with citation trail).
This is what the Memory primitive supports natively — tiered access to all three layers (per Pillar P8).
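As a rough illustration of the three tiers, the sketch below assembles one prompt from stable knowledge, retrieved long-tail policy, and per-case context. It is provider-agnostic and is not the Memory primitive's actual API, which this essay does not show; `Passage`, `build_prompt`, and the `cacheable` flag are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source_id: str  # illustrative ID format, e.g. a policy document number
    text: str


def render(passages: list[Passage]) -> str:
    """Prefix every passage with its source ID so the model can cite it inline."""
    return "\n".join(f"[{p.source_id}] {p.text}" for p in passages)


def build_prompt(stable: list[Passage], case_context: str,
                 retrieved: list[Passage], question: str):
    """Return ordered prompt segments plus the source IDs available for citation."""
    segments = [
        # Stable knowledge goes first so it can form an unchanging, cacheable prefix.
        {"role": "stable_knowledge", "cacheable": True, "text": render(stable)},
        # Long-tail policy, retrieved per query, varies from call to call.
        {"role": "retrieved_policy", "cacheable": False, "text": render(retrieved)},
        # Per-case context is stuffed as-is.
        {"role": "case_context", "cacheable": False, "text": case_context},
        {"role": "question", "cacheable": False, "text": question},
    ]
    citable = [p.source_id for p in stable] + [p.source_id for p in retrieved]
    return segments, citable
```

Keeping the stable tier at the front matters because prompt caches generally match on an exact prefix; anything that changes per call belongs after it. Logging `citable` alongside the answer gives the audit trail a fixed list to check citations against.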
Common mistakes
- Defaulting to RAG without measuring. Adds latency + cost for queries where stuffing would work fine.
- Defaulting to stuffing because "it's simpler". Loses audit trail. Compliance auditor reads the agent's output and asks "which policy?" — no answer.
- Re-retrieving the same chunks every call. Cache them. The substrate's semantic memory layer supports cached retrieval.
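The last mistake has a simple mitigation even without a dedicated memory layer. The sketch below wraps any search function in a time-limited cache; it matches queries exactly after normalising case and whitespace, which is a much weaker stand-in for the semantic matching a real memory layer might offer. `CachedRetriever` and its parameters are hypothetical.

```python
import time
from typing import Callable


class CachedRetriever:
    """Wrap a search function so repeated queries reuse earlier results."""

    def __init__(self, search: Callable[[str], list[str]],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._search = search
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._cache: dict[str, tuple[float, list[str]]] = {}

    def retrieve(self, query: str) -> list[str]:
        key = " ".join(query.lower().split())  # normalise case and whitespace
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                    # still fresh: skip the search
        chunks = self._search(query)           # miss or expired: search again
        self._cache[key] = (now, chunks)
        return chunks
```

The TTL keeps the freshness trade-off explicit: a longer TTL saves more retrieval calls but serves older chunks.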
FAQ
Q: What about agents that do iterative reasoning (read, think, retrieve more, think more)?
A: That's just RAG with multiple retrieval rounds. Each round contributes citable chunks to the audit trail.
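A multi-round loop of this kind might look like the sketch below. The `retrieve` and `reason` callables are placeholders for a real retriever and model call, and their signatures are assumptions made for illustration: `retrieve` returns `(source_id, text)` pairs, and `reason` returns an answer plus an optional follow-up query.

```python
from typing import Callable, Optional

Chunk = tuple[str, str]  # (source_id, text)


def iterative_answer(
    question: str,
    retrieve: Callable[[str], list[Chunk]],
    reason: Callable[[str, list[Chunk]], tuple[str, Optional[str]]],
    max_rounds: int = 3,
):
    """Retrieve, reason, and retrieve again until no follow-up query is requested."""
    trail = []                     # one audit entry per retrieval round
    evidence: list[Chunk] = []     # all chunks seen so far
    query = question
    answer = ""
    for round_no in range(1, max_rounds + 1):
        chunks = retrieve(query)
        evidence.extend(chunks)
        trail.append({
            "round": round_no,
            "query": query,
            "source_ids": [source_id for source_id, _ in chunks],
        })
        answer, follow_up = reason(question, evidence)
        if follow_up is None:      # the model has what it needs
            break
        query = follow_up          # otherwise search again with the new query
    return answer, trail
```

Each entry in `trail` records which query produced which source IDs, so an auditor can see which round a given citation came from.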
Q: Will RAG die when context windows hit 10M tokens?
A: It will narrow (the cost-curve crossover shifts). It won't die — audit + freshness keep RAG load-bearing regardless of context size.
Q: How does this map to Pillar P8?
A: The pillar covers all three memory layers. This spoke is the practical decision tree between the two specific patterns most teams agonise over.