
Memory & Knowledge for Agents

An agent without memory is a goldfish: competent in the moment, useless an hour later. Memory and Knowledge are what make agentic firms institutional rather than transactional. This essay is the memory pillar.

Three layers, each used distinctly

The single most common architecture mistake is treating "agent memory" as one thing. It's three:

  1. Working memory — the current task's scratchpad. Lives in the LLM context window for the duration of a single conversation/task. When the task ends, it's gone.
  2. Episodic memory — what happened, when, with whom. Append-only event log. Lives forever in the audit trail; queryable by time, by agent, by customer.
  3. Semantic memory — long-term facts about clients, regulations, products, contexts. Lives in a vector store keyed to the agent's Identity. Updated as new facts arrive; deduplicated.

The mistake: building a vector store and calling it "agent memory". That's just semantic memory. Without working memory you have a chat session, not an agent. Without episodic memory you have no audit trail, no learning loop, no way to answer "why did the agent do X last Tuesday?"
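To make the separation concrete, here is a minimal in-process sketch of the three layers. The class and method names (`AgentMemory`, `note`, `record`, `learn`, `end_task`, `events_for`) are hypothetical, and a plain dict stands in for the vector store purely for illustration; a real semantic layer would use embeddings and similarity search.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One immutable entry in the episodic log."""
    at: datetime
    agent_id: str
    customer_id: str
    action: str


@dataclass
class AgentMemory:
    agent_id: str
    # Working memory: scratchpad for the current task only.
    working: list[str] = field(default_factory=list)
    # Episodic memory: append-only log of what happened, when, with whom.
    episodic: list[Event] = field(default_factory=list)
    # Semantic memory: long-term facts (a dict stands in for a vector store).
    semantic: dict[str, str] = field(default_factory=dict)

    def note(self, text: str) -> None:
        self.working.append(text)

    def record(self, customer_id: str, action: str,
               at: Optional[datetime] = None) -> None:
        when = at or datetime.now(timezone.utc)
        self.episodic.append(Event(when, self.agent_id, customer_id, action))

    def learn(self, key: str, fact: str) -> None:
        # Writing by key replaces the old value, so facts stay deduplicated.
        self.semantic[key] = fact

    def end_task(self) -> None:
        # Only working memory is discarded when the task ends.
        self.working.clear()

    def events_for(self, customer_id: str) -> list[Event]:
        return [e for e in self.episodic if e.customer_id == customer_id]
```

Calling `end_task()` empties the scratchpad while the event log and the facts survive, which is the behavioural difference between the layers described above.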

Each layer has its own implementation and its own storage; the FAQ at the end of this essay lists how AgentsBooks implements each one.

The 2026 inflection: 1M-token context

Until mid-2024, the RAG vs context-stuffing question had an easy answer: stuff what you can, RAG the rest. Context windows of 8K–128K tokens forced retrieval for any non-trivial knowledge base.

By 2026 the answer is conditional. Claude Opus 4.7 ships with 1M-token context. Gemini 3.1 ships with 1M. GPT-5.5 ships with 1M (via the Responses API). Stuffing a mid-sized firm's entire policy library into context is now technically possible.

But should you? Three reasons RAG is still load-bearing:

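  1. Cost. 1M tokens × $5 per million input tokens = $5/task uncached; caching lowers that, but it still comes to ≥$0.25/task. A vector retrieval (~$0.001, sub-second latency) plus a 50K-token augmented context costs about $0.25 in input tokens, so uncached stuffing is ~20× more expensive; across augmented contexts of 20K–100K tokens, RAG wins ~10–50× on cost.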
  2. Audit. RAG retrieves specific documents with citations; the agent's output can be traced to source documents. Context-stuffing returns an answer with no traceable provenance.
  3. Freshness. A vector store updates as documents change; a context-stuffed prompt is whatever the operator dropped in at build time.
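The cost argument is simple arithmetic, sketched below using only the figures quoted above ($5 per million input tokens, ~$0.001 per retrieval). The function names are illustrative, output-token costs are ignored, and the cached-prompt discount is not modelled because it varies by provider.

```python
PRICE_PER_MILLION_INPUT = 5.00  # USD, figure quoted in the text
RETRIEVAL_COST = 0.001          # USD per vector lookup, figure quoted in the text


def input_cost(tokens: int,
               price_per_million: float = PRICE_PER_MILLION_INPUT) -> float:
    """Input-token cost in USD for a single call."""
    return tokens / 1_000_000 * price_per_million


def rag_cost(context_tokens: int) -> float:
    """One retrieval plus the augmented context sent to the model."""
    return RETRIEVAL_COST + input_cost(context_tokens)


stuffed = input_cost(1_000_000)  # full 1M-token window, uncached
for context_tokens in (20_000, 50_000, 100_000):
    rag = rag_cost(context_tokens)
    print(f"{context_tokens:>7} tokens: RAG ${rag:.3f} "
          f"vs stuffed ${stuffed:.2f} ({stuffed / rag:.1f}x)")
```

Under these assumptions the ratio runs from roughly 10× (100K-token augmented context) to roughly 50× (20K tokens), so the size of the retrieved context drives most of the saving.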

The new decision tree:

  • Audit-critical work → RAG, always. Citation requirements (NIST AI RMF MEASURE-2.3, EU AI Act Art. 12) demand it.
  • High-volume routine work → RAG, for cost.
  • Long-form synthesis (research reports, complex analyses) → stuff what makes sense, RAG the rest. Anthropic's own context-engineering writeups cover the pattern.
  • One-off exploratory queries → stuff if you can; cost is bounded.
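The decision tree above can be written as a small routing function. This is a sketch under stated assumptions: the `Workload` enum, the `context_strategy` name, and the string labels are hypothetical, and falling back to RAG when an exploratory corpus does not fit the window is an assumption the text does not spell out.

```python
from enum import Enum


class Workload(Enum):
    AUDIT_CRITICAL = "audit-critical"
    HIGH_VOLUME = "high-volume routine"
    LONG_FORM = "long-form synthesis"
    EXPLORATORY = "one-off exploratory"


def context_strategy(workload: Workload, corpus_tokens: int,
                     window_tokens: int = 1_000_000) -> str:
    """Return 'rag', 'stuff', or 'hybrid' for a given kind of work."""
    if workload in (Workload.AUDIT_CRITICAL, Workload.HIGH_VOLUME):
        # Audit needs citations; volume needs low per-task cost.
        return "rag"
    if workload is Workload.LONG_FORM:
        # Stuff what makes sense, retrieve the rest.
        return "hybrid"
    # Exploratory: stuff if the corpus fits; otherwise retrieve (assumption).
    return "stuff" if corpus_tokens <= window_tokens else "rag"
```

For example, `context_strategy(Workload.EXPLORATORY, 400_000)` returns `"stuff"`, while any audit-critical workload returns `"rag"` regardless of corpus size.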

Databricks's long-context RAG research measured the cost-curve cross-over points; as a rough rule of thumb, above ~30K relevant-context tokens per query, RAG wins.

Knowledge — the firm-level layer

Memory is per-agent. Knowledge is per-firm.

A 50-person compliance firm has documents that every agent should be able to draw on: the firm's review playbook, the current regulator-position memos, the boilerplate templates, the brand voice guide. Building those into each agent's semantic memory is wasteful (N copies); leaving them at the firm level lets every agent draw on them with a single lookup.

The Knowledge primitive (Pillar P1) is where this lives. Documents are versioned, tagged with confidentiality classes, and selectively exposed to agents based on their role.
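A rough sketch of what "versioned, tagged, selectively exposed" could look like in code follows. The class names, role names, and clearance mapping are all hypothetical and are not taken from the actual Knowledge primitive; the sketch only illustrates the access rule.

```python
from dataclasses import dataclass

# Hypothetical confidentiality classes, least to most restricted.
CLASS_RANK = {"public": 0, "internal": 1, "restricted": 2}

# Hypothetical mapping: role -> highest class that role may read.
ROLE_CLEARANCE = {
    "intake-agent": "public",
    "review-agent": "internal",
    "compliance-lead": "restricted",
}


@dataclass(frozen=True)
class KnowledgeDoc:
    doc_id: str
    version: int
    confidentiality: str
    body: str


def visible_docs(docs: list[KnowledgeDoc], role: str) -> list[KnowledgeDoc]:
    """Latest version of each document the role is cleared to read."""
    ceiling = CLASS_RANK[ROLE_CLEARANCE[role]]
    latest: dict[str, KnowledgeDoc] = {}
    for doc in docs:
        current = latest.get(doc.doc_id)
        if current is None or doc.version > current.version:
            latest[doc.doc_id] = doc
    # Filter after picking the latest version, so a document that has been
    # reclassified upward does not leak through an older, less restricted copy.
    return [d for d in latest.values()
            if CLASS_RANK[d.confidentiality] <= ceiling]
```

Because the documents live once at the firm level, adding an agent means adding a role entry rather than copying the library into another agent's semantic memory.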

Why this matters for compliance: under ISO/IEC 42001, an organisation must document the behaviour boundaries of its AI systems. Knowledge is where those boundaries are encoded — and where they're auditable.

Vector DBs — the 2026 short list

For the semantic-memory layer specifically, four options dominate as of 2026:

  • Pinecone (docs) — managed, serverless, the default if you don't want to think about it. Pricing: pay-per-query + storage.
  • Weaviate (docs) — open-source + managed-cloud option; strong on hybrid (vector + keyword) search.
  • pgvector (repo) — Postgres extension; right if you're already on Postgres and want a single data plane.
  • Cloudflare Vectorize (docs) — edge-local; right for low-latency global use cases.

The choice is dominated by three factors: existing data plane (don't add a database), latency profile (edge-local vs region-local), and operational appetite (managed vs self-hosted). Capability differences across the four are small enough not to drive the decision.
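As a deliberately simplified sketch, the three factors can be checked in order. The function name and the boolean inputs are hypothetical, and mapping "willing to self-host" to Weaviate is an assumption based on its open-source option; real selection would also weigh features such as hybrid search.

```python
def pick_vector_store(on_postgres: bool, needs_edge_latency: bool,
                      will_self_host: bool) -> str:
    """Apply the three factors in the order the text lists them."""
    if on_postgres:
        # Existing data plane: don't add a database.
        return "pgvector"
    if needs_edge_latency:
        # Latency profile: edge-local beats region-local for global users.
        return "Cloudflare Vectorize"
    if will_self_host:
        # Operational appetite: open-source, self-hostable option.
        return "Weaviate"
    # Managed default when none of the above applies.
    return "Pinecone"
```

A firm already on Postgres gets `"pgvector"` before latency or hosting preferences are even considered, which mirrors the "don't add a database" priority.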

The which-vector-db-for-agents satellite walks through the decision tree with worked examples; the vector-db-cost-calculator models the unit economics at scale.

Context engineering — the new sub-discipline

How you structure the LLM's input determines output quality more than which model you use. Three patterns matter:

  1. Stable-first ordering. Put the cacheable parts of the prompt first (system message, firm knowledge, role context). Put the task-specific parts last. Both Anthropic's and OpenAI's prompt caching match on the prompt prefix, so any change near the start of the prompt invalidates the cache for everything after it. Order matters.
  2. Cite-as-you-go. Every retrieved fact should land in context with its source attribution (<source id="policy-3.2">...</source>). Models reliably preserve those tags in output. The audit trail builds itself.
  3. Strip noise. Long context isn't free. Retrieved chunks should be the most-relevant sub-paragraph, not the whole document. The Memory primitive supports tiered retrieval (paragraph → section → document) for this.
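The three patterns combine naturally in a single prompt-assembly step, sketched below. The `Chunk` type, `build_prompt`, and the `max_chunks` cutoff are hypothetical names for illustration; the sketch assumes source ids contain no quote characters and that the retriever has already trimmed each chunk to its most relevant passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str
    text: str
    score: float  # relevance from the retriever; higher is better


def build_prompt(system: str, firm_knowledge: str, role_context: str,
                 chunks: list[Chunk], task: str, max_chunks: int = 3) -> str:
    # Stable-first: these parts repeat across tasks, so they lead the prompt
    # and form a prefix that a prompt cache can reuse.
    stable = [system, firm_knowledge, role_context]
    # Strip noise: keep only the highest-scoring chunks.
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]
    # Cite-as-you-go: every retrieved fact carries its source attribution.
    cited = [f'<source id="{c.source_id}">{c.text}</source>' for c in top]
    # Task-specific content goes last.
    return "\n\n".join(stable + cited + [task])
```

Keeping the stable block byte-identical between calls is what lets the cached prefix match; even a timestamp inserted into the system message would defeat it.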

The discipline is young, but it already has a reference text: Anthropic's context-engineering essay is the most-cited writeup. The pattern is to think of the LLM as a system whose behaviour you tune by the structure of the input, not just the content.

Counter-narrative: "RAG will die"

The strong-form version: context windows will keep growing, costs will keep dropping, and within 3 years no one will bother with retrieval. The weak-form version: RAG remains a tool for audit + cost, but a smaller part of the stack.

The weak form is right. RAG's role will narrow but not vanish. Three reasons it persists:

  • Audit requirements (NIST, EU AI Act, SOC 2, ISO 42001) demand traceable citations. Context-stuffing produces opaque generations.
  • Freshness matters. A vector store updated nightly serves up-to-date facts; a context-stuffed prompt is stale by design.
  • Privacy isolation. RAG lets you control which documents land in which context — important when the firm's tenant boundaries map to confidentiality classes.

Frequently asked questions

Q: How does AgentsBooks store memory?
A: Working memory = LLM context + substrate scratchpad. Episodic memory = Cloud Firestore (append-only event log per agent). Semantic memory = pluggable vector store (Pinecone by default; Cloudflare Vectorize for edge use cases; pgvector for SQL-native deployments).

Q: What about long-term memory across model upgrades?
A: Episodic + semantic memory are model-agnostic. Switching from Claude Opus 4.6 to 4.7 (or to GPT, Gemini) doesn't touch them. Only working memory is per-call.

Q: How does this map to the 8 primitives?
A: Memory is per-agent. Knowledge is per-firm. Both are first-class primitives in the substrate, with their own storage, their own access control, and their own audit trail.

