# Memory & Knowledge for Agents

> Three layers — working, episodic, semantic. Why 1M-token contexts didn't kill RAG. The 2026 vector-DB shortlist and the context-engineering patterns that actually move quality.

URL: https://agentsbooks.com/blog/agent-memory-knowledge
Published: 2026-05-19T15:30:00Z
Category: Deep Dive
Tags: memory, knowledge, rag, vector-db, context-engineering, pillar

An agent without memory is a goldfish: competent in the moment, useless an hour later. Memory and Knowledge are what make agentic firms *institutional* rather than transactional. This essay is the memory pillar.

## Three layers, each used distinctly

The single most common architecture mistake is treating "agent memory" as one thing. It's three:

1. **Working memory** — the current task's scratchpad. Lives in the LLM context window for the duration of a single conversation/task. When the task ends, it's gone.
2. **Episodic memory** — what happened, when, with whom. Append-only event log. Lives forever in the audit trail; queryable by time, by agent, by customer.
3. **Semantic memory** — long-term facts about clients, regulations, products, contexts. Lives in a vector store keyed to the agent's Identity. Updated as new facts arrive; deduplicated.

The mistake: building a vector store and calling it "agent memory". That's just semantic memory. Without working memory you have a chat session, not an agent. Without episodic memory you have no audit trail, no learning loop, no way to answer *"why did the agent do X last Tuesday?"*

Each layer has its own implementation:

- Working memory = the LLM context window + structured scratchpad in the substrate.
- Episodic memory = append-only event store (Cloud Firestore in our case; Postgres/BigQuery in others).
- Semantic memory = vector store ([Pinecone](https://www.pinecone.io/blog/), [Weaviate](https://docs.weaviate.io/weaviate), [pgvector](https://github.com/pgvector/pgvector), [Cloudflare Vectorize](https://developers.cloudflare.com/vectorize/), [MongoDB Atlas Vector Search](https://www.mongodb.com/products/platform/atlas-vector-search)).
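To make the separation concrete, here is a minimal in-process sketch of the three layers. All class and field names are hypothetical, and the semantic layer uses exact-match deduplication on normalized text as a stand-in for a real vector store; it is an illustration of the boundaries, not a description of any particular substrate.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class WorkingMemory:
    """Per-task scratchpad; discarded when the task ends."""
    notes: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Event:
    """One immutable episodic record: what happened, when, with whom."""
    at: datetime
    agent_id: str
    customer_id: str
    action: str


class EpisodicLog:
    """Append-only event log: deliberately has no update or delete methods."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def query(self, agent_id: str, start: datetime, end: datetime) -> list[Event]:
        """Return one agent's events with start <= at < end, in insertion order."""
        return [e for e in self._events
                if e.agent_id == agent_id and start <= e.at < end]


class SemanticMemory:
    """Long-term facts keyed by subject (toy stand-in for a vector store)."""

    def __init__(self) -> None:
        self._facts: dict[str, set[str]] = {}

    def remember(self, subject: str, fact: str) -> bool:
        """Store a fact; return False if it duplicates one already held."""
        known = self._facts.setdefault(subject, set())
        normalized = " ".join(fact.lower().split())
        if normalized in known:
            return False
        known.add(normalized)
        return True

    def recall(self, subject: str) -> list[str]:
        return sorted(self._facts.get(subject, set()))
```

Answering "why did the agent do X last Tuesday?" then becomes a `query` call on the episodic log with that day's bounds, independent of whatever the semantic store currently holds.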

## The 2026 inflection: 1M-token context

Until mid-2024, the RAG vs context-stuffing question had an easy answer: stuff what you can, RAG the rest. Context windows of 8K–128K tokens forced retrieval for any non-trivial knowledge base.

By 2026 the answer is conditional. Claude Opus 4.7 ships with 1M-token context. Gemini 3.1 ships with 1M. GPT-5.5 ships with 1M (via the Responses API). Stuffing a mid-sized firm's entire policy library into context is now technically possible.

But *should you?* Three reasons RAG is still load-bearing:

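1. **Cost.** 1M tokens × $5 per million input tokens = $5/task. Prompt caching lowers that, but a cached 1M-token prefix still costs ≥$0.25/task. RAG pays for a vector retrieval (~$0.001, sub-second latency) plus a 50K-token augmented context, which is about $0.25 at the same rate. Against uncached stuffing, RAG wins ~10–50× on cost depending on how much context it retrieves (about 20× at these exact figures); the gap narrows when the stuffed prefix is served from cache.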
2. **Audit.** RAG retrieves *specific* documents with citations; the agent's output can be traced to source documents. Context-stuffing returns an answer with no traceable provenance.
3. **Freshness.** A vector store updates as documents change; a context-stuffed prompt is whatever the operator dropped in at build time.
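The cost comparison above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the text ($5 per million input tokens, ~$0.001 per retrieval, a 50K-token augmented context); the function name is hypothetical, and caching discounts are deliberately left out because they vary by provider.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of sending `tokens` input tokens at a flat per-million rate."""
    return tokens / 1_000_000 * usd_per_million


PRICE_PER_MILLION = 5.0   # USD per million input tokens, figure from the text
RETRIEVAL_USD = 0.001     # approximate cost of one vector lookup, from the text

# Uncached context-stuffing: the full 1M-token window on every task.
stuffed_cost = input_cost_usd(1_000_000, PRICE_PER_MILLION)

# RAG: one retrieval plus a 50K-token augmented context.
rag_cost = RETRIEVAL_USD + input_cost_usd(50_000, PRICE_PER_MILLION)

# How many times cheaper RAG is than uncached stuffing.
ratio = stuffed_cost / rag_cost

print(f"stuffed=${stuffed_cost:.2f} rag=${rag_cost:.3f} ratio={ratio:.1f}x")
```

Changing the augmented-context size moves the ratio: a smaller retrieved context widens the gap, a larger one narrows it.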

The new decision tree:

- **Audit-critical work** → RAG, always. Traceability and record-keeping expectations (NIST AI RMF MEASURE 2.3, EU AI Act Art. 12) are far easier to meet when every answer cites its sources.
- **High-volume routine work** → RAG, for cost.
- **Long-form synthesis** (research reports, complex analyses) → stuff what makes sense, RAG the rest. Anthropic's [own context-engineering writeups](https://www.anthropic.com/engineering/built-multi-agent-research-system) cover the pattern.
- **One-off exploratory queries** → stuff if you can; cost is bounded.

Databricks's [long-context RAG research](https://www.databricks.com/blog/long-context-rag-performance-llms) measured the cost-curve cross-over points. As a rough rule of thumb, above ~30K relevant-context tokens per query, RAG usually wins.
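The decision tree above can be sketched as a small function. The `Task` fields and `Strategy` names are hypothetical, and the precedence when several flags are set (audit first, then volume, then synthesis) is an assumption on our part; the text only gives the four branches.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    RAG = "rag"
    HYBRID = "stuff what makes sense, RAG the rest"
    STUFF = "stuff"


@dataclass(frozen=True)
class Task:
    audit_critical: bool = False
    high_volume: bool = False
    long_form_synthesis: bool = False


def choose_strategy(task: Task) -> Strategy:
    """Map a task profile onto the four branches of the decision tree."""
    if task.audit_critical:
        return Strategy.RAG      # citations are required
    if task.high_volume:
        return Strategy.RAG      # per-task cost dominates
    if task.long_form_synthesis:
        return Strategy.HYBRID   # research reports, complex analyses
    return Strategy.STUFF        # one-off exploratory query, bounded cost
```

Putting the audit check first means an audit-critical research report still goes through retrieval, which matches the "RAG, always" wording of the first branch.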

## Knowledge — the firm-level layer

Memory is per-agent. Knowledge is per-firm.

A 50-person compliance firm has documents that every agent should be able to draw on: the firm's review playbook, the current regulator-position memos, the boilerplate templates, the brand voice guide. Building those into each agent's semantic memory is wasteful (N copies); leaving them at the firm level lets every agent draw on them with a single lookup.

The Knowledge primitive ([Pillar P1](/blog/eight-primitives-agentic-firm)) is where this lives. Documents are versioned, tagged with confidentiality classes, and selectively exposed to agents based on their role.
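As an illustration of versioning plus role-based exposure, here is a minimal sketch. The class names, the three confidentiality classes, and the role-to-clearance table are all hypothetical; the actual Knowledge primitive may model these differently.

```python
from dataclasses import dataclass

# Hypothetical confidentiality classes, ordered from least to most restricted.
CLASS_RANK = {"public": 0, "internal": 1, "restricted": 2}

# Hypothetical mapping from agent role to the highest class it may read.
ROLE_CLEARANCE = {
    "intake-agent": "public",
    "review-agent": "internal",
    "compliance-lead": "restricted",
}


@dataclass(frozen=True)
class KnowledgeDoc:
    doc_id: str
    version: int
    confidentiality: str
    body: str


def visible_docs(docs: list[KnowledgeDoc], role: str) -> list[KnowledgeDoc]:
    """Return the latest version of each document the role is cleared to read."""
    ceiling = CLASS_RANK[ROLE_CLEARANCE[role]]
    latest: dict[str, KnowledgeDoc] = {}
    for doc in docs:
        current = latest.get(doc.doc_id)
        if current is None or doc.version > current.version:
            latest[doc.doc_id] = doc
    allowed = (d for d in latest.values()
               if CLASS_RANK[d.confidentiality] <= ceiling)
    return sorted(allowed, key=lambda d: d.doc_id)
```

The sketch resolves the latest version before applying the clearance filter, so a role never falls back to a stale, less restricted version of a document that has since been reclassified.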

Why this matters for compliance: under [ISO/IEC 42001](https://www.iso.org/standard/42001), an organisation must document the *behaviour boundaries* of its AI systems. Knowledge is where those boundaries are encoded — and where they're auditable.

## Vector DBs — the 2026 short list

For the semantic-memory layer specifically, four options dominate as of 2026:

- **Pinecone** ([docs](https://docs.pinecone.io/guides/get-started/overview)) — managed, serverless, the default if you don't want to think about it. Pricing: pay-per-query + storage.
- **Weaviate** ([docs](https://docs.weaviate.io/weaviate)) — open-source + managed-cloud option; strong on hybrid (vector + keyword) search.
- **pgvector** ([repo](https://github.com/pgvector/pgvector)) — Postgres extension; right if you're already on Postgres and want a single data plane.
- **Cloudflare Vectorize** ([docs](https://developers.cloudflare.com/vectorize/)) — edge-local; right for low-latency global use cases.

The choice is dominated by: existing data-plane (don't add a database), latency profile (edge-local vs region-local), and operational appetite (managed vs self-host). Capability differences across the top 4 are small enough not to drive the decision.
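The three deciding factors can be written down as a short selection function. This is a sketch of the reasoning in this section, not a benchmark-backed recommendation: the parameter names are hypothetical, and the order in which the checks run is our assumption.

```python
def shortlist_vector_db(
    on_postgres: bool = False,
    needs_edge_latency: bool = False,
    needs_hybrid_search: bool = False,
    wants_self_host: bool = False,
) -> str:
    """Pick one of the four shortlisted stores from the deciding factors."""
    if on_postgres:
        return "pgvector"              # don't add a second data plane
    if needs_edge_latency:
        return "Cloudflare Vectorize"  # edge-local, low-latency global use
    if needs_hybrid_search or wants_self_host:
        return "Weaviate"              # hybrid search, open-source option
    return "Pinecone"                  # managed default
```

For example, `shortlist_vector_db(on_postgres=True, needs_hybrid_search=True)` still returns `"pgvector"`, reflecting the "don't add a database" rule taking priority over capability differences.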

The [which-vector-db-for-agents satellite](https://which-vector-db-for-agents.roei-020.workers.dev/) walks through the decision tree with worked examples; the [vector-db-cost-calculator](https://vector-db-cost-calculator.roei-020.workers.dev/) models the unit economics at scale.

## Context engineering — the new sub-discipline

How you structure the LLM's input determines output quality more than which model you use. Three patterns matter:

1. **Stable-first ordering.** Put the *cacheable* parts of the prompt first (system message, firm knowledge, role context). Put the *task-specific* parts last. Both Anthropic's and OpenAI's prompt caching match on the prompt prefix, so a change early in the prompt invalidates the cache for everything after it. Order matters.
2. **Cite-as-you-go.** Every retrieved fact should land in context with its source attribution (`<source id="policy-3.2">...</source>`). Models asked to cite generally carry those ids through to the output, so the audit trail largely builds itself.
3. **Strip noise.** Long context isn't free. Retrieved chunks should be the most-relevant sub-paragraph, not the whole document. The Memory primitive supports tiered retrieval (paragraph → section → document) for this.
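The first two patterns can be sketched as a prompt-assembly helper. The function and class names are hypothetical, the prompt is modeled as one plain string rather than any provider's message format, and the citation check is a simple regex over the `<source id="...">` convention shown above.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Chunk:
    source_id: str
    text: str


def build_prompt(system: str, firm_knowledge: str, role_context: str,
                 chunks: list[Chunk], task: str) -> str:
    """Stable-first ordering: cacheable parts, then tagged sources, then the task."""
    stable = [system, firm_knowledge, role_context]
    # Cite-as-you-go: wrap every retrieved chunk in a tag carrying its source id.
    tagged = [f"<source id={quoteattr(c.source_id)}>{escape(c.text)}</source>"
              for c in chunks]
    return "\n\n".join(stable + tagged + [task])


def unknown_citations(answer: str, chunks: list[Chunk]) -> set[str]:
    """Source ids cited in the answer that were never placed in context."""
    cited = set(re.findall(r'<source id="([^"]+)"', answer))
    return cited - {c.source_id for c in chunks}
```

Two tasks for the same agent share the `system`, `firm_knowledge`, and `role_context` prefix byte for byte, which is what prefix-based caching needs. Running `unknown_citations` on the model's output is a cheap guard against a cited id that does not correspond to any retrieved chunk.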

This isn't uncharted territory: Anthropic's [context-engineering essay](https://www.anthropic.com/engineering/context-engineering) is the most-cited writeup. The pattern is to think of the LLM as a system whose behaviour you tune by the *structure* of the input, not just the *content*.

## Counter-narrative: "RAG will die"

The strong-form version: context windows will keep growing, costs will keep dropping, and within 3 years no one will bother with retrieval. The weak-form version: RAG remains a tool for audit + cost, but a smaller part of the stack.

The weak form is right. RAG's role will narrow but not vanish. Three reasons it persists:

- Audit requirements (NIST, EU AI Act, SOC 2, ISO 42001) demand traceable citations. Context-stuffing produces opaque generations.
- Freshness matters. A vector store updated nightly serves up-to-date facts; a context-stuffed prompt is stale by design.
- Privacy isolation. RAG lets you control which documents land in which context — important when the firm's tenant boundaries map to confidentiality classes.

## Frequently asked questions

**Q: How does AgentsBooks store memory?**
A: Working memory = LLM context + substrate scratchpad. Episodic memory = Cloud Firestore (append-only event log per agent). Semantic memory = pluggable vector store (Pinecone by default; Cloudflare Vectorize for edge use cases; pgvector for SQL-native deployments).

**Q: What about long-term memory across model upgrades?**
A: Episodic + semantic memory are model-agnostic. Switching from Claude Opus 4.6 to 4.7 (or to GPT, Gemini) doesn't touch them. Only working memory is tied to the model's context window, and it is discarded at the end of each task anyway.

**Q: How does this map to the [8 primitives](/blog/eight-primitives-agentic-firm)?**
A: Memory is per-agent. Knowledge is per-firm. Both are first-class primitives in the substrate, with their own storage, their own access control, and their own audit trail.
