
Memory & Knowledge for Agents

AgentsBooks Team

2026-05-19 · 11 min read

An agent without memory is a goldfish: competent in the moment, useless an hour later. Memory and Knowledge are what make agentic firms institutional rather than transactional. This essay is the memory pillar.

Three layers, each used distinctly

The single most common architecture mistake is treating "agent memory" as one thing. It's three:

Working memory — the current task's scratchpad. Lives in the LLM context window for the duration of a single conversation/task. When the task ends, it's gone.
Episodic memory — what happened, when, with whom. Append-only event log. Lives forever in the audit trail; queryable by time, by agent, by customer.
Semantic memory — long-term facts about clients, regulations, products, contexts. Lives in a vector store keyed to the agent's Identity. Updated as new facts arrive; deduplicated.

The mistake: building a vector store and calling it "agent memory". That's just semantic memory. Without working memory you have a chat session, not an agent. Without episodic memory you have no audit trail, no learning loop, no way to answer "why did the agent do X last Tuesday?"

Each layer has its own implementation:

Working memory = the LLM context window + structured scratchpad in the substrate.
Episodic memory = append-only event store (Cloud Firestore in our case; Postgres/BigQuery in others).
Semantic memory = vector store (Pinecone, Weaviate, pgvector, Cloudflare Vectorize, MongoDB Atlas Vector Search).
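To make the separation concrete, here is a minimal in-memory sketch of the three layers. It is an illustration under stated assumptions, not the substrate's actual implementation: every class and method name (WorkingMemory, EpisodicLog, SemanticMemory, upsert, search) is hypothetical, embeddings are plain lists of floats, and deduplication is approximated with a cosine-similarity threshold.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Scratchpad for one task; thrown away when the task ends."""
    task_id: str
    notes: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Event:
    """One immutable episodic record: what happened, when, with whom."""
    timestamp: float
    agent_id: str
    customer_id: str
    action: str


class EpisodicLog:
    """Append-only event log: deliberately no update or delete method."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def query(self, agent_id: str | None = None,
              customer_id: str | None = None,
              since: float = 0.0) -> list[Event]:
        """Filter by time, by agent, and by customer."""
        return [
            e for e in self._events
            if e.timestamp >= since
            and (agent_id is None or e.agent_id == agent_id)
            and (customer_id is None or e.customer_id == customer_id)
        ]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticMemory:
    """Toy vector store keyed to one agent identity (stand-in for a real one)."""

    def __init__(self, agent_id: str, dedupe_threshold: float = 0.98) -> None:
        self.agent_id = agent_id
        self.dedupe_threshold = dedupe_threshold  # assumed value, tune per embedder
        self._facts: list[tuple[list[float], str]] = []

    def upsert(self, embedding: list[float], fact: str) -> bool:
        """Add a fact; a near-duplicate replaces the stored one. True if new."""
        for i, (vec, _) in enumerate(self._facts):
            if cosine(vec, embedding) >= self.dedupe_threshold:
                self._facts[i] = (embedding, fact)
                return False
        self._facts.append((embedding, fact))
        return True

    def search(self, embedding: list[float], k: int = 3) -> list[str]:
        ranked = sorted(self._facts,
                        key=lambda item: cosine(item[0], embedding),
                        reverse=True)
        return [fact for _, fact in ranked[:k]]
```

The point of the sketch is the shape of each interface: the scratchpad has no persistence, the log has no mutation, and the semantic store has no notion of time.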

The 2026 inflection: 1M-token context

Until mid-2024, the RAG vs context-stuffing question had an easy answer: stuff what you can, RAG the rest. Context windows of 8K–128K tokens forced retrieval for any non-trivial knowledge base.

By 2026 the answer is conditional. Claude Opus 4.7 ships with 1M-token context. Gemini 3.1 ships with 1M. GPT-5.5 ships with 1M (via the Responses API). Stuffing a mid-sized firm's entire policy library into context is now technically possible.

But should you? Three reasons RAG is still load-bearing:

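Cost. At $5 per million input tokens, a 1M-token prompt costs $5 per task; caching lowers that, but it still costs at least $0.25 per task. A vector retrieval (~$0.001, sub-second latency) plus a 50K-token augmented context costs about $0.25 at the same price, roughly 20× less than the uncached prompt. Across augmented contexts of 100K down to 20K tokens, RAG wins ~10–50× on cost against the uncached price; a fully cached prefix narrows that gap.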
Audit. RAG retrieves specific documents with citations; the agent's output can be traced to source documents. Context-stuffing returns an answer with no traceable provenance.
Freshness. A vector store updates as documents change; a context-stuffed prompt is whatever the operator dropped in at build time.
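The cost comparison above is simple enough to check by hand. The sketch below reproduces it using only the figures quoted in the text ($5 per million input tokens, ~$0.001 per retrieval); the function names are made up for illustration, and real pricing varies by model and by caching behaviour.

```python
PRICE_PER_MILLION_INPUT_USD = 5.00  # input price quoted in the text
RETRIEVAL_COST_USD = 0.001          # per-query vector retrieval, as quoted


def prompt_cost(tokens: int,
                price_per_million: float = PRICE_PER_MILLION_INPUT_USD) -> float:
    """Input cost in USD of sending `tokens` tokens, uncached."""
    return tokens / 1_000_000 * price_per_million


def rag_cost(context_tokens: int) -> float:
    """One retrieval plus the augmented context it produces."""
    return RETRIEVAL_COST_USD + prompt_cost(context_tokens)


def cost_ratio(stuffed_tokens: int, rag_context_tokens: int) -> float:
    """How many times cheaper RAG is than stuffing, for the same task."""
    return prompt_cost(stuffed_tokens) / rag_cost(rag_context_tokens)


if __name__ == "__main__":
    for rag_tokens in (20_000, 50_000, 100_000):
        ratio = cost_ratio(1_000_000, rag_tokens)
        print(f"{rag_tokens:>7} RAG tokens: {ratio:.1f}x cheaper than 1M stuffed")
```

With these inputs the ratio comes out near 50× for a 20K-token augmented context, near 20× for 50K, and near 10× for 100K, which is where the 10–50× range comes from.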

The new decision tree:

Audit-critical work → RAG, always. Traceability and record-keeping expectations (NIST AI RMF MEASURE-2.3, EU AI Act Art. 12) are far easier to meet when every answer cites its retrieved sources.
High-volume routine work → RAG, for cost.
Long-form synthesis (research reports, complex analyses) → stuff what makes sense, RAG the rest. Anthropic's own context-engineering writeups cover the pattern.
One-off exploratory queries → stuff if you can; cost is bounded.

Databricks's long-context RAG research measured the cost-curve cross-over points; as a rough rule of thumb, above ~30K relevant-context tokens per query, RAG usually wins.
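The decision tree can be written down as a small routing function. This is one possible encoding, not a prescribed one: the Workload fields and Strategy names are hypothetical, and applying the ~30K-token rule of thumb as the tiebreaker for one-off queries is an assumption about how the two pieces of guidance combine.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    RAG = "rag"
    STUFF = "stuff"
    HYBRID = "stuff what makes sense, RAG the rest"


@dataclass
class Workload:
    audit_critical: bool = False
    high_volume: bool = False
    long_form_synthesis: bool = False
    relevant_context_tokens: int = 0


# Rule-of-thumb cross-over from the text; treat it as a tunable, not a constant.
RAG_CROSSOVER_TOKENS = 30_000


def choose_strategy(w: Workload) -> Strategy:
    # Branch order mirrors the list above: audit first, then cost.
    if w.audit_critical:
        return Strategy.RAG
    if w.high_volume:
        return Strategy.RAG
    if w.long_form_synthesis:
        return Strategy.HYBRID
    # One-off exploratory query: stuff if it fits under the cross-over
    # (assumption: the 30K rule of thumb is the tiebreaker here).
    if w.relevant_context_tokens > RAG_CROSSOVER_TOKENS:
        return Strategy.RAG
    return Strategy.STUFF
```

For example, `choose_strategy(Workload(long_form_synthesis=True))` returns `Strategy.HYBRID`, while an audit-critical workload returns `Strategy.RAG` regardless of its size.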

Knowledge — the firm-level layer

Memory is per-agent. Knowledge is per-firm.

A 50-person compliance firm has documents that every agent should be able to draw on: the firm's review playbook, the current regulator-position memos, the boilerplate templates, the brand voice guide. Building those into each agent's semantic memory is wasteful (N copies); leaving them at the firm level lets every agent draw on them with a single lookup.

The Knowledge primitive (Pillar P1) is where this lives. Documents are versioned, tagged with confidentiality classes, and selectively exposed to agents based on their role.
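A rough sketch of what "versioned, tagged with confidentiality classes, and selectively exposed by role" could look like in code. The class names, the three confidentiality levels, and the role-to-clearance table are all invented for illustration; the actual Knowledge primitive may model this differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Confidentiality(IntEnum):
    # Hypothetical classes; higher value means more restricted.
    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


@dataclass(frozen=True)
class KnowledgeDoc:
    doc_id: str
    version: int
    confidentiality: Confidentiality
    body: str


# Hypothetical mapping: the highest class each agent role may read.
ROLE_CLEARANCE = {
    "marketing-writer": Confidentiality.PUBLIC,
    "reviewer": Confidentiality.INTERNAL,
    "compliance-lead": Confidentiality.RESTRICTED,
}


class KnowledgeBase:
    """Firm-level store: one copy of each document, shared by every agent."""

    def __init__(self) -> None:
        self._versions: dict[str, list[KnowledgeDoc]] = {}

    def publish(self, doc_id: str, confidentiality: Confidentiality,
                body: str) -> KnowledgeDoc:
        """Add a new version; earlier versions are kept for audit."""
        history = self._versions.setdefault(doc_id, [])
        doc = KnowledgeDoc(doc_id, len(history) + 1, confidentiality, body)
        history.append(doc)
        return doc

    def visible_to(self, role: str) -> list[KnowledgeDoc]:
        """Latest version of every document the role is cleared to read."""
        clearance = ROLE_CLEARANCE.get(role)
        if clearance is None:
            return []  # unknown roles see nothing
        latest = (history[-1] for history in self._versions.values())
        return [d for d in latest if d.confidentiality <= clearance]
```

Because agents query the shared store rather than holding copies, updating the review playbook once changes what every cleared agent sees on its next lookup.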

Why this matters for compliance: under ISO/IEC 42001, an organisation must document the behaviour boundaries of its AI systems. Knowledge is where those boundaries are encoded — and where they're auditable.

Vector DBs — the 2026 short list

For the semantic-memory layer specifically, four options dominate as of 2026:

Pinecone (docs) — managed, serverless, the default if you don't want to think about it. Pricing: pay-per-query + storage.
Weaviate (docs) — open-source + managed-cloud option; strong on hybrid (vector + keyword) search.
pgvector (repo) — Postgres extension; right if you're already on Postgres and want a single data plane.
Cloudflare Vectorize (docs) — edge-local; right for low-latency global use cases.

The choice is dominated by: existing data-plane (don't add a database), latency profile (edge-local vs region-local), and operational appetite (managed vs self-host). Capability differences across the top 4 are small enough not to drive the decision.

The which-vector-db-for-agents satellite walks through the decision tree with worked examples; the vector-db-cost-calculator models the unit economics at scale.

Context engineering — the new sub-discipline

How you structure the LLM's input determines output quality more than which model you use. Three patterns matter:

Stable-first ordering. Put the cacheable parts of the prompt first (system message, firm knowledge, role context) and the task-specific parts last. Prompt caching at both Anthropic and OpenAI matches on the prompt prefix, so a change near the top invalidates the cache for everything after it. Order matters.
Cite-as-you-go. Every retrieved fact should land in context with its source attribution (<source id="policy-3.2">...</source>). Models reliably preserve those tags in output. The audit trail builds itself.
Strip noise. Long context isn't free. Retrieved chunks should be the most-relevant sub-paragraph, not the whole document. The Memory primitive supports tiered retrieval (paragraph → section → document) for this.
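The three patterns combine naturally in a single prompt-assembly step. The sketch below is illustrative only: build_prompt, Chunk, and cited_source_ids are hypothetical names, the relevance scores are assumed to come from whatever retriever is in use, and the source-tag format follows the example given above.

```python
import re
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Chunk:
    source_id: str
    text: str
    score: float  # relevance from the retriever; higher is better


def build_prompt(system: str, firm_knowledge: str, role_context: str,
                 chunks: list[Chunk], task: str, max_chunks: int = 3) -> str:
    # Stable-first: this prefix is identical across tasks, so it can be cached.
    stable_prefix = "\n\n".join([system, firm_knowledge, role_context])

    # Strip noise: keep only the few most relevant chunks.
    top = sorted(chunks, key=lambda c: c.score, reverse=True)[:max_chunks]

    # Cite-as-you-go: every retrieved fact carries its source id.
    sources = "\n".join(
        f"<source id={quoteattr(c.source_id)}>{escape(c.text)}</source>"
        for c in top
    )

    # Task-specific material goes last so it never disturbs the prefix.
    return "\n\n".join([stable_prefix, sources, task])


SOURCE_ID_PATTERN = re.compile(r'<source id="([^"]+)">')


def cited_source_ids(text: str) -> list[str]:
    """Pull source ids out of a prompt or a model output, for the audit trail."""
    return SOURCE_ID_PATTERN.findall(text)
```

Two calls with the same system message, firm knowledge, and role context produce prompts that share a byte-identical prefix, which is what prefix-based caching needs. Running cited_source_ids over the model's output then gives a list of source ids to record alongside the answer.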

This isn't a new field — Anthropic's context-engineering essay is the most-cited canonical writeup. The pattern is: think of the LLM as a system whose behaviour you tune by the structure of the input, not just the content.

Counter-narrative: "RAG will die"

The strong-form version: context windows will keep growing, costs will keep dropping, and within 3 years no one will bother with retrieval. The weak-form version: RAG remains a tool for audit + cost, but a smaller part of the stack.

The weak form is right. RAG's role will narrow but not vanish. Three reasons it persists:

Audit requirements (NIST, EU AI Act, SOC 2, ISO 42001) demand traceable citations. Context-stuffing produces opaque generations.
Freshness matters. A vector store updated nightly serves up-to-date facts; a context-stuffed prompt is stale by design.
Privacy isolation. RAG lets you control which documents land in which context — important when the firm's tenant boundaries map to confidentiality classes.

Frequently asked questions

Q: How does AgentsBooks store memory?
A: Working memory = LLM context + substrate scratchpad. Episodic memory = Cloud Firestore (append-only event log per agent). Semantic memory = pluggable vector store (Pinecone by default; Cloudflare Vectorize for edge use cases; pgvector for SQL-native deployments).

Q: What about long-term memory across model upgrades?
A: Episodic + semantic memory are model-agnostic. Switching from Claude Opus 4.6 to 4.7 (or to GPT, Gemini) doesn't touch them. Only working memory lives in the model's context, and it is rebuilt for each task anyway.

Q: How does this map to the 8 primitives?
A: Memory is per-agent. Knowledge is per-firm. Both are first-class primitives in the substrate, with their own storage, their own access control, and their own audit trail.
