An audit-grade trail isn't a transcript. It's a structured artefact that lets a regulator or auditor answer "why did the agent do that, and on what basis?" without having to infer the answer from free text. Most agent frameworks ship a transcript and call it an audit log. This essay describes the four-tuple that actually qualifies.
The four-tuple
For every agent decision worth auditing, the trail captures:
- Intent: what the agent was trying to accomplish on this call. Encoded as a structured field, not as free text.
- Evidence: the inputs the agent drew on. Includes retrieved Knowledge documents (with IDs), prior Memory items, and the principal's request payload.
- Decision: the structured output. Not the free-text reply, but the typed decision object ({verdict: "approve", risk_score: 0.34, reasons: [...]}).
- Confidence: the agent's self-reported confidence in the decision, plus the model's logprob distribution if available.
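One way to picture the four-tuple is as a typed record. The sketch below is illustrative only: the class names, field names, and the example intent code are assumptions, not the substrate's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    knowledge_doc_ids: list   # IDs of retrieved Knowledge documents
    memory_item_ids: list     # prior Memory items the agent consulted
    request_payload: dict     # the principal's request, as received


@dataclass(frozen=True)
class Decision:
    verdict: str
    risk_score: float
    reasons: list


@dataclass(frozen=True)
class AuditRecord:
    intent: str                       # structured code, not free text
    evidence: Evidence
    decision: Decision
    confidence: float                 # self-reported, 0.0 to 1.0
    logprobs: Optional[list] = None   # only when the model exposes them
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values.
record = AuditRecord(
    intent="credit_limit_review",
    evidence=Evidence(["kb-102"], ["mem-7"], {"customer_id": "c-1"}),
    decision=Decision("approve", 0.34, ["income verified"]),
    confidence=0.82,
)
serialized = record.to_json()
```

Keeping the record frozen and serialising it with sorted keys makes each entry stable once written, which is usually what an auditor expects of a log line.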
This four-tuple is what an auditor can query. "Show me every decision in Q2 2026 where confidence was <0.7 but the verdict was 'approve'" — answerable in seconds against a four-tuple log. Unanswerable against a transcript.
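The quarter-level question above can be sketched as a single SQL filter. This uses an in-memory SQLite table purely for illustration; the table name, columns, and sample rows are assumptions, and the real store may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decisions (
        id INTEGER PRIMARY KEY,
        decided_at TEXT NOT NULL,   -- ISO 8601 UTC timestamp
        intent TEXT NOT NULL,
        verdict TEXT NOT NULL,
        confidence REAL NOT NULL
    )
    """
)

# Made-up sample rows, one of which falls outside Q2 2026.
rows = [
    ("2026-04-03T09:15:00Z", "loan_review", "approve", 0.62),
    ("2026-05-20T14:02:00Z", "loan_review", "approve", 0.91),
    ("2026-06-11T11:40:00Z", "loan_review", "reject", 0.55),
    ("2026-07-01T08:00:00Z", "loan_review", "approve", 0.40),
]
conn.executemany(
    "INSERT INTO decisions (decided_at, intent, verdict, confidence) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def low_confidence_approvals(conn, start, end, threshold=0.7):
    """Approvals in [start, end) whose confidence is below the threshold."""
    cur = conn.execute(
        "SELECT id, decided_at, confidence FROM decisions "
        "WHERE decided_at >= ? AND decided_at < ? "
        "AND verdict = 'approve' AND confidence < ? "
        "ORDER BY decided_at",
        (start, end, threshold),
    )
    return cur.fetchall()


flagged = low_confidence_approvals(
    conn, "2026-04-01T00:00:00Z", "2026-07-01T00:00:00Z"
)
```

Because verdict and confidence are typed columns rather than prose, the filter needs no text parsing. The same question against a transcript would need someone to read every reply.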
Why this satisfies the regimes
The mapping to specific clauses:
- NIST AI RMF MEASURE 2 (evaluation of trustworthy characteristics through TEVV, i.e. test, evaluation, verification, validation): TEVV results have to be documented, which calls for structured outcomes. The four-tuple gives them.
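- EU AI Act Art. 12 (record-keeping): high-risk systems must automatically record events (logs) so that the system's functioning can be traced. Transcript ≠ traceable decision; four-tuple = traceable decision.
- SOC 2 Processing Integrity PI1.4: the Processing Integrity criteria require that system processing is complete, valid, and accurate. The four-tuple is what an attestor inspects.
- ISO/IEC 42001 Clause 9 (performance evaluation): monitoring, measurement, and internal audit draw on the same structured record.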
How the substrate emits it
In the AgentsBooks substrate (Pillar P1), the four-tuple emits as a side-effect of operating. Each Heart task wraps the LLM call in an audit_decorator that captures:
- Intent: from the task definition's goal field.
- Evidence: from Memory's retrieval log + the inbound A2A/event payload.
- Decision: from the agent's typed output schema (defined in the task).
- Confidence: from the model's confidence reporting (when available) + a self-reported confidence field in the output schema.
The four-tuple lands in Episodic memory (Pillar P8) and is exposed to the operator dashboard via a structured query.
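A minimal sketch of how such a decorator could work is below. Only the name audit_decorator and the goal field come from the text above. The TaskResult type, the in-process AUDIT_LOG list standing in for Episodic memory, and the example task are all assumptions made for illustration, not the substrate's real API.

```python
import functools
from dataclasses import dataclass
from typing import Callable, Optional

AUDIT_LOG = []  # hypothetical stand-in for Episodic memory


@dataclass
class TaskResult:
    decision: dict                             # typed output of the task
    evidence: dict                             # what retrieval returned
    model_confidence: Optional[float] = None   # None if the model reports nothing


def audit_decorator(goal: str) -> Callable:
    """Record a four-tuple each time the wrapped task runs."""
    def wrap(task_fn):
        @functools.wraps(task_fn)
        def inner(payload: dict) -> TaskResult:
            result = task_fn(payload)
            AUDIT_LOG.append({
                "intent": goal,
                "evidence": {"request": payload, **result.evidence},
                "decision": result.decision,
                "confidence": {
                    "self_reported": result.decision.get("confidence"),
                    "model": result.model_confidence,
                },
            })
            return result
        return inner
    return wrap


@audit_decorator(goal="refund_eligibility_check")
def review_refund(payload: dict) -> TaskResult:
    # A real task would call the LLM here; this returns a fixed result.
    return TaskResult(
        decision={
            "verdict": "approve",
            "risk_score": 0.34,
            "reasons": ["within policy window"],
            "confidence": 0.81,
        },
        evidence={"knowledge_doc_ids": ["kb-policy-12"], "memory_item_ids": []},
    )


result = review_refund({"order_id": "o-1"})
```

The point of the wrapper is that the task author writes no logging code: the record is produced whenever the task runs, so it cannot be forgotten on one code path.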
What to leave OUT of the audit log
Three things commonly bloat audit logs without adding compliance value:
- Full chain-of-thought. Reasoning traces are useful for debugging, not for auditing. Keep them in a separate diagnostic store with shorter retention.
- Raw model API metadata (response IDs, region, etc.) beyond what's needed for cost reconciliation.
- Repeated cacheable context. Store a hash of the prompt plus the cache flags; don't store the full 50K-token prompt for every call.
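The hashing advice in the last item can be sketched with the standard library. The function name and the returned field names are assumptions; the idea is only that the log keeps a fixed-size reference instead of the prompt text.

```python
import hashlib


def prompt_reference(prompt: str, cache_hit: bool) -> dict:
    """Return a compact stand-in for a prompt, suitable for an audit record."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {
        "prompt_sha256": digest,      # 64 hex characters, whatever the prompt size
        "prompt_chars": len(prompt),
        "cache_hit": cache_hit,
    }


# Two calls that share the same large context produce the same reference.
shared_context = "policy text " * 5000
first = prompt_reference(shared_context, cache_hit=False)
second = prompt_reference(shared_context, cache_hit=True)
```

If the exact prompt must be recoverable later, it can be stored once in a separate store keyed by the digest, so the audit log itself stays small.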
The audit log should be queryable in <500ms for any rolling 90-day window. If it's slower, you've stored too much.
FAQ
Q: How long should the audit trail be retained?
A: Regulator-dependent. The EU AI Act requires logs for high-risk systems to be kept for at least six months; some financial regimes require 7 years. The substrate supports tiered retention (hot/warm/cold) so longer windows don't blow up query cost.
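As a rough sketch of tiered retention, a record's tier can be derived from its age. The thresholds below are assumptions chosen to match the figures in this essay (a 90-day hot window, a 7-year outer limit); the warm boundary and the function name are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=90)        # assumed: matches the rolling 90-day query window
WARM_WINDOW = timedelta(days=365)      # assumed boundary between warm and cold
RETENTION = timedelta(days=7 * 365)    # assumed: the 7-year financial-regime case


def storage_tier(recorded_at: datetime, now: datetime) -> str:
    """Classify a record as hot, warm, cold, or expired by its age."""
    age = now - recorded_at
    if age > RETENTION:
        return "expired"
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
tiers = [
    storage_tier(now - timedelta(days=d), now)
    for d in (30, 200, 1000, 3000)
]
```

Only the hot tier needs to meet the interactive query budget; older tiers can trade latency for cheaper storage.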
Q: Can the auditor query the trail directly?
A: With a tenant-scoped read-only token, yes. AgentsBooks exposes an /audit/decisions query endpoint that takes a filter spec and returns the four-tuples. Most attestation engagements run from this directly.
Q: How does this relate to the rest of the compliance pillar?
A: This spoke is the evidence layer. The P4 pillar is the control layer (NIST + EU + SOC2 + ISO mapping). The substrate emits both.