
What an Audit-Grade Trail for Agents Actually Looks Like

An audit-grade trail isn't a transcript. It's a structured artefact that lets a regulator or auditor answer "why did the agent do that, and on what basis?" without inferring. Most agent frameworks ship a transcript and call it an audit log. This essay shows the four-tuple that actually qualifies.

The four-tuple

For every agent decision worth auditing, the trail captures:

  1. Intent — what the agent was trying to accomplish on this call. Encoded as a structured field, not as free text.
  2. Evidence — the inputs the agent drew on. This includes retrieved Knowledge documents (with IDs), prior Memory items, and the principal's request payload.
  3. Decision — the structured output. Not the free-text reply — the typed decision object ({verdict: "approve", risk_score: 0.34, reasons: [...]}).
  4. Confidence — the agent's self-reported confidence in the decision, plus the model's logprob distribution if available.
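
As a rough illustration, the four fields above could be captured in a typed record along these lines. This is a minimal sketch: the class names, field names, and the example values are assumptions made for illustration, not the AgentsBooks schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional
import json


@dataclass(frozen=True)
class Evidence:
    knowledge_doc_ids: tuple[str, ...]  # IDs of retrieved Knowledge documents
    memory_item_ids: tuple[str, ...]    # prior Memory items consulted
    request_payload: dict               # the principal's request, as received


@dataclass(frozen=True)
class Decision:
    verdict: str
    risk_score: float
    reasons: tuple[str, ...]


@dataclass(frozen=True)
class AuditRecord:
    intent: str                         # structured goal code, not free text
    evidence: Evidence
    decision: Decision
    confidence: float                   # self-reported, 0.0 to 1.0
    logprobs: Optional[tuple[float, ...]] = None  # only if the model exposes them
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # Serialise to a stable, queryable JSON document.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values.
record = AuditRecord(
    intent="loan_application.assess",
    evidence=Evidence(("kb-114",), ("mem-7",), {"applicant_id": "a-42"}),
    decision=Decision("approve", 0.34, ("income verified",)),
    confidence=0.82,
)
print(record.to_json())
```

Keeping intent, decision, and confidence as typed fields rather than prose is what makes the record filterable later.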

This four-tuple is what an auditor can query. "Show me every decision in Q2 2026 where confidence was <0.7 but the verdict was 'approve'" — answerable in seconds against a four-tuple log. Unanswerable against a transcript.
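
To make the auditor's question concrete, here is a sketch of that query against a toy four-tuple table. It uses an in-memory SQLite database purely for illustration; the table layout, column names, and sample rows are assumptions, and a real store would look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE decisions (
           id INTEGER PRIMARY KEY,
           decided_at TEXT NOT NULL,   -- ISO 8601 UTC timestamp
           intent TEXT NOT NULL,
           verdict TEXT NOT NULL,
           confidence REAL NOT NULL
       )"""
)

# Hypothetical sample rows.
rows = [
    ("2026-04-03T09:15:00Z", "loan.assess", "approve", 0.62),
    ("2026-05-20T14:02:00Z", "loan.assess", "approve", 0.91),
    ("2026-06-11T08:40:00Z", "loan.assess", "reject", 0.55),
    ("2026-07-01T00:00:00Z", "loan.assess", "approve", 0.40),
]
conn.executemany(
    "INSERT INTO decisions (decided_at, intent, verdict, confidence) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# Every Q2 2026 decision approved with confidence below 0.7.
low_confidence_approvals = conn.execute(
    """SELECT id, decided_at, confidence
       FROM decisions
       WHERE decided_at >= '2026-04-01' AND decided_at < '2026-07-01'
         AND verdict = 'approve' AND confidence < 0.7
       ORDER BY decided_at"""
).fetchall()
print(low_confidence_approvals)
```

The same question asked of a transcript would require parsing free text for both the verdict and the confidence.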

Why this satisfies the regimes

The mapping to specific clauses:

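  • NIST AI RMF, MEASURE function (TEVV: test, evaluation, verification, validation): the MEASURE 2 subcategories call for AI systems to be evaluated and the results documented, which requires structured outcomes. The four-tuple provides them.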
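  • EU AI Act Art. 12 (record-keeping): high-risk systems must technically allow automatic recording of events (logs) over the system's lifetime, to a level of traceability appropriate to its intended purpose. A transcript is not a traceable decision; a four-tuple is.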
  • SOC 2 Processing Integrity (PI1 criteria): the category requires that system processing is complete, valid, accurate, timely, and authorized. The four-tuple is the record an attestor inspects to test that.
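  • ISO/IEC 42001 Clause 9 (performance evaluation): monitoring, measurement, internal audit, and management review all need structured records to evaluate, and the four-tuple supplies them.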

How the substrate emits it

In the AgentsBooks substrate (Pillar P1), the four-tuple is emitted as a side effect of normal operation. Each Heart task wraps the LLM call in an audit_decorator that captures:

  • Intent: from the task definition's goal field.
  • Evidence: from Memory's retrieval log + the inbound A2A/event payload.
  • Decision: from the agent's typed output schema (defined in the task).
  • Confidence: from the model's logprobs (when available) plus a self-reported confidence field in the output schema.
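
The capture pattern described above can be sketched as a plain Python decorator. The name audit_decorator comes from the text, but the signature, the in-memory AUDIT_LOG list, and the assess_refund task are all assumptions made for illustration; this is not the AgentsBooks implementation.

```python
import functools
from typing import Any, Callable

# Stand-in for the real audit store (hypothetical).
AUDIT_LOG: list[dict[str, Any]] = []


def audit_decorator(goal: str) -> Callable:
    """Wrap a task function and record one four-tuple per call."""
    def wrap(task: Callable[..., dict]) -> Callable[..., dict]:
        @functools.wraps(task)
        def run(payload: dict, retrieved_ids: list[str]) -> dict:
            output = task(payload, retrieved_ids)
            AUDIT_LOG.append({
                # Intent: taken from the task definition's goal.
                "intent": goal,
                # Evidence: inbound payload plus what retrieval returned.
                "evidence": {
                    "payload": payload,
                    "retrieved_ids": list(retrieved_ids),
                },
                # Decision: the typed output, minus the confidence field.
                "decision": {
                    k: v for k, v in output.items() if k != "confidence"
                },
                # Confidence: self-reported field from the output schema.
                "confidence": output.get("confidence"),
            })
            return output
        return run
    return wrap


@audit_decorator(goal="refund.assess")
def assess_refund(payload: dict, retrieved_ids: list[str]) -> dict:
    # Placeholder for the LLM call; returns a typed decision object.
    return {
        "verdict": "approve",
        "risk_score": 0.34,
        "reasons": ["within policy"],
        "confidence": 0.82,
    }


assess_refund({"order_id": "o-1"}, ["kb-114"])
print(AUDIT_LOG[0])
```

Because the record is written by the wrapper rather than by the task body, a task author cannot forget to log a decision.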

The four-tuple lands in Episodic memory (Pillar P8) and is exposed to the operator dashboard via a structured query.

What to leave OUT of the audit log

Three things commonly bloat audit logs without adding compliance value:

  1. Full chain-of-thought. Reasoning traces are useful for debugging, not for auditing. Keep them in a separate diagnostic store with shorter retention.
  2. Raw model API metadata (response IDs, region, etc.) beyond what's needed for cost reconciliation.
  3. Repeated cacheable context. Store a hash of the prompt plus the cache flags; don't store the full 50K-token prompt for every call.
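
For the third item, one way to record a prompt without storing it is a content hash. The sketch below uses SHA-256 from the standard library; the function name, the returned field names, and the cache_flags shape are assumptions for illustration.

```python
import hashlib


def prompt_fingerprint(prompt: str, cache_flags: dict) -> dict:
    """Return a compact audit entry that identifies a prompt without storing it."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return {
        "prompt_sha256": digest,      # 64 hex characters, whatever the prompt size
        "prompt_chars": len(prompt),  # rough size, useful for cost reconciliation
        "cache_flags": cache_flags,   # hypothetical shape, e.g. {"cache_hit": True}
    }


# Hypothetical prompt text.
prompt = "system: you are a refund assessor. " * 1000
entry = prompt_fingerprint(prompt, {"cache_hit": True})
print(entry["prompt_sha256"], entry["prompt_chars"])
```

A hash only proves which prompt was used. If the prompt text itself must be reproducible later, it would need to be kept once elsewhere, keyed by the hash.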

The audit log should be queryable in <500ms for any rolling 90-day window. If it's slower, you've stored too much.

FAQ

Q: How long should the audit trail be retained?
A: Regulator-dependent. The EU AI Act requires logs from high-risk systems to be kept for at least six months unless other applicable law says otherwise (Arts. 19 and 26); some financial regimes require 7 years. The substrate supports tiered retention (hot/warm/cold) so longer windows don't blow up query cost.
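
As an illustration of tiering, a record's age can decide which tier holds it. The thresholds below are assumptions chosen for the example (90 days matches the rolling query window mentioned earlier); they are not AgentsBooks defaults.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tier boundaries.
HOT_WINDOW = timedelta(days=90)    # fast store, interactive queries
WARM_WINDOW = timedelta(days=365)  # cheaper store, slower queries


def retention_tier(recorded_at: datetime, now: datetime) -> str:
    """Pick a storage tier from the age of an audit record."""
    age = now - recorded_at
    if age <= HOT_WINDOW:
        return "hot"
    if age <= WARM_WINDOW:
        return "warm"
    return "cold"  # archival store, kept until the regulatory period ends


now = datetime(2026, 7, 1, tzinfo=timezone.utc)
tiers = [
    retention_tier(now - timedelta(days=d), now) for d in (30, 200, 400)
]
print(tiers)
```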

Q: Can the auditor query the trail directly?
A: With a tenant-scoped read-only token, yes. AgentsBooks exposes an /audit/decisions query endpoint that takes a filter spec and returns the four-tuples. Most attestation engagements run from this directly.

Q: How does this relate to the rest of the compliance pillar?
A: This spoke is the evidence layer. The P4 pillar is the control layer (NIST + EU + SOC2 + ISO mapping). The substrate emits both.

