LLM Audit Trail: Provenance, Replayability, Evidence Bundles
Auditable LLM pipelines: provenance IDs, replay controls, evidence bundles, and Layered-CoT verdicts tied to spans for defensible outputs.
For any answer you ship, you must be able to prove four things: where it came from, how to replay it, what evidence supports it, and how it stays correct over time. This page shows the minimum audit schema and how to wire it into sandbox/promote and the Layered-CoT validation from Multi-Agent-COT-Prompting.
What to log (stage-level, not just prompts)
- run_id (pipeline), trace_id (end-to-end), span_id + parent_span_id (graph)
- stage (semantic step), agent_role, status
- timestamps, model/provider, token counts, retries
- inputs/outputs (URIs), retrieval set IDs, validation verdicts
- hashes/signatures for evidence bundles
Code: one span record
# run_id / trace_id come from the pipeline run context
span = {
    "run_id": run_id,
    "trace_id": trace_id,
    "span_id": "validate-rag-1",
    "parent_span_id": "answer-1",
    "stage": "layered_cot_validation",
    "agent_role": "governance_auditor",
    "model": "gpt-4o-mini",
    "status": "pass",
    "latency_ms": 842,
    "retries": 1,
    "retrieval_ids": ["doc:boj:2024-11"],
    "inputs_uri": "s3://evidence/runs/.../inputs.json",
    "outputs_uri": "s3://evidence/runs/.../answer.json",
    "verdict": {"ok": True, "reason": "facts match sources"},
}
write_audit(span)  # persist the record; one possible implementation below
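A minimal sketch of write_audit, assuming an append-only JSONL store; the file path and hashing scheme here are illustrative, not part of the schema. Hashing the canonical JSON makes each record tamper-evident, which is what the evidence-bundle signatures build on.

import hashlib
import json
import pathlib

AUDIT_LOG = pathlib.Path("audit/spans.jsonl")  # illustrative location

def write_audit(span: dict) -> str:
    """Append one span record to an append-only JSONL log.

    A SHA-256 over the canonical JSON form makes the record
    tamper-evident; the digest is stored beside the span and can
    be re-verified at review time."""
    body = json.dumps(span, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"span": span, "sha256": digest}) + "\n")
    return digest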
Reference flow
- Ingest and normalize (text/tables/images).
- Retrieve with provenance IDs.
- Sandbox generation (explore) → Layered-CoT validation → promote.
- Export evidence bundle: sources, prompts, outputs, hashes, and validation verdicts.
- Keep replay knobs: model, temperature, prompt version, retrieval snapshot. (Both of the last two steps are sketched below.)
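A sketch of the bundle-export and replay-knob steps, assuming the span records above are already in memory; the bundle layout, field names, and ReplayConfig class are illustrative choices, not a fixed format. The manifest of hashes lets a reviewer verify the bundle was not altered after export.

import hashlib
import json
import pathlib
from dataclasses import dataclass, asdict

@dataclass
class ReplayConfig:
    """The knobs you must pin to reproduce a run."""
    model: str
    temperature: float
    prompt_version: str
    retrieval_snapshot: str  # e.g. an index snapshot ID

def export_evidence_bundle(run_id: str, spans: list[dict],
                           replay: ReplayConfig,
                           out_dir: str = "evidence") -> pathlib.Path:
    """Write spans and replay knobs into one folder per run, plus a
    manifest of SHA-256 hashes so reviewers can check integrity."""
    bundle = pathlib.Path(out_dir) / run_id
    bundle.mkdir(parents=True, exist_ok=True)
    files = {
        "spans.json": json.dumps(spans, indent=2),
        "replay.json": json.dumps(asdict(replay), indent=2),
    }
    manifest = {}
    for name, body in files.items():
        (bundle / name).write_text(body, encoding="utf-8")
        manifest[name] = hashlib.sha256(body.encode("utf-8")).hexdigest()
    (bundle / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8")
    return bundle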
Why this matters
- “It answered, but we can’t explain it” → check retrieval_ids + spans.
- “Regression this week” → diff spans, prompts, and model versions (see the diff sketch after this list).
- “Show reviewers sources” → evidence bundle linked to span outputs.
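A sketch of the regression scenario, assuming the spans of two runs have been loaded from the audit log (how you load them depends on your store). Matching spans on span_id and comparing a few key fields usually localizes a regression to one stage.

def diff_spans(old: list[dict], new: list[dict]) -> list[str]:
    """Compare two runs span-by-span and report what changed."""
    old_by_id = {s["span_id"]: s for s in old}
    changes = []
    for span in new:
        prev = old_by_id.get(span["span_id"])
        if prev is None:
            changes.append(f"{span['span_id']}: new span")
            continue
        # Fields whose drift most often explains a regression.
        for key in ("model", "status", "verdict", "retrieval_ids"):
            if prev.get(key) != span.get(key):
                changes.append(
                    f"{span['span_id']}: {key} "
                    f"{prev.get(key)!r} -> {span.get(key)!r}")
    return changes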
Download the starter schema (JSON): /tools/llm-audit-log-schema/
Related pages
- Observability: /ai-observability/
- Governance: /ai-governance/
- RAG provenance: /rag-provenance-citations/