BasicAgent
LLM Observability for Agent Workflows
What to log, measure, and alert on for long-running agent pipelines (cost, latency, retries, tool calls, retrieval, quality).
Observability is how you keep agent pipelines stable under real load. It surfaces problems such as:
- budget spikes
- rate limits
- slow tools
- retrieval drift
- silent quality regressions
The three layers to observe
- Transport (HTTP): request latency, timeouts, retry counts, 429/5xx rates
- Workflow (stages): per-stage duration, pass/fail, fallback activation, queue depth
- Quality: evaluation gate pass rate, drift indicators, hallucination indicators
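As a rough illustration of the transport layer, the sketch below aggregates one time window of HTTP attempts into the signals listed above. The `RequestRecord` shape, the use of status `0` to mean a timeout, and the nearest-rank percentile are all assumptions made for this example, not part of any particular client library.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One HTTP call to a provider or tool (hypothetical shape)."""
    status: int        # HTTP status code; 0 is used here to mean "timed out"
    latency_ms: float  # wall-clock time for the call
    retries: int       # retries made before this final outcome


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def transport_summary(records):
    """Aggregate transport-layer signals for one time window."""
    total = len(records)
    if total == 0:
        return {}
    return {
        "requests": total,
        "p95_latency_ms": percentile([r.latency_ms for r in records], 95),
        "timeout_rate": sum(r.status == 0 for r in records) / total,
        "rate_limited_rate": sum(r.status == 429 for r in records) / total,
        "server_error_rate": sum(500 <= r.status <= 599 for r in records) / total,
        "retries_total": sum(r.retries for r in records),
    }


# Made-up sample window, for illustration only.
window = [
    RequestRecord(status=200, latency_ms=820.0, retries=0),
    RequestRecord(status=429, latency_ms=95.0, retries=2),
    RequestRecord(status=200, latency_ms=1430.0, retries=1),
    RequestRecord(status=503, latency_ms=310.0, retries=3),
]
summary = transport_summary(window)
```

Keeping 429s separate from 5xx errors is deliberate in this sketch: rate limiting usually calls for backing off, while server errors may call for a fallback.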
What to log (practical)
For every stage/span:
- identifiers: run_id, span_id, stage, version
- timing: start/end, provider latency
- cost: tokens, unit cost, cumulative budget
- retries: attempts, backoff, retry-after
- tool calls: tool name, duration, failure
- retrieval: corpus id/version, top_k, doc ids, scores
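The fields above could be captured in a single per-span record, sketched here as a Python dataclass serialized to one JSON line. Only `run_id`, `span_id`, `stage`, `version`, and `top_k` are names taken from the list. Every other field name, the units, and the derived `duration_s` and `cost` values are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    # identifiers
    run_id: str
    span_id: str
    stage: str
    version: str
    # timing: start/end in epoch seconds, plus provider-reported latency
    start: float
    end: float
    provider_latency_ms: float
    # cost (assumed units: cost per token, running total for the run)
    tokens: int
    unit_cost: float
    cumulative_budget: float
    # retries
    attempts: int = 1
    backoff_s: float = 0.0
    retry_after_s: Optional[float] = None
    # tool calls: one dict per call (name, duration, failure flag)
    tool_calls: list = field(default_factory=list)
    # retrieval: corpus id/version, top_k, doc ids, scores
    retrieval: Optional[dict] = None

    def to_json(self) -> str:
        """Serialize to one JSON line, adding derived duration and cost."""
        record = asdict(self)
        record["duration_s"] = round(self.end - self.start, 3)
        record["cost"] = self.tokens * self.unit_cost
        return json.dumps(record, sort_keys=True)


# Made-up values, for illustration only.
span = SpanRecord(
    run_id="run-001", span_id="span-003", stage="retrieve", version="v1",
    start=100.0, end=101.25, provider_latency_ms=900.0,
    tokens=1200, unit_cost=0.000002, cumulative_budget=0.0131,
    attempts=2, backoff_s=0.5, retry_after_s=1.0,
    tool_calls=[{"name": "search", "duration_ms": 310.0, "failed": False}],
    retrieval={"corpus_id": "docs", "corpus_version": "v1", "top_k": 3,
               "doc_ids": ["d1", "d7", "d9"], "scores": [0.91, 0.84, 0.80]},
)
line = span.to_json()
```

One JSON object per span keeps the records easy to append to a log and to aggregate later by `run_id` or `stage`.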
How this connects to audit trails
An audit log is the raw record. Observability is the derived view:
- dashboards
- alerts
- anomaly detection
Start with the audit schema: /tools/llm-audit-log-schema/
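To make the raw-record versus derived-view split concrete, here is a minimal sketch that turns a list of audit records into two derived signals: runs that exceeded a budget, and the evaluation gate pass rate. The record keys `cost` and `gate_passed` and both function names are hypothetical. They are not taken from the audit schema linked above.

```python
from collections import defaultdict


def budget_alerts(audit_records, budget_per_run):
    """Return the run_ids whose summed cost exceeds budget_per_run."""
    spent = defaultdict(float)
    for rec in audit_records:
        spent[rec["run_id"]] += rec["cost"]
    return sorted(run for run, total in spent.items() if total > budget_per_run)


def gate_pass_rate(audit_records):
    """Share of graded records that passed the evaluation gate, or None."""
    graded = [rec for rec in audit_records if "gate_passed" in rec]
    if not graded:
        return None
    return sum(rec["gate_passed"] for rec in graded) / len(graded)


# Made-up audit records, for illustration only.
records = [
    {"run_id": "a", "cost": 0.40, "gate_passed": True},
    {"run_id": "a", "cost": 0.75, "gate_passed": False},
    {"run_id": "b", "cost": 0.20, "gate_passed": True},
    {"run_id": "b", "cost": 0.10},  # stage with no evaluation gate
]
over_budget = budget_alerts(records, budget_per_run=1.00)
pass_rate = gate_pass_rate(records)
```

Because both functions read only the stored records, thresholds can be changed and the views recomputed without touching the pipeline itself.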