BasicAgent

LLM Observability for Agent Workflows

What to log, measure, and alert on for long-running agent pipelines: cost, latency, retries, tool calls, retrieval, and quality.

Observability is how you keep agent pipelines stable under real load. It surfaces the problems that otherwise go unnoticed until they are expensive:

  • budget spikes
  • rate limits
  • slow tools
  • retrieval drift
  • silent quality regressions

The three layers to observe

  1. Transport (HTTP)
    • request latency, timeouts, retry counts, 429/5xx rates
  2. Workflow (stages)
    • per-stage duration, pass/fail, fallback activation, queue depth
  3. Quality
    • evaluation gate pass rate, drift indicators, hallucination indicators
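
As an illustration of the transport layer, the sketch below aggregates a window of HTTP calls into the signals listed above (429 rate, 5xx rate, retry count, latency). The `HttpCall` type, the `transport_summary` function, and the choice of a nearest-rank p95 are assumptions made for this example, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class HttpCall:
    """One provider request (hypothetical shape for this sketch)."""
    status: int        # final HTTP status code
    latency_ms: float  # wall-clock latency of the final attempt
    retries: int       # retries performed before the final attempt


def transport_summary(calls: list[HttpCall]) -> dict[str, float]:
    """Aggregate transport-layer signals over a window of calls."""
    total = len(calls)
    if total == 0:
        return {"rate_429": 0.0, "rate_5xx": 0.0,
                "avg_retries": 0.0, "p95_latency_ms": 0.0}

    rate_429 = sum(c.status == 429 for c in calls) / total
    rate_5xx = sum(500 <= c.status <= 599 for c in calls) / total
    avg_retries = sum(c.retries for c in calls) / total

    # Nearest-rank p95, computed with integer arithmetic to avoid
    # floating-point surprises: rank = ceil(0.95 * total).
    latencies = sorted(c.latency_ms for c in calls)
    rank = (95 * total + 99) // 100
    p95 = latencies[rank - 1]

    return {"rate_429": rate_429, "rate_5xx": rate_5xx,
            "avg_retries": avg_retries, "p95_latency_ms": p95}
```

The same pattern applies to the workflow and quality layers: collect one record per stage or per evaluation gate, then reduce a window of records to a handful of rates and percentiles.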

What to log (practical)

For every stage/span:

  • identifiers: run_id, span_id, stage, version
  • timing: start/end, provider latency
  • cost: tokens, unit cost, cumulative budget
  • retries: attempts, backoff, retry-after
  • tool calls: tool name, duration, failure
  • retrieval: corpus id/version, top_k, doc ids, scores
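
One way to make this checklist concrete is a single record type per span, serialized as one JSON line. This is a minimal sketch: the `SpanRecord` class, its field names, and the per-1k-token cost model are assumptions chosen to mirror the bullets above, and the real audit schema may name things differently.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SpanRecord:
    """One log record per stage/span (hypothetical field names)."""
    # identifiers
    run_id: str
    span_id: str
    stage: str
    version: str
    # timing (epoch seconds, plus provider-reported latency)
    start_ts: float
    end_ts: float
    provider_latency_ms: float
    # cost
    tokens_in: int
    tokens_out: int
    unit_cost_per_1k_tokens: float
    cumulative_budget_usd: float
    # retries
    attempts: int = 1
    backoff_ms: list[float] = field(default_factory=list)
    retry_after_s: Optional[float] = None
    # tool calls: e.g. {"tool": ..., "duration_ms": ..., "failed": ...}
    tool_calls: list[dict] = field(default_factory=list)
    # retrieval: e.g. {"corpus_id": ..., "top_k": ..., "doc_ids": ..., "scores": ...}
    retrieval: Optional[dict] = None

    def duration_s(self) -> float:
        """Wall-clock duration of the span."""
        return self.end_ts - self.start_ts

    def cost_usd(self) -> float:
        """Cost of this span under a flat per-1k-token price (an assumption)."""
        return (self.tokens_in + self.tokens_out) / 1000 * self.unit_cost_per_1k_tokens

    def to_json_line(self) -> str:
        """Serialize to a single JSON line suitable for append-only logs."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping every field on one flat record means a dashboard or alert can be computed from the log alone, without joining against other stores.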

How this connects to audit trails

An audit log is the raw record. Observability is the derived view:

  • dashboards
  • alerts
  • anomaly detection
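
To show what "derived view" can mean in practice, the sketch below turns raw audit records into alert messages. The record keys (`run_id`, `cost_usd`, `gate_passed`), the `derive_alerts` function, and both thresholds are hypothetical; substitute the fields your audit schema actually defines.

```python
from collections import defaultdict


def derive_alerts(records: list[dict], budget_usd: float,
                  min_gate_pass_rate: float) -> list[str]:
    """Derive alert messages from raw audit records (assumed record shape)."""
    cost_by_run: dict[str, float] = defaultdict(float)
    gate_results: list[bool] = []

    for record in records:
        cost_by_run[record["run_id"]] += record.get("cost_usd", 0.0)
        # Only records that went through an evaluation gate carry this key.
        if "gate_passed" in record:
            gate_results.append(bool(record["gate_passed"]))

    alerts: list[str] = []

    # Budget alert: one message per run whose total cost exceeds the budget.
    for run_id, cost in sorted(cost_by_run.items()):
        if cost > budget_usd:
            alerts.append(f"budget_exceeded run_id={run_id} cost_usd={cost:.2f}")

    # Quality alert: pass rate across all gated records in the window.
    if gate_results:
        pass_rate = sum(gate_results) / len(gate_results)
        if pass_rate < min_gate_pass_rate:
            alerts.append(f"gate_pass_rate_low rate={pass_rate:.2f}")

    return alerts
```

Because the function reads only the audit records, the same log can be replayed later with different thresholds to tune alerts without re-running the pipeline.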

Start with the audit schema: /tools/llm-audit-log-schema/
