LLM Workflow Reliability: Concurrency, Timeouts, Retries, Traceability
Reliability playbook for LLM and agent pipelines—bounded concurrency, hard timeouts, jittered retries, sandbox-to-promote gating, and traceable spans grounded in Multi-Agent-COT-Prompting.
Long-running agent pipelines (30–180 minutes) need hard controls, not hope. Reliability is a systems problem: concurrency caps, deterministic timeouts, retries with jitter, streaming-safe parsing, and traceable runs with run IDs and Layered-CoT checkpoints.
Reliability checklist
- Bounded concurrency: cap in-flight calls to avoid 429 storms.
- Hard timeouts: connect, read, and total deadlines.
- Retry + jitter: only for transient errors (429/5xx/network); see the sketch after this list.
- Streaming-safe: parse SSE frames (data: {...} chunks and the data: [DONE] sentinel).
- Sandbox → promote: run experiments in a sandbox, promote only validated outputs.
- Audit + traces: keep run_id + trace_id per span.
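The first three items can be sketched in plain asyncio before any client wrapper gets involved. This is a minimal sketch, not the project's implementation; call_model is a hypothetical coroutine standing in for one chat-completion request, and TransientError is an assumed exception it raises on 429/5xx/network failures.

import asyncio, random

MAX_IN_FLIGHT = 12        # bounded concurrency: cap in-flight calls
TOTAL_DEADLINE_S = 180    # hard deadline per attempt, in seconds
RETRYABLE = {429, 500, 502, 503, 504}

_sem = asyncio.Semaphore(MAX_IN_FLIGHT)


class TransientError(Exception):
    """Assumed to be raised by call_model on 429/5xx/network failures."""
    def __init__(self, status=None):
        self.status = status


async def call_model(prompt: str) -> str:
    # Hypothetical single chat-completion request; swap in your client call.
    raise NotImplementedError


async def reliable_call(prompt: str, max_retries: int = 3) -> str:
    async with _sem:  # bounded concurrency
        for attempt in range(max_retries + 1):
            try:
                # hard total deadline on each attempt
                return await asyncio.wait_for(call_model(prompt), TOTAL_DEADLINE_S)
            except (TransientError, asyncio.TimeoutError) as exc:
                status = getattr(exc, "status", None)
                retryable = status is None or status in RETRYABLE
                if attempt == max_retries or not retryable:
                    raise
                # exponential backoff with full jitter
                await asyncio.sleep(random.uniform(0, 2 ** attempt))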
Code: reliable client call with gates
import asyncio, os

from nohang_client import NoHangClient


async def run_once(prompt: str) -> str:
    async with NoHangClient(
        base_url="https://api.openai.com/v1",
        api_key=os.environ["OPENAI_API_KEY"],
        default_model="gpt-4o-mini",
        max_concurrent=12,   # bounded concurrency
        timeout_total=180,   # hard total deadline (seconds)
        max_retries=3,       # jittered retries for transient errors
    ) as client:
        text = await client.chat(
            [{"role": "user", "content": prompt}],
            stream=True,
            max_tokens=256,
        )
        return text


# Sandbox → promote guardrail
async def guarded(prompt: str) -> str:
    draft = await run_once(prompt)   # sandbox generation
    verdict = validate(draft)        # your Layered-CoT or eval here
    return draft if verdict.ok else fallback_response()


asyncio.run(guarded("Summarize today’s incidents in 3 bullets."))
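If you consume the stream yourself instead of letting the client accumulate it, the streaming-safe rule from the checklist reduces to a small SSE parser. This is a sketch that assumes OpenAI-style chat chunks carrying text in choices[0].delta.content; adapt it to the exact chunk shape your provider returns.

import json


def parse_sse_line(line: str):
    """Return the text delta carried by one SSE line, or None if there is none."""
    line = line.strip()
    if not line.startswith("data: "):
        return None                       # comments / blank keep-alive lines
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None                       # end-of-stream sentinel
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None                       # e.g. trailing usage-only chunk
    # OpenAI-style chat chunks carry text in choices[0].delta.content
    return choices[0].get("delta", {}).get("content")


async def collect_stream(lines) -> str:
    """Accumulate deltas from an async iterable of raw SSE lines."""
    parts = []
    async for raw in lines:
        delta = parse_sse_line(raw)
        if delta:
            parts.append(delta)
    return "".join(parts)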
How it maps to the COT repo
- Layered-CoT: use layered validation on sandbox outputs before promotion.
- Orchestrator: route intents to the right agent and attach risk classes (see simple-orchestrator/orchestrator.py and the sketch below).
- Sandbox-promote pattern: separate creative generation from validated delivery.
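One possible shape for that routing is shown below; simple-orchestrator/orchestrator.py is the reference, and every intent, agent, and risk-class name here is made up for illustration.

from dataclasses import dataclass

# Hypothetical intent → (agent, risk class) table; replace with your own.
ROUTES = {
    "summarize_incidents": ("reporting_agent", "low"),
    "rewrite_customer_email": ("comms_agent", "medium"),
    "execute_remediation": ("ops_agent", "high"),   # high risk stays in sandbox
}


@dataclass
class RoutedTask:
    intent: str
    agent: str
    risk_class: str
    run_id: str


def route(intent: str, run_id: str) -> RoutedTask:
    # Unknown intents fall back to the most conservative risk class.
    agent, risk = ROUTES.get(intent, ("fallback_agent", "high"))
    return RoutedTask(intent=intent, agent=agent, risk_class=risk, run_id=run_id)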
Observability hooks
- Emit per-call metrics: latency_ms, retries, timeout_count.
- Trace spans: role, prompt template version, model, token counts.
- Log verdicts from validation (pass/fail + reason); see /ai-observability/ for schema ideas and the sketch below.
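A minimal emitter for those fields, assuming structured JSON log lines; every field name here is illustrative and should be aligned with whatever schema you adopt from /ai-observability/.

import json, logging, time, uuid

log = logging.getLogger("llm.spans")


def new_run_id() -> str:
    return uuid.uuid4().hex


def emit_span(run_id: str, *, role: str, template_version: str, model: str,
              latency_ms: float, retries: int, timeout_count: int,
              prompt_tokens: int, completion_tokens: int,
              verdict: str, verdict_reason: str = "") -> None:
    """Emit one per-call span as a structured JSON log line (field names are illustrative)."""
    log.info(json.dumps({
        "run_id": run_id,
        "trace_id": uuid.uuid4().hex,   # one trace_id per span
        "ts": time.time(),
        "role": role,
        "template_version": template_version,
        "model": model,
        "latency_ms": latency_ms,
        "retries": retries,
        "timeout_count": timeout_count,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "verdict": verdict,             # pass/fail from validation
        "verdict_reason": verdict_reason,
    }))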
Related pages
- Audit trail: /llm-audit-trail-agent-pipelines/
- Observability: /ai-observability/
- Policy and controls: /ai-governance/
- Prompt safety: /system-prompting/