How to Stop LLM HTTP Connection Hangs: Reuse, Sharding, and Backoff That Work in Production
High-intent operations guide: prevent inference stalls by combining HTTP connection reuse with sharded workers, bounded concurrency, and retry backoff.
Published: 2026-01-01 · Last updated: 2026-02-10
Estimated read time: 5 min
Connection reuse is necessary but not sufficient. We saw meaningful stability gains only when reuse was paired with sharding and bounded per-worker concurrency, so one overloaded request loop could not stall the whole run.
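Connection reuse here means holding one persistent client socket across calls instead of paying a fresh TCP (and TLS) handshake per request. A minimal stdlib sketch of the pattern, with a throwaway local server standing in for the provider endpoint (the handler and addresses are illustrative, not part of the pipeline above):

```python
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 + Content-Length enables keep-alive
    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# One persistent HTTPConnection: both requests ride the same TCP socket
# instead of reconnecting per call.
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
bodies = []
for _ in range(2):
    conn.request("GET", "/")
    bodies.append(conn.getresponse().read())
conn.close()
server.shutdown()
```

In a real pipeline the equivalent is a single long-lived async client shared by all calls in a worker, rather than a new client per request.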
A single-worker architecture can reuse sockets and still hang under burst load if admission pressure, retries, and stream handling all contend in one hot path. Sharding reduces this blast radius: each worker owns a smaller queue and a smaller socket pool, and failures are isolated instead of cascading.
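The shard-per-worker shape can be sketched with plain asyncio: a stable hash routes each request key to one worker, and each worker enforces its own in-flight cap. Shard count, concurrency, and the request keys below are illustrative, and `asyncio.sleep(0)` stands in for the real LLM call:

```python
import asyncio
import hashlib

SHARD_COUNT = 4            # assumed value; mirrors SHARD_COUNT in the article
PER_SHARD_CONCURRENCY = 2  # bounded in-flight requests per worker

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands on the same worker.
    return hashlib.sha256(key.encode()).digest()[0] % SHARD_COUNT

async def worker(shard_id: int, queue: asyncio.Queue, results: list) -> None:
    sem = asyncio.Semaphore(PER_SHARD_CONCURRENCY)

    async def handle(item: str) -> None:
        async with sem:                 # cap in-flight work for this shard only
            await asyncio.sleep(0)      # placeholder for the real LLM call
            results.append((shard_id, item))

    tasks = []
    while True:
        item = await queue.get()
        if item is None:                # sentinel: stop accepting, drain, exit
            break
        tasks.append(asyncio.create_task(handle(item)))
    await asyncio.gather(*tasks)

async def main() -> list:
    queues = [asyncio.Queue() for _ in range(SHARD_COUNT)]
    results: list = []
    workers = [asyncio.create_task(worker(i, q, results))
               for i, q in enumerate(queues)]
    for key in ("req-1", "req-2", "req-3", "req-4", "req-5"):
        await queues[shard_for(key)].put(key)
    for q in queues:
        await q.put(None)
    await asyncio.gather(*workers)
    return results

results = asyncio.run(main())
```

Because each worker owns its queue and its semaphore, a stall in one shard leaves the other shards' admission loops untouched.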
Verified production-style context
- run_id: newgen_full_20260201_030151_sharded
- shard_count: 8
- proof_minute_utc: 2026-02-01T03:44:00Z
- peak_aggregate_tpm_total: 738652
- peak_completion_tpm_out: 364704
- peak_prompt_tpm_in: 373948
- events_sha256: 50406ae6e24fcf9a2287c61cf211d5c7333c9396d63dc18130816be59cdf1a4f
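As a quick sanity check on the figures above, the aggregate peak is simply the sum of the prompt and completion peaks:

```python
peak_prompt_tpm_in = 373948
peak_completion_tpm_out = 364704
peak_aggregate_tpm_total = peak_prompt_tpm_in + peak_completion_tpm_out  # 738652
```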
The important signal is not only peak TPM. It is that admission remained controllable while retries used bounded backoff, which reduced hang-prone behavior.
Implementation pattern
# newgen/run.py — per-worker concurrency and shard count from CLI flags, with env fallbacks
llm_concurrency = int(args.llm_concurrency or os.environ.get("LLM_CONCURRENCY") or 16)
shard_count = int(args.shards or args.shard_count or os.environ.get("SHARD_COUNT") or 1)
# newgen/llm_client.py — bounded in-flight requests plus doubling backoff capped at 30 s
self.sem = asyncio.Semaphore(int(concurrency))
backoff = float(os.environ.get("LLM_RETRY_BACKOFF_S", "1.0") or "1.0")
...
await asyncio.sleep(backoff)
backoff = min(backoff * 2.0, 30.0)
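Expanded into a self-contained sketch, the retry path looks like the following. The helper names (`backoff_schedule`, `call_with_retries`) and the flaky stub are illustrative, not part of the pipeline; only the doubling-with-cap schedule mirrors the snippet above:

```python
import asyncio

def backoff_schedule(initial: float, attempts: int, cap: float = 30.0) -> list:
    """Doubling delays capped at `cap`, matching the backoff lines above."""
    delays, backoff = [], initial
    for _ in range(attempts):
        delays.append(backoff)
        backoff = min(backoff * 2.0, cap)
    return delays

async def call_with_retries(fn, schedule):
    """Retry `fn` over a fixed delay schedule; re-raise once it is exhausted."""
    last = None
    for delay in schedule:
        try:
            return await fn()
        except ConnectionError as exc:
            last = exc
            await asyncio.sleep(delay)
    raise last

# Demo: a stub that fails twice before succeeding (tiny delays for the demo).
attempts = {"n": 0}

async def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = asyncio.run(call_with_retries(flaky, backoff_schedule(0.001, 5)))
```

The fixed schedule makes the retry budget explicit: the worker can never sleep longer than the cap, and exhausting the schedule fails fast instead of retrying forever.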
What to verify before claiming improvement
- Aggregate throughput across all shards, not one worker.
- TTFT tail behavior (p95/p99), not only averages.
- Retry rate and retry causes before/after.
- Success rate under sustained backlog.
If connections are reused but retries climb and TTFT tails worsen, your pipeline is still under connection pressure.
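Tail checks like these are cheap to compute from raw samples. A minimal nearest-rank percentile over TTFT measurements (the sample values below are made up for illustration):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) over a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

ttft_ms = [120, 135, 150, 180, 220, 260, 400, 950, 1800, 2400]  # illustrative
p95 = percentile(ttft_ms, 95)
p99 = percentile(ttft_ms, 99)
```

Comparing p95/p99 before and after a change catches the case where the mean improves while a few requests still hang near a timeout.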
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: Bounded concurrency limits (in-flight caps)
Scope boundary
This note covers: connection-hang mitigation for inference pipelines using reuse + sharding + bounded concurrency + retry backoff.
This note excludes: provider-specific internals, transport tuning beyond application-level controls, and unreproducible benchmark claims.
Build narrative
Follow a coherent path from thesis to lab notes to proof-of-work instead of isolated pages.