HTTP Connection Reuse for LLM Inference

How HTTP connection reuse affects LLM inference throughput and time to first token, plus verification steps and failure modes.

Published: 2026-01-01 · Last updated: 2026-01-01

HTTP connection reuse reduces avoidable connection setup overhead (TCP/TLS handshakes, slow-start effects, and tail variance) and stabilizes time-to-first-token. It matters for throughput because concurrency controls only work when requests can be admitted and start streaming without repeatedly paying a high-variance setup cost.

Mechanism

Connection setup is not “free,” and it is not constant:

  • TCP establishment and TLS negotiation add latency and variability (packet loss, retransmits, and path changes show up as tail latency).
  • New connections often start with conservative congestion windows; short bursts can be slower than steady-state flows.
  • Load balancers and NATs impose idle timeouts; if you churn connections, you amplify these effects.

For streaming LLM inference (SSE), the control-plane interval that usually benefits is t_first_token:

  • t_first_token = first_token_at - admitted_at
  • When you reuse a warm connection, the portion of t_first_token attributable to connection setup (t_conn) should shrink; a measurement sketch follows this list.
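
A minimal sketch of that measurement, assuming the Python httpx library and an SSE endpoint at a placeholder path; the client is the long-lived one sketched under "correct reuse" below:

  import time

  import httpx

  def measure_ttft(client: httpx.Client, payload: dict) -> float:
      """Time from admission to the first SSE data line, in seconds."""
      admitted_at = time.monotonic()
      first_token_at = None
      with client.stream("POST", "/v1/completions", json=payload) as response:
          response.raise_for_status()
          for line in response.iter_lines():
              if first_token_at is None and line.startswith("data: "):
                  first_token_at = time.monotonic()
              # Keep draining the stream so the connection can return to the pool.
      return (first_token_at or time.monotonic()) - admitted_at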

“Correct reuse” is operational, not conceptual (a minimal client sketch follows this list):

  • Single long-lived client per process (or per worker), not instantiated per request.
  • Pooling enabled and sized: max idle connections, max total connections, and per-host limits.
  • Clean stream closure: when the SSE stream ends, the response body must be fully closed so the connection returns to the pool.
  • Stable host selection: if you spray requests across many hostnames or disable keep-alive at the proxy, reuse rate will be low regardless of client code.
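
A minimal sketch of those four points, assuming the Python httpx library and an OpenAI-style SSE endpoint; the hostname, path, and payload shape are placeholders, not any specific provider's API:

  import httpx

  # One client per process (or per worker): pooling only helps if the client
  # outlives individual requests.
  client = httpx.Client(
      base_url="https://inference.example.com",  # placeholder endpoint
      limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
      # Explicit timeouts: a generous read timeout for long streams, tighter
      # connect/write/pool timeouts.
      timeout=httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0),
  )

  def stream_completion(payload: dict) -> list[str]:
      chunks = []
      # The context manager closes the response body even on errors, so the
      # connection returns to the pool instead of being stranded.
      with client.stream("POST", "/v1/completions", json=payload) as response:
          response.raise_for_status()
          for line in response.iter_lines():
              if line.startswith("data: "):
                  data = line[len("data: "):]
                  if data == "[DONE]":
                      break
                  chunks.append(data)
      return chunks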

Configuration is part of correctness. Keep-alive idle timeouts, per-host pool limits, and connect/read/total timeouts must be aligned with intermediaries (LBs, NATs, corporate proxies). Misalignment typically shows up as reuse that “works in dev” but collapses under real load into reconnect churn and TTFT variance.
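
A sketch of that alignment, again assuming httpx; the load-balancer idle timeout is a placeholder you would read from your own infrastructure:

  import httpx

  LB_IDLE_TIMEOUT_S = 60.0  # hypothetical intermediary idle timeout (LB/NAT/proxy)

  limits = httpx.Limits(
      max_connections=100,
      max_keepalive_connections=20,
      # Expire idle sockets before the intermediary does, so the pool never
      # hands out a connection the other side has already torn down.
      keepalive_expiry=LB_IDLE_TIMEOUT_S * 0.8,
  )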

HTTP/2 is not a magic flag. It helps when:

  • you maintain a small number of long-lived connections
  • you actually have concurrent streams to multiplex

If you enable HTTP/2 but still create a new client/connection per request, you will see the costs of churn without the benefits of multiplexing.
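
A sketch of the combination that does help, assuming httpx with its optional HTTP/2 extra installed (pip install "httpx[http2]"); the endpoint and payloads are placeholders:

  import asyncio

  import httpx

  async def main() -> None:
      # One long-lived client; http2=True pays off only because the requests
      # below run concurrently and can be multiplexed over few connections.
      async with httpx.AsyncClient(
          base_url="https://inference.example.com",  # placeholder endpoint
          http2=True,
      ) as client:
          payloads = [{"prompt": f"request {i}"} for i in range(8)]
          responses = await asyncio.gather(
              *(client.post("/v1/completions", json=p) for p in payloads)
          )
          for r in responses:
              r.raise_for_status()

  asyncio.run(main())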

Common failure modes

  • Client instantiated per call: new session objects disable pooling by construction.
  • Leaked responses: failing to close the SSE stream keeps sockets occupied; the pool cannot reuse them and will open new ones under load.
  • Keep-alive defeated by proxies: some reverse proxies close upstream connections aggressively; the client thinks it is reusing but the upstream is not.
  • Idle timeouts and reconnect storms: long-running streams can be cut by intermediaries; if reconnect logic has no backoff or budget, the resulting storms surface as “random” TTFT spikes (a backoff sketch follows this list).
  • Mixed DNS and hostnames: different hostnames (or frequently changing endpoints) fragment the pool and reduce reuse.
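
The backoff sketch referenced above, assuming httpx and the hypothetical stream_completion helper from the earlier client sketch:

  import random
  import time

  import httpx

  MAX_ATTEMPTS = 4      # reconnect budget per logical request
  BASE_BACKOFF_S = 0.5

  def stream_with_backoff(payload: dict) -> list[str]:
      for attempt in range(MAX_ATTEMPTS):
          try:
              return stream_completion(payload)
          except httpx.TransportError:
              if attempt == MAX_ATTEMPTS - 1:
                  raise
              # Exponential backoff with jitter keeps a fleet of clients from
              # reconnecting in lockstep after an intermediary cuts streams.
              time.sleep(BASE_BACKOFF_S * (2 ** attempt) * (0.5 + random.random()))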

How to verify

Minimum measurements (an aggregation sketch follows this list):

  • new_connections_per_success: count new TCP/TLS sessions opened per successful completion.
  • p95(t_first_token) and p95(t_conn) (if you can instrument connection setup time separately).
  • Pool stats, if available (active/idle connections; reuse rate; per-host connection counts).
  • When possible, measure at both the client process and the proxy/LB layer; one layer can hide churn in another.
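
An aggregation sketch over per-request records; the field names (status, new_connection, admitted_at, first_token_at) are assumptions about your own logging schema:

  from statistics import quantiles

  def summarize(records: list[dict]) -> dict:
      successes = [r for r in records if r["status"] == "ok"]
      new_conns = sum(r.get("new_connection", 0) for r in records)
      ttfts = sorted(r["first_token_at"] - r["admitted_at"] for r in successes)
      return {
          "new_connections_per_success": new_conns / max(len(successes), 1),
          # quantiles(..., n=20) returns the 5th..95th percentile cut points;
          # the last one is p95 (needs at least two samples).
          "p95_t_first_token": quantiles(ttfts, n=20)[-1] if len(ttfts) >= 2 else None,
      }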

Practical verification steps:

  • Run a controlled load with a fixed backlog and compare “new connections per success” before/after.
  • Confirm the client closes streams on [DONE] (or equivalent) and on cancellation/timeouts.
  • Inspect system-level socket telemetry (e.g., established connections, SYN rate) to confirm connection churn actually decreased; a socket-count sketch follows this list.
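
A socket-count sketch, assuming the third-party psutil package; the IP and port are placeholders for your resolved inference endpoint:

  import psutil

  HOST = "203.0.113.10"  # hypothetical resolved IP of the inference endpoint
  PORT = 443

  # System-wide connection listing may require elevated privileges on some platforms.
  established = [
      c for c in psutil.net_connections(kind="tcp")
      if c.status == psutil.CONN_ESTABLISHED
      and c.raddr and c.raddr.ip == HOST and c.raddr.port == PORT
  ]
  print(f"established connections to {HOST}:{PORT}: {len(established)}")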

Acceptance criteria (a check sketch follows this list):

  • New connection rate drops after warm-up without increasing error rate.
  • p95(t_first_token) improves or becomes less variable; if it worsens, you may be leaking connections or fighting proxy timeouts.
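
A check sketch comparing a baseline run against a candidate run, using the hypothetical summarize() output from the verification section above:

  def accept(baseline: dict, candidate: dict,
             error_rate_before: float, error_rate_after: float) -> bool:
      fewer_new_conns = (
          candidate["new_connections_per_success"]
          < baseline["new_connections_per_success"]
      )
      ttft_not_worse = candidate["p95_t_first_token"] <= baseline["p95_t_first_token"]
      errors_not_worse = error_rate_after <= error_rate_before
      return fewer_new_conns and ttft_not_worse and errors_not_worse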

Scope boundary

This note covers: connection reuse mechanics for HTTP keep-alive/HTTP2 and SSE streaming in LLM inference clients, and how to verify reuse from telemetry.

This note excludes: provider-side infrastructure claims, kernel-level tuning, and any benchmark numbers not produced by a published harness and raw measurements.