Fix Slow Enterprise LLM Inference by Solving Control-Plane Bottlenecks
Diagnose and fix production AI slowness by separating model speed from control-plane failures, then validate gains with aggregate TPM and tail-latency metrics.
Published: 2026-01-01 · Last updated: 2026-02-09
Estimated read time: 6 min
Most enterprise teams blame model speed first. In production, slowdowns usually come from control-plane behavior: queueing, connection churn, admission gaps, retry loops, and post-processing stalls. If those buckets dominate, a faster model will not fix your incident profile.
The practical tell is this: queue depth remains high, but in-flight execution is unstable and time-to-first-token tails spike. That usually means your system is not work-conserving under load. In our own sharded run, aggregate completion throughput peaked at 373,277 tokens/min across 8 shards during a single one-minute window, while per-shard peaks sat much lower; this is why aggregate telemetry matters more than per-worker snapshots.
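Aggregating per-shard telemetry into one time series is mechanical but easy to get wrong. A minimal sketch, assuming a hypothetical event schema with `ts` (epoch seconds) and `completion_tokens` fields; adapt the field names to your own event log:

```python
from collections import defaultdict

def aggregate_tpm(events):
    """Sum completion tokens per wall-clock minute across ALL shards.

    Each event is assumed to look like
    {"ts": 1767225600.0, "shard": 3, "completion_tokens": 512}
    (hypothetical schema). Shard identity is deliberately ignored:
    the whole point is the fleet-wide aggregate, not per-worker peaks.
    """
    per_minute = defaultdict(int)
    for e in events:
        minute = int(e["ts"] // 60)  # bucket by minute boundary
        per_minute[minute] += e["completion_tokens"]
    return dict(per_minute)

events = [
    {"ts": 0.0, "shard": 0, "completion_tokens": 100},
    {"ts": 30.0, "shard": 1, "completion_tokens": 200},
    {"ts": 61.0, "shard": 0, "completion_tokens": 50},
]
print(aggregate_tpm(events))  # {0: 300, 1: 50}
```

Two shards contributing to the same minute bucket produce one aggregate number, which is the figure an incident review should quote.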
Why sharding changed hang behavior
With one large worker, connection pressure and retries accumulate in a single queue and socket pool. Sharding spreads load across workers, each with bounded in-flight concurrency and explicit backoff. That lowers the probability that one overloaded loop stalls the entire pipeline.
```python
# newgen/run.py
# CLI flags win, then environment, then defaults.
llm_concurrency = int(args.llm_concurrency or os.environ.get("LLM_CONCURRENCY") or 16)
shard_count = int(args.shards or args.shard_count or os.environ.get("SHARD_COUNT") or 1)

# newgen/llm_client.py
# Bounded in-flight concurrency per worker, plus capped exponential backoff.
self.sem = asyncio.Semaphore(int(concurrency))
backoff = float(os.environ.get("LLM_RETRY_BACKOFF_S", "1.0") or "1.0")
...
await asyncio.sleep(backoff)
backoff = min(backoff * 2.0, 30.0)
```
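Putting the pieces together, here is a self-contained sketch of one shard worker with a semaphore bounding in-flight requests and capped exponential backoff on failure. The class name, `send_request` callable, and `max_retries` parameter are illustrative assumptions, not the project's actual API; only the semaphore, the `LLM_RETRY_BACKOFF_S` variable, and the backoff arithmetic come from the snippets above:

```python
import asyncio
import os

class BoundedClient:
    """Sketch of a shard worker. The semaphore caps in-flight requests;
    transient failures back off exponentially, capped at 30 s, mirroring
    the production snippet. `send_request` stands in for the real call."""

    def __init__(self, concurrency: int = 16):
        self.sem = asyncio.Semaphore(int(concurrency))

    async def call(self, send_request, payload, max_retries: int = 5):
        backoff = float(os.environ.get("LLM_RETRY_BACKOFF_S", "1.0") or "1.0")
        async with self.sem:  # bounded in-flight work per worker
            for attempt in range(max_retries):
                try:
                    return await send_request(payload)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # exhausted retries; surface the error
                    await asyncio.sleep(backoff)  # explicit backoff
                    backoff = min(backoff * 2.0, 30.0)
```

Because each worker's semaphore is independent, one overloaded shard exhausts only its own in-flight budget instead of starving the fleet.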
The metrics that keep teams honest
- aggregate_completion_tpm across all shards
- p95/p99 time_to_first_token
- retry rate and retry causes
- success rate under load
- queue backlog vs in-flight utilization
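The tail-latency entries in the list above reward a precise definition. A minimal nearest-rank percentile sketch (one of several valid percentile conventions; the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of samples are <= it. Adequate for dashboard sanity checks."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical time-to-first-token samples, in seconds. Note how a few
# stalled requests dominate the tail while the median stays healthy.
ttft_s = [0.21, 0.25, 0.24, 0.22, 1.9, 0.23, 0.26, 0.25, 0.24, 2.4]
print(percentile(ttft_s, 50))  # 0.24
print(percentile(ttft_s, 95))  # 2.4
```

This is exactly the failure mode averages hide: the mean here looks tolerable while p95 is an order of magnitude worse.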
If throughput gains come with rising retries or tail latency, the system is still unstable. For executive reporting, publish one hash-verifiable minute sample from raw events and keep the exact run ID and file hash.
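A hash-verifiable minute sample can be produced directly from the raw event stream. A minimal sketch, assuming the same hypothetical event schema as above; the function name and `run_id` parameter are illustrative:

```python
import hashlib
import json

def minute_sample(events, minute_start, run_id):
    """Extract one minute of raw events and hash the canonical
    serialization, so a report can cite (run_id, sha256) and anyone
    with the raw events can reproduce the number."""
    window = [e for e in events if minute_start <= e["ts"] < minute_start + 60]
    # sort_keys gives a canonical byte string, so the hash is stable.
    blob = json.dumps({"run_id": run_id, "events": window},
                      sort_keys=True).encode("utf-8")
    return window, hashlib.sha256(blob).hexdigest()
```

Recomputing the hash from the same raw events must yield the same digest; any discrepancy means the sample or the run ID was altered.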
Scope boundary
This note covers: production bottleneck diagnosis for inference-time systems and why sharding plus bounded concurrency improves stability.
This note excludes: model quality claims, training/fine-tuning, and benchmark numbers without reproducible event provenance.