Fix Slow Enterprise LLM Inference by Solving Control-Plane Bottlenecks

Diagnose and fix production AI slowness by separating model speed from control-plane failures, then validate gains with aggregate TPM and tail-latency metrics.

Published: 2026-01-01 · Last updated: 2026-02-09

Estimated read time: 6 min

Most enterprise teams blame model speed first. In production, slowdowns usually come from control-plane behavior: queueing, connection churn, admission gaps, retry loops, and post-processing stalls. If those buckets dominate, a faster model will not fix your incident profile.

The practical tell is this: queue depth remains high, but in-flight execution is unstable and time-to-first-token tails spike. That usually means your system is not work-conserving under load. In our own sharded run, aggregate completion throughput peaked at 373,277 tokens/min across 8 shards, while single-shard peaks sat much lower; this is why aggregate telemetry matters more than per-worker snapshots.
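Aggregating per-shard events into a single TPM series is straightforward once events carry timestamps and token counts. The sketch below is a minimal illustration; the event fields (`ts`, `completion_tokens`) are assumptions, not the actual event schema:

```python
from collections import defaultdict

def aggregate_completion_tpm(events):
    """Peak aggregate tokens/min across all shards.

    `events` is a hypothetical list of dicts with `ts` (epoch seconds)
    and `completion_tokens`; the real event schema may differ.
    """
    per_minute = defaultdict(int)
    for e in events:
        minute = int(e["ts"]) // 60          # bucket into one-minute windows
        per_minute[minute] += e["completion_tokens"]
    # Peak aggregate TPM is the max over all one-minute buckets.
    return max(per_minute.values()) if per_minute else 0

events = [
    {"ts": 0, "completion_tokens": 500},   # shard A, minute 0
    {"ts": 30, "completion_tokens": 700},  # shard B, minute 0
    {"ts": 65, "completion_tokens": 400},  # shard A, minute 1
]
print(aggregate_completion_tpm(events))  # 1200
```

Because buckets sum across shards before taking the max, a per-worker snapshot can look unremarkable while the aggregate is at its peak.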

Why sharding changed hang behavior

With one large worker, connection pressure and retries accumulate in a single queue and socket pool. Sharding spreads load across workers, each with bounded in-flight concurrency and explicit backoff. That lowers the probability that one overloaded loop stalls the entire pipeline.

# newgen/run.py
# Per-worker concurrency and shard count: CLI flags first, then environment, then defaults.
llm_concurrency = int(args.llm_concurrency or os.environ.get("LLM_CONCURRENCY") or 16)
shard_count = int(args.shards or args.shard_count or os.environ.get("SHARD_COUNT") or 1)

# newgen/llm_client.py
# Bound in-flight requests per worker; retry with capped exponential backoff.
self.sem = asyncio.Semaphore(int(concurrency))
backoff = float(os.environ.get("LLM_RETRY_BACKOFF_S", "1.0") or "1.0")
...
await asyncio.sleep(backoff)
backoff = min(backoff * 2.0, 30.0)
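Putting the two fragments together, the pattern is a semaphore-bounded call with capped exponential backoff. Here is a self-contained sketch of that pattern; `bounded_call` and `flaky` are illustrative stand-ins, not the real client:

```python
import asyncio

async def bounded_call(sem, attempt_fn, initial_backoff=1.0, max_backoff=30.0, retries=5):
    """Run attempt_fn under a semaphore, retrying with capped exponential backoff."""
    backoff = initial_backoff
    async with sem:
        for _ in range(retries):
            try:
                return await attempt_fn()
            except ConnectionError:
                await asyncio.sleep(backoff)
                backoff = min(backoff * 2.0, max_backoff)
        raise RuntimeError("retries exhausted")

async def main():
    sem = asyncio.Semaphore(16)  # bounded in-flight concurrency per worker
    attempts = 0

    async def flaky():
        # Simulated transient failure: succeeds on the third attempt.
        nonlocal attempts
        attempts += 1
        if attempts < 3:
            raise ConnectionError("transient")
        return "ok"

    # Tiny initial backoff so the demo runs fast; production would use ~1.0s.
    return await bounded_call(sem, flaky, initial_backoff=0.01)

print(asyncio.run(main()))  # ok
```

Because the semaphore is held across retries, a shard under heavy retry pressure stops admitting new work instead of piling it onto an already stalled socket pool.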

The metrics that keep teams honest

  • aggregate_completion_tpm across all shards
  • p95/p99 time_to_first_token
  • retry rate and retry causes
  • success rate under load
  • queue backlog vs in-flight utilization
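The tail-latency metrics above can be computed from raw samples with no extra dependencies. This sketch uses the nearest-rank method; whether your pipeline uses nearest-rank or interpolation is an assumption worth pinning down, since the two disagree on small samples:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100.0 * len(ordered))
    return ordered[rank - 1]

# Hypothetical time-to-first-token samples in milliseconds (20 requests).
ttft_ms = [100] * 17 + [200, 900, 2500]
print(percentile(ttft_ms, 95))  # 900
print(percentile(ttft_ms, 99))  # 2500
```

Note how a single slow request dominates p99 while leaving the median untouched; that is exactly the signature of a control-plane stall rather than a uniformly slow model.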

If throughput gains come with rising retries or tail latency, the system is still unstable. For executive reporting, publish one hash-verifiable minute sample from raw events and keep the exact run ID and file hash.
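A hash-verifiable sample can be as simple as a canonical serialization of the raw events plus a SHA-256 digest. The sketch below is one way to do it, assuming JSON-serializable events; the field names and run ID format are illustrative:

```python
import hashlib
import json

def minute_sample_digest(run_id, raw_events):
    """Reproducible digest for one minute of raw events.

    Serializing with sorted keys and fixed separators makes the digest
    deterministic: anyone with the same run ID and raw events can
    recompute and compare it.
    """
    payload = json.dumps({"run_id": run_id, "events": raw_events},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

sample = [{"ts": 0, "completion_tokens": 500}]
digest = minute_sample_digest("run-001", sample)
print(len(digest))  # 64
```

Publishing the run ID, the raw-event file, and this digest together lets an executive report be checked against the events rather than taken on trust.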

Scope boundary

This note covers: production bottleneck diagnosis for inference-time systems and why sharding plus bounded concurrency improves stability.

This note excludes: model quality claims, training/fine-tuning, and benchmark numbers without reproducible event provenance.