How to Scale LLM Throughput Past 500k TPM with 8-Shard Concurrency
Answer the throughput question that matters for production LLM systems: measure aggregate tokens per minute correctly, shard workers, and reduce HTTP hang risk.
Published: 2026-01-01 · Last updated: 2026-02-09
Estimated read time: 6 min
The core question for serious operators is simple: are you measuring one worker, or the entire system? For LLM operations, the metric that maps to delivery speed is aggregate tokens per minute (TPM) across all shards under real retry, timeout, and concurrency settings.
We validated this on newgen_full_20260201_030151_sharded by aggregating per-call token records over one-minute buckets across 8 shards. Peak observed minute exceeded 500k TPM total:
run_id: newgen_full_20260201_030151_sharded
minute_utc: 2026-02-01T03:44:00Z
shard_count: 8
calls_in_minute: 616
completion_tpm_out: 364704
prompt_tpm_in: 373948
tpm_total: 738652
events_sha256: 50406ae6e24fcf9a2287c61cf211d5c7333c9396d63dc18130816be59cdf1a4f
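The per-minute totals above can be reproduced with a small aggregation pass over the per-call token records. A minimal sketch, assuming each event carries an epoch timestamp and token counts; the field names (`ts`, `prompt_tokens`, `completion_tokens`) are illustrative, not taken from the actual pipeline:

```python
from collections import defaultdict
from datetime import datetime, timezone

def aggregate_tpm(events):
    """Bucket per-call token records into one-minute UTC windows across all shards."""
    buckets = defaultdict(lambda: {"calls": 0, "tpm_in": 0, "tpm_out": 0})
    for ev in events:
        # Truncate the timestamp to the minute so calls from every shard
        # land in the same bucket.
        minute = datetime.fromtimestamp(ev["ts"], tz=timezone.utc).replace(
            second=0, microsecond=0
        )
        b = buckets[minute]
        b["calls"] += 1
        b["tpm_in"] += ev["prompt_tokens"]
        b["tpm_out"] += ev["completion_tokens"]
    for b in buckets.values():
        b["tpm_total"] = b["tpm_in"] + b["tpm_out"]
    return dict(buckets)
```

The peak minute is then simply the bucket with the largest `tpm_total`; reporting that single number without the per-bucket breakdown hides the prompt/completion split discussed below.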
This resolves the usual confusion: single-shard numbers can look mediocre while aggregate throughput is strong. In this run, another minute reached 373277 completion TPM out, but total TPM was much higher once prompt-side token flow was included.
Sharding worked because it reduced queue pressure per worker and prevented one overloaded loop from stalling the entire run. Combined with bounded in-flight concurrency and exponential backoff, it materially reduced hang-prone behavior under load.
```python
import asyncio
import os

# CLI flags win over environment variables; defaults are conservative.
llm_concurrency = int(args.llm_concurrency or os.environ.get("LLM_CONCURRENCY") or 16)
shard_count = int(
    args.shards
    or args.shard_count
    or os.environ.get("SHARD_COUNT")
    or 1
)

# Inside the worker: a semaphore bounds in-flight calls per shard.
self.sem = asyncio.Semaphore(int(concurrency))

# Exponential backoff between retries, capped at 30 s.
backoff = float(os.environ.get("LLM_RETRY_BACKOFF_S", "1.0") or "1.0")
...
await asyncio.sleep(backoff)
backoff = min(backoff * 2.0, 30.0)
```
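Those fragments fit together in a single worker pattern. A minimal end-to-end sketch, assuming the caller passes an async function standing in for the real LLM request; `BoundedCaller` and the `retries` parameter are illustrative names, and the backoff constants mirror the fragments above:

```python
import asyncio
import os

MAX_BACKOFF_S = 30.0

class BoundedCaller:
    def __init__(self, concurrency: int):
        # Bound in-flight requests so one slow upstream cannot pile up work.
        self.sem = asyncio.Semaphore(int(concurrency))

    async def call_with_retry(self, fn, *args, retries: int = 5):
        backoff = float(os.environ.get("LLM_RETRY_BACKOFF_S", "1.0") or "1.0")
        async with self.sem:
            for attempt in range(retries):
                try:
                    return await fn(*args)
                except Exception:
                    if attempt == retries - 1:
                        raise
                    # Sleep, then double the delay up to the cap.
                    await asyncio.sleep(backoff)
                    backoff = min(backoff * 2.0, MAX_BACKOFF_S)
```

Holding the semaphore across the retry loop is a deliberate choice here: a request that is retrying still occupies one in-flight slot, which keeps total pressure on the upstream bounded even during an error burst.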
If you want trustworthy throughput claims, enforce three rules: publish aggregate TPM across all shards, label token type (completion, prompt, or both), and attach a reproducible hash for the source event stream.
- Use aggregate tokens per minute across all shards, not a single worker.
- Report whether numbers are completion tokens only, prompt tokens, or both.
- Keep hash-verifiable provenance for the event file used to compute metrics.
- Pair throughput with stability checks: retries, TTFT variance, and success rate.
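The provenance rule is cheap to enforce: a digest like the `events_sha256` value above can be reproduced with a streaming SHA-256 over the raw event file. A minimal sketch; the function name and the example filename are illustrative:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the event file through SHA-256 so large logs never load fully into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Publish the digest next to the throughput numbers, e.g.:
#   events_sha256: file_sha256("events.jsonl")
```

Anyone holding the same event file can then recompute the digest and re-derive every TPM figure independently.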
If throughput rises but retries and connection stalls rise faster, you did not improve the system; you moved the bottleneck.
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: HTTP connection reuse for LLM inference
- Sibling: Set the right LLM concurrency cap
Scope boundary
This note covers: throughput measurement for sharded LLM pipelines and how bounded concurrency plus retry backoff changes aggregate output.
This note excludes: model quality benchmarking, prompt strategy, training, and unpublished capacity claims without raw metric provenance.