Tokens Per Second (LLM Inference)
What tokens per second measures, how it differs from end-to-end throughput, and how to use it to detect idle time in LLM inference pipelines.
Published: 2026-01-01 · Last updated: 2026-01-01
Tokens per second is a throughput measure: how fast a model or pipeline produces useful tokens. It is only actionable when you also measure when the system is not producing tokens despite having work (queue_depth > 0), and when you can separate generation time from connection, queueing, and orchestration overhead.
Mechanism
There are at least three “tokens per second” metrics; if you mix them, you will optimize the wrong thing:
- Stream TPS (per request): output_tokens / t_stream, where t_stream = first_token → last_token. This mostly reflects model + server-side generation speed once the stream starts.
- End-to-end TPS (per request): output_tokens / t_total, where t_total = admitted → released (or user-visible start → completion). This includes time to first token, stalls, post-processing, and any idle gaps caused by orchestration.
- System TPS (aggregate): sum(output_tokens) / window_seconds across all successful requests. This is the metric that drives time-billed cost per completion, because it correlates with completions/hour when success rate is stable.
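A minimal sketch of the three definitions above, assuming output token counts and epoch-second timestamps are already available (the function names are illustrative, not a standard API):

```python
def stream_tps(output_tokens: int, first_token_at: float, last_token_at: float) -> float:
    # Per-request emission rate once the stream has started (output tokens only).
    return output_tokens / (last_token_at - first_token_at)

def end_to_end_tps(output_tokens: int, admitted_at: float, released_at: float) -> float:
    # Per-request rate including time to first token, stalls, and post-processing.
    return output_tokens / (released_at - admitted_at)

def system_tps(output_token_counts: list[int], window_seconds: float) -> float:
    # Aggregate rate across all successful requests in a fixed window.
    # Note: this is not the mean of per-request stream TPS; idle gaps between
    # streams lower the aggregate without lowering any per-request number.
    return sum(output_token_counts) / window_seconds
```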
The failure mode you are trying to detect is idle time: the system has a backlog, but it is not keeping in-flight streams at the cap. Define an explicit gap metric:
- Let gap_time be the total time in a window where queue_depth > 0 and in_flight < cap.
- Let gap_fraction = gap_time / window_time.
If gap_fraction is non-trivial, your throughput problem is likely not “the model is slow.” It is admission, connection churn, blocked orchestration, or downstream stalls that prevent work-conserving scheduling.
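A sketch of the gap metric, assuming scheduler state is sampled on a fixed cadence as described under "How to verify" below; the samples are hypothetical (queue_depth, in_flight) pairs:

```python
def gap_fraction(samples: list[tuple[int, int]], cap: int, sample_interval_s: float) -> float:
    # samples: (queue_depth, in_flight) pairs taken at a fixed cadence.
    # A sample counts toward gap_time when there is backlog but the
    # scheduler is not keeping in-flight streams at the cap.
    if not samples:
        return 0.0
    gap_samples = sum(
        1 for queue_depth, in_flight in samples
        if queue_depth > 0 and in_flight < cap
    )
    gap_time = gap_samples * sample_interval_s
    window_time = len(samples) * sample_interval_s
    return gap_time / window_time
```

With a 1-second cadence, gap_fraction([(5, 2), (5, 4), (0, 4)], cap=4, sample_interval_s=1.0) returns 1/3: only the first sample had backlog while in_flight was below the cap.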
Common failure modes
- Counting the wrong tokens: provider “token usage” often includes input tokens; “tokens per second” usually refers to output token emission rate. Make it explicit whether you mean output, input, or total.
- Tokenizer/model mismatch: token counts differ across models and tokenizers; do not compare TPS across providers/models unless you normalize tokenization or use the same accounting source consistently.
- Measuring only during streaming: a high stream TPS can coexist with poor end-to-end throughput if time-to-first-token is high or if the pipeline has idle gaps between streams.
- Ignoring failures and retries: if retries duplicate work, “TPS on successful requests” can look good while system-level TPS is poor because capacity is consumed by failed attempts.
- Sampling bias: measuring only short requests hides tail behavior; long requests tend to surface queueing, stalls, and disconnect handling problems.
- No stable admission timestamps: if you do not log queued_at, admitted_at, first_token_at, last_token_at, and released_at, you cannot separate queueing from streaming and cannot compute gap_fraction.
How to verify
Minimum logging (per request/span):
- Identifiers: run_id, span_id, and a stable request identifier.
- Timestamps: queued_at, admitted_at, first_token_at, last_token_at, released_at.
- Counts: output_tokens, input_tokens (if available), retry_count, and retry causes.
- Scheduler state: cap (configured) and instantaneous in_flight sampled on a fixed cadence.
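A minimal sketch of this logging schema, assuming JSON-lines output and epoch-second timestamps; the class and field names mirror the list above but are illustrative, not a standard:

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass
class RequestSpan:
    # Identifiers
    run_id: str
    span_id: str
    request_id: str
    # Timestamps (epoch seconds)
    queued_at: float
    admitted_at: float
    first_token_at: float
    last_token_at: float
    released_at: float
    # Counts
    output_tokens: int
    input_tokens: int | None = None
    retry_count: int = 0
    retry_causes: list[str] = field(default_factory=list)

    def to_log_line(self) -> str:
        # One JSON object per line keeps the records easy to aggregate later.
        return json.dumps(asdict(self))

@dataclass
class SchedulerSample:
    # Sampled on a fixed cadence, independently of any single request.
    sampled_at: float
    queue_depth: int
    in_flight: int
    cap: int
```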
Derived metrics to compute:
- t_queue = admitted_at - queued_at
- t_first_token = first_token_at - admitted_at
- t_stream = last_token_at - first_token_at
- t_in_flight = released_at - admitted_at
- stream_tps = output_tokens / t_stream
- end_to_end_tps = output_tokens / t_in_flight
- gap_fraction from (queue_depth > 0) && (in_flight < cap)
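These reduce to a few subtractions and divisions per request; a sketch, assuming epoch-second timestamps and guarding against zero-length intervals (gap_fraction comes from sampled scheduler state, not from a single request):

```python
def derived_metrics(queued_at: float, admitted_at: float, first_token_at: float,
                    last_token_at: float, released_at: float,
                    output_tokens: int) -> dict[str, float]:
    # Per-request derived metrics from the logged timestamps and counts.
    t_queue = admitted_at - queued_at
    t_first_token = first_token_at - admitted_at
    t_stream = last_token_at - first_token_at
    t_in_flight = released_at - admitted_at
    return {
        "t_queue": t_queue,
        "t_first_token": t_first_token,
        "t_stream": t_stream,
        "t_in_flight": t_in_flight,
        "stream_tps": output_tokens / t_stream if t_stream > 0 else 0.0,
        "end_to_end_tps": output_tokens / t_in_flight if t_in_flight > 0 else 0.0,
    }
```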
Acceptance criteria (do not skip):
- System-level throughput improves without reducing success rate.
- gap_fraction decreases when backlog exists (proof that you removed idle time, not just shifted it).
- Retry rate does not increase; if it does, throughput gains are likely unstable.
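One way to mechanize the check, sketched under the assumption that each measurement window is summarized into system_tps, success_rate, gap_fraction, and retry_rate (the key names are illustrative):

```python
def acceptance_check(before: dict[str, float], after: dict[str, float]) -> list[str]:
    # Returns the list of violated criteria; an empty list means the change passes.
    failures = []
    if after["system_tps"] <= before["system_tps"]:
        failures.append("system-level TPS did not improve")
    if after["success_rate"] < before["success_rate"]:
        failures.append("success rate regressed")
    if after["gap_fraction"] >= before["gap_fraction"]:
        failures.append("gap_fraction did not decrease while backlog existed")
    if after["retry_rate"] > before["retry_rate"]:
        failures.append("retry rate increased; throughput gains are likely unstable")
    return failures
```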
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: Why AI/LLM inference is slow in production
Scope boundary
This note covers: definitions and measurement of tokens per second, and how to interpret TPS alongside queueing and in-flight utilization to detect idle time.
This note excludes: prompt/content changes, model selection, training/fine-tuning, benchmark claims, and any “best TPS” numbers not derived from your own logs.