Bounded Concurrency Limits for LLM Inference
How to choose bounded concurrency limits (in-flight caps) for LLM inference using measurement, stability criteria, and per-target controls.
Published: 2026-01-01 · Last updated: 2026-01-01
Bounded concurrency limits (in-flight caps) are the control that converts bursts into observable queueing instead of unbounded overload. The correct cap is the highest value that keeps the system work-conserving under backlog without increasing retry rate or decreasing success rate.
Mechanism
Define the unit you are capping:
- In-flight request: admitted and occupying a slot until the request is released (stream ends or is canceled).
- Cap: a hard limit on in-flight requests, enforced before the network call begins.
The cap serves two purposes:
- Utilization: with backlog, keep `in_flight ≈ cap` so you are not leaving capacity idle.
- Stability: prevent overload that increases latency variance, timeouts, 429/5xx responses, and retries.
Rate limits are multi-dimensional. A cap that avoids request-per-minute limits can still violate token-per-minute limits (or concurrent-stream limits), which then shows up as longer TTFT, more 429s, or increased timeouts. Treat the cap as one control in a system that must respect all configured budgets.
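As a sketch of what "respecting all configured budgets" can look like at admission time, the snippet below combines the in-flight cap with a token-per-minute budget. The `AdmissionControl` name, the budget numbers, and the caller-supplied token estimate are illustrative assumptions, not provider guidance.

```python
import asyncio
import time

class AdmissionControl:
    """Illustrative: one in-flight cap plus a token-per-minute budget."""
    def __init__(self, cap: int, tokens_per_minute: int):
        self.slots = asyncio.Semaphore(cap)      # in-flight cap
        self.tpm_budget = tokens_per_minute      # illustrative TPM budget
        self.window_start = time.monotonic()
        self.tokens_used = 0

    async def admit(self, estimated_tokens: int) -> None:
        await self.slots.acquire()               # respect the in-flight cap
        while True:
            now = time.monotonic()
            if now - self.window_start >= 60:    # roll the one-minute window
                self.window_start, self.tokens_used = now, 0
            if self.tokens_used + estimated_tokens <= self.tpm_budget:
                self.tokens_used += estimated_tokens
                return                           # both budgets respected
            await asyncio.sleep(0.1)             # wait for the window to roll

    def release(self) -> None:
        self.slots.release()
```

A caller would `await ctrl.admit(estimated_tokens)` before the network call and `ctrl.release()` when the stream ends or is canceled, so the cap and the token budget are checked at the same admission point.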
Selection is measurement-driven:
- If `cap` is too low, you will see `gap_fraction` rise (see the sketch below): `queue > 0` but `in_flight < cap` because admission is constrained and the system is not saturating.
- If `cap` is too high, you will see retry rate rise and success rate fall, often followed by higher tail latency and more timeouts.
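One way to compute `gap_fraction`, assuming you already sample queue depth and `in_flight` at a fixed interval (the sample field names here are illustrative):

```python
def gap_fraction(samples: list[dict], cap: int) -> float:
    """Fraction of backlogged samples where in_flight sags below the cap.

    Each sample is assumed to look like {"queue_depth": int, "in_flight": int}.
    """
    backlogged = [s for s in samples if s["queue_depth"] > 0]
    if not backlogged:
        return 0.0  # no backlog, so the work-conserving question does not arise
    gaps = sum(1 for s in backlogged if s["in_flight"] < cap)
    return gaps / len(backlogged)
```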
A practical selection procedure is to treat cap as a tunable parameter and sweep it under a fixed backlog:
- hold the request mix constant
- increase `cap` in small steps
- stop increasing when success rate drops or retries/timeouts increase
- choose a cap below the instability threshold (the “knee” of the curve)
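A sketch of that sweep, under assumptions: `run_backlog_trial` is a hypothetical function that replays the fixed request mix at a given cap and returns an object with `success_rate`, `retries`, and `timeouts`; the start value, step size, and upper bound are illustrative.

```python
async def sweep_cap(start: int = 4, step: int = 2, max_cap: int = 64) -> int:
    """Raise the cap until the stability criteria regress, then back off."""
    best = start
    baseline = await run_backlog_trial(cap=start)   # hypothetical trial runner
    for cap in range(start + step, max_cap + 1, step):
        result = await run_backlog_trial(cap=cap)
        regressed = (
            result.success_rate < baseline.success_rate
            or result.retries > baseline.retries
            or result.timeouts > baseline.timeouts
        )
        if regressed:
            break                                    # past the knee: stop raising the cap
        best, baseline = cap, result
    return best                                      # highest cap below the instability threshold
```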
In agent systems, separate caps are often required:
- `cap_model`: in-flight model calls (streaming)
- `cap_tools`: concurrent tool calls (DB, retrieval, external APIs)
Without separation, tool slowness can consume model capacity indirectly (e.g., by holding admission slots longer than intended) or can create backlogs that increase tail latency until timeouts trigger retries.
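A minimal way to keep the two pools separate, assuming an agent step that calls a tool and then the model; `call_tool` and `call_model` are stand-ins for your own clients, and the cap values are illustrative.

```python
import asyncio

cap_model = asyncio.Semaphore(14)   # in-flight model calls (illustrative value)
cap_tools = asyncio.Semaphore(32)   # concurrent tool calls (illustrative value)

async def agent_step(tool_request, model_request):
    # Hold the tool slot only while the tool call runs...
    async with cap_tools:
        tool_result = await call_tool(tool_request)   # stand-in for your tool client
    # ...so a slow tool tier never occupies a model slot.
    async with cap_model:
        return await call_model(model_request, context=tool_result)  # stand-in
```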
Minimal implementation pattern:
- A semaphore around model-call admission.
- A bounded queue behind it.
- Cancellation and retry budgets so the cap is not defeated by “stuck” requests.
Illustrative sketch (per-target cap):
```python
import asyncio

caps = {"provider_a": asyncio.Semaphore(14)}  # one cap per target; the value is illustrative

async def run_call(provider, request):
    async with caps[provider]:  # hold a slot until the stream ends or is canceled
        return await client.chat(request, stream=True)  # `client`: your provider SDK
```
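The semaphore above only covers admission. A sketch of the other two pieces of the pattern, a bounded queue and deadline-based cancellation, reusing the same illustrative names; the queue size and deadline are assumptions.

```python
import asyncio

queue: asyncio.Queue = asyncio.Queue(maxsize=200)   # bounded queue behind admission

def submit(provider, request) -> None:
    # Fail fast (raises asyncio.QueueFull) instead of queueing without bound.
    queue.put_nowait((provider, request))

async def worker(deadline_s: float = 30.0) -> None:
    while True:
        provider, request = await queue.get()
        try:
            # run_call (above) acquires the per-target cap; the deadline ensures a
            # stuck request cannot hold its slot indefinitely.
            await asyncio.wait_for(run_call(provider, request), timeout=deadline_s)
        except asyncio.TimeoutError:
            pass                                     # record it; retries come from a budget
        finally:
            queue.task_done()
```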
Common failure modes
- Unbounded concurrency: “let the runtime handle it” becomes “retry storm under load.”
- Global cap across unrelated traffic: a single hot path consumes all slots and starves other routes; use per-host/provider caps when needed.
- Holding slots while waiting on tools: if the model slot is occupied while the system waits on a slow tool tier, model throughput drops and queueing grows.
- No cancellation: requests that have exceeded their deadline keep consuming slots, preventing work-conserving scheduling for valuable requests.
- No retry budget: retries compound overload and can turn a transient error rate into a sustained incident.
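One way to enforce a retry budget is a windowed counter that only permits retries up to a small fraction of recent primary traffic; the class name and the 10% ratio below are illustrative.

```python
import time

class RetryBudget:
    """Allow retries only up to a fraction of recent primary requests (illustrative)."""
    def __init__(self, ratio: float = 0.1, window_s: float = 60.0):
        self.ratio = ratio
        self.window_s = window_s
        self.requests: list[float] = []   # timestamps of primary requests
        self.retries: list[float] = []    # timestamps of spent retries

    def _trim(self, events: list[float], now: float) -> None:
        cutoff = now - self.window_s
        while events and events[0] < cutoff:
            events.pop(0)

    def record_request(self) -> None:
        self.requests.append(time.monotonic())

    def try_spend_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        if len(self.retries) < self.ratio * len(self.requests):
            self.retries.append(now)
            return True
        return False   # budget exhausted: fail rather than compound overload
```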
How to verify
Log requirements:
- `cap` value(s) and per-target cap configuration.
- Scheduler timestamps: `queued_at`, `admitted_at`, `released_at`.
- `in_flight` sampled over time; queue depth over time.
- Retry counts and causes; success/failure outcomes.
Derived checks:
- Work-conserving check: when `queue > 0`, does `in_flight` stay near `cap`, or does it sag due to avoidable gaps?
- Stability check: as `cap` increases, do retries/timeouts increase? If yes, the cap is exceeding stable capacity or budgets are missing.
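If you log only the scheduler timestamps, the sampled series can be reconstructed offline and fed to the `gap_fraction` function shown earlier; a sketch, assuming each record carries `queued_at`, `admitted_at`, and `released_at` as seconds, and an illustrative one-second sampling interval.

```python
def samples_from_log(records: list[dict], interval_s: float = 1.0) -> list[dict]:
    """Rebuild periodic {queue_depth, in_flight} samples from scheduler timestamps."""
    if not records:
        return []
    start = min(r["queued_at"] for r in records)
    end = max(r["released_at"] for r in records)
    samples, t = [], start
    while t <= end:
        samples.append({
            "queue_depth": sum(1 for r in records if r["queued_at"] <= t < r["admitted_at"]),
            "in_flight": sum(1 for r in records if r["admitted_at"] <= t < r["released_at"]),
        })
        t += interval_s
    return samples
```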
Acceptance criteria:
- Increasing cap improves throughput up to a point, then stops improving; the selected cap is below the point where retry rate rises.
- Success rate does not regress; “throughput” that increases failures is not throughput.
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: HTTP connection reuse for LLM inference
Scope boundary
This note covers: bounded in-flight concurrency limits for streaming LLM inference calls, including per-target caps and verification from scheduler telemetry.
This note excludes: distributed queue design, global capacity planning, and any provider-specific “max concurrency” claims that are not derived from your own measurements and documented limits.