Bounded Concurrency Limits for LLM Inference
How to choose bounded concurrency limits (in-flight caps) for LLM inference using measurement, stability criteria, and per-target controls.
Published: 2026-01-01 · Last updated: 2026-01-01
Bounded concurrency limits (in-flight caps) are the control that converts bursts into observable queueing instead of unbounded overload. The correct cap is the highest value that keeps the system work-conserving under backlog without increasing retry rate or decreasing success rate.
Mechanism
Define the unit you are capping:
- In-flight request: admitted and occupying a slot until the request is released (stream ends or is canceled).
- Cap: a hard limit on in-flight requests, enforced before the network call begins.
The cap serves two purposes:
- Utilization: with backlog, keep `in_flight ≈ cap` so you are not leaving capacity idle.
- Stability: prevent overload that increases latency variance, timeouts, 429/5xx responses, and retries.
Rate limits are multi-dimensional. A cap that avoids request-per-minute limits can still violate token-per-minute limits (or concurrent-stream limits), which then shows up as longer TTFT, more 429s, or increased timeouts. Treat the cap as one control in a system that must respect all configured budgets.
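As a sketch of what "respecting all configured budgets" can look like at admission time, the snippet below combines the in-flight cap with a token-per-minute budget. The `AdmissionControl` name, the budget numbers, and the caller-supplied token estimate are illustrative assumptions, not provider guidance.

```python
import asyncio
import time

class AdmissionControl:
    """Illustrative: one in-flight cap plus a token-per-minute budget."""
    def __init__(self, cap: int, tokens_per_minute: int):
        self.slots = asyncio.Semaphore(cap)      # in-flight cap
        self.tpm_budget = tokens_per_minute      # illustrative TPM budget
        self.window_start = time.monotonic()
        self.tokens_used = 0

    async def admit(self, estimated_tokens: int) -> None:
        await self.slots.acquire()               # respect the in-flight cap
        while True:
            now = time.monotonic()
            if now - self.window_start >= 60:    # roll the one-minute window
                self.window_start, self.tokens_used = now, 0
            if self.tokens_used + estimated_tokens <= self.tpm_budget:
                self.tokens_used += estimated_tokens
                return                           # both budgets respected
            await asyncio.sleep(0.1)             # wait for the window to roll

    def release(self) -> None:
        self.slots.release()
```

A caller would `await ctrl.admit(estimated_tokens)` before the network call and `ctrl.release()` when the stream ends or is canceled, so the cap and the token budget are checked at the same admission point.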
Selection is measurement-driven:
- If `cap` is too low, you will see `gap_fraction` rise (see the sketch below): `queue > 0` but `in_flight < cap` because admission is constrained and the system is not saturating.
- If `cap` is too high, you will see retry rate rise and success rate fall, often followed by higher tail latency and more timeouts.
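One way to compute `gap_fraction`, assuming you already sample queue depth and `in_flight` at a fixed interval (the sample field names here are illustrative):

```python
def gap_fraction(samples: list[dict], cap: int) -> float:
    """Fraction of backlogged samples where in_flight sags below the cap.

    Each sample is assumed to look like {"queue_depth": int, "in_flight": int}.
    """
    backlogged = [s for s in samples if s["queue_depth"] > 0]
    if not backlogged:
        return 0.0  # no backlog, so the work-conserving question does not arise
    gaps = sum(1 for s in backlogged if s["in_flight"] < cap)
    return gaps / len(backlogged)
```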
A practical selection procedure is to treat cap as a tunable parameter and sweep it under a fixed backlog:
- hold the request mix constant
- increase `cap` in small steps
- stop increasing when success rate drops or retries/timeouts increase
- choose a cap below the instability threshold (the “knee” of the curve)
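A sketch of that sweep, under assumptions: `run_backlog_trial` is a hypothetical function that replays the fixed request mix at a given cap and returns an object with `success_rate`, `retries`, and `timeouts`; the start value, step size, and upper bound are illustrative.

```python
async def sweep_cap(start: int = 4, step: int = 2, max_cap: int = 64) -> int:
    """Raise the cap until the stability criteria regress, then back off."""
    best = start
    baseline = await run_backlog_trial(cap=start)   # hypothetical trial runner
    for cap in range(start + step, max_cap + 1, step):
        result = await run_backlog_trial(cap=cap)
        regressed = (
            result.success_rate < baseline.success_rate
            or result.retries > baseline.retries
            or result.timeouts > baseline.timeouts
        )
        if regressed:
            break                                    # past the knee: stop raising the cap
        best, baseline = cap, result
    return best                                      # highest cap below the instability threshold
```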
In agent systems, separate caps are often required:
- `cap_model`: in-flight model calls (streaming)
- `cap_tools`: concurrent tool calls (DB, retrieval, external APIs)
Without separation, tool slowness can consume model capacity indirectly (e.g., by holding admission slots longer than intended) or can create backlogs that increase tail latency until timeouts trigger retries.
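A minimal way to keep the two pools separate, assuming an agent step that calls a tool and then the model; `call_tool` and `call_model` are stand-ins for your own clients, and the cap values are illustrative.

```python
import asyncio

cap_model = asyncio.Semaphore(14)   # in-flight model calls (illustrative value)
cap_tools = asyncio.Semaphore(32)   # concurrent tool calls (illustrative value)

async def agent_step(tool_request, model_request):
    # Hold the tool slot only while the tool call runs...
    async with cap_tools:
        tool_result = await call_tool(tool_request)   # stand-in for your tool client
    # ...so a slow tool tier never occupies a model slot.
    async with cap_model:
        return await call_model(model_request, context=tool_result)  # stand-in
```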
Minimal implementation pattern:
- A semaphore around model-call admission.
- A bounded queue behind it.
- Cancellation and retry budgets so the cap is not defeated by “stuck” requests.
Illustrative sketch (per-target cap):
```python
import asyncio

caps = {"provider_a": asyncio.Semaphore(14)}  # one cap per target; the value is illustrative

async def run_call(provider, request):
    async with caps[provider]:  # hold a slot until the stream ends or is canceled
        return await client.chat(request, stream=True)  # `client`: your provider SDK
```
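The semaphore above only covers admission. A sketch of the other two pieces of the pattern, a bounded queue and deadline-based cancellation, reusing the same illustrative names; the queue size and deadline are assumptions.

```python
import asyncio

queue: asyncio.Queue = asyncio.Queue(maxsize=200)   # bounded queue behind admission

def submit(provider, request) -> None:
    # Fail fast (raises asyncio.QueueFull) instead of queueing without bound.
    queue.put_nowait((provider, request))

async def worker(deadline_s: float = 30.0) -> None:
    while True:
        provider, request = await queue.get()
        try:
            # run_call (above) acquires the per-target cap; the deadline ensures a
            # stuck request cannot hold its slot indefinitely.
            await asyncio.wait_for(run_call(provider, request), timeout=deadline_s)
        except asyncio.TimeoutError:
            pass                                     # record it; retries come from a budget
        finally:
            queue.task_done()
```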
Common failure modes
- Unbounded concurrency: “let the runtime handle it” becomes “retry storm under load.”
- Global cap across unrelated traffic: a single hot path consumes all slots and starves other routes; use per-host/provider caps when needed.
- Holding slots while waiting on tools: if the model slot is occupied while the system waits on a slow tool tier, model throughput drops and queueing grows.
- No cancellation: requests that have exceeded their deadline keep consuming slots, preventing work-conserving scheduling for valuable requests.
- No retry budget: retries compound overload and can turn a transient error rate into a sustained incident.
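One way to enforce a retry budget is a windowed counter that only permits retries up to a small fraction of recent primary traffic; the class name and the 10% ratio below are illustrative.

```python
import time

class RetryBudget:
    """Allow retries only up to a fraction of recent primary requests (illustrative)."""
    def __init__(self, ratio: float = 0.1, window_s: float = 60.0):
        self.ratio = ratio
        self.window_s = window_s
        self.requests: list[float] = []   # timestamps of primary requests
        self.retries: list[float] = []    # timestamps of spent retries

    def _trim(self, events: list[float], now: float) -> None:
        cutoff = now - self.window_s
        while events and events[0] < cutoff:
            events.pop(0)

    def record_request(self) -> None:
        self.requests.append(time.monotonic())

    def try_spend_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        if len(self.retries) < self.ratio * len(self.requests):
            self.retries.append(now)
            return True
        return False   # budget exhausted: fail rather than compound overload
```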
How to verify
Log requirements:
- `cap` value(s) and per-target cap configuration.
- Scheduler timestamps: `queued_at`, `admitted_at`, `released_at`.
- `in_flight` sampled over time; queue depth over time.
- Retry counts and causes; success/failure outcomes.
Derived checks:
- Work-conserving check: when `queue > 0`, does `in_flight` stay near `cap`, or does it sag due to avoidable gaps?
- Stability check: as `cap` increases, do retries/timeouts increase? If yes, the cap is exceeding stable capacity or budgets are missing.
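If you log only the scheduler timestamps, the sampled series can be reconstructed offline and fed to the `gap_fraction` function shown earlier; a sketch, assuming each record carries `queued_at`, `admitted_at`, and `released_at` as seconds, and an illustrative one-second sampling interval.

```python
def samples_from_log(records: list[dict], interval_s: float = 1.0) -> list[dict]:
    """Rebuild periodic {queue_depth, in_flight} samples from scheduler timestamps."""
    if not records:
        return []
    start = min(r["queued_at"] for r in records)
    end = max(r["released_at"] for r in records)
    samples, t = [], start
    while t <= end:
        samples.append({
            "queue_depth": sum(1 for r in records if r["queued_at"] <= t < r["admitted_at"]),
            "in_flight": sum(1 for r in records if r["admitted_at"] <= t < r["released_at"]),
        })
        t += interval_s
    return samples
```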
Acceptance criteria:
- Increasing cap improves throughput up to a point, then stops improving; the selected cap is below the point where retry rate rises.
- Success rate does not regress; “throughput” that increases failures is not throughput.
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: HTTP connection reuse for LLM inference
Scope boundary
This note covers: bounded in-flight concurrency limits for streaming LLM inference calls, including per-target caps and verification from scheduler telemetry.
This note excludes: distributed queue design, global capacity planning, and any provider-specific “max concurrency” claims that are not derived from your own measurements and documented limits.