Why AI/LLM Inference Is Slow
A time-bucket method to diagnose why AI/LLM inference is slow in production and which controls fix which bottlenecks.
Published: 2026-01-01 · Last updated: 2026-01-01
Production inference is slow when non-generation time dominates: queueing, connection setup, blocked orchestration, tool latency, and retries. Throughput collapses when those buckets expand under load while generation time stays roughly constant. The only reliable way to prove the cause is to decompose each run into measurable time buckets (queued → admitted → first token → last token → released) and attribute failures and retries to the same `run_id`/`span_id` graph.
Mechanism
A fast model can still produce a slow system because the end-to-end completion time is the sum of multiple buckets:
- `t_queue`: waiting to be admitted (backlog, caps)
- `t_conn`: connection setup (if not reused)
- `t_first_token`: admitted → first token (includes upstream scheduling and provider-side load)
- `t_stream`: first token → last token (generation)
- `t_post`: parsing, validation, tool routing, persistence
- `t_retry`: additional time from retries, restarts, and duplicated work
When engineers say “inference is slow,” they often only look at t_stream (generation). In production, t_queue + t_first_token + t_post + t_retry frequently dominates, especially under bursty load where tail behavior controls user experience.
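For concreteness, here is a minimal decomposition sketch in Python, assuming each run records the five timestamps above plus whatever retry time you can attribute (field names are illustrative; `t_conn` is folded into `t_first_token` unless you measure connection setup separately):

```python
from dataclasses import dataclass

@dataclass
class SpanTimes:
    # Illustrative field names: the five transition timestamps from this note,
    # all in seconds (e.g. time.time() captured at each transition).
    queued_at: float
    admitted_at: float
    first_token_at: float
    last_token_at: float
    released_at: float
    retry_seconds: float = 0.0  # time attributed to retries/restarts, if tracked

def buckets(span: SpanTimes) -> dict[str, float]:
    """Decompose one run into the time buckets used in this note.

    t_conn is not split out here; without separate connection timestamps
    it shows up inside t_first_token.
    """
    return {
        "t_queue": span.admitted_at - span.queued_at,
        "t_first_token": span.first_token_at - span.admitted_at,
        "t_stream": span.last_token_at - span.first_token_at,
        "t_post": span.released_at - span.last_token_at,
        "t_retry": span.retry_seconds,
    }
```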
Two signatures identify control-plane bottlenecks (a detection sketch follows this list):

- High variance in time to first token: p95/p99 `t_first_token` moves dramatically while `t_stream` stays stable. This often indicates connection churn, queueing at the provider, or local admission gaps.
- Idle gaps under backlog: queue depth is non-zero, but `in_flight` is below the configured cap. This indicates the scheduler, networking, or CPU parsing path is stalling.
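A minimal detection sketch for both signatures, assuming you export per-run `t_first_token`/`t_stream` values and periodic scheduler samples of `(queue_depth, in_flight, cap)`; the 3.0/1.5 dispersion thresholds are illustrative, not tuned:

```python
import statistics

def ttft_tail_signature(t_first_token: list[float], t_stream: list[float]) -> bool:
    """Signature 1: the time-to-first-token tail moves while generation stays stable.

    Uses p95/p50 dispersion per bucket; needs at least two samples per series.
    """
    def dispersion(xs: list[float]) -> float:
        p95 = statistics.quantiles(xs, n=20)[18]
        p50 = statistics.median(xs)
        return p95 / max(p50, 1e-9)

    return dispersion(t_first_token) > 3.0 and dispersion(t_stream) < 1.5

def gap_fraction(samples: list[tuple[int, int, int]]) -> float:
    """Signature 2: fraction of scheduler samples with backlog but idle capacity.

    Each sample is (queue_depth, in_flight, cap).
    """
    if not samples:
        return 0.0
    gaps = sum(1 for queue_depth, in_flight, cap in samples
               if queue_depth > 0 and in_flight < cap)
    return gaps / len(samples)
```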
Use bucket dominance to triage (a sketch that automates this follows the list):

- If `t_queue` dominates, you have a backlog/cap or an upstream arrival spike; verify whether `in_flight` stays near cap when queue > 0.
- If `t_first_token` dominates, investigate connection reuse, upstream/provider queueing, and admission gaps.
- If `t_post` dominates, treat parsing/validation/persistence as a throughput limiter; move heavy work off the socket-read executor and measure CPU time explicitly.
- If `t_retry` dominates, fix retry budgets and timeouts before adjusting concurrency.
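To make the triage mechanical, a small sketch that picks the dominant bucket at p95 across runs (p95 is an assumption; use whatever tail percentile your users actually feel). It consumes the per-run bucket dicts produced by the decomposition sketch above:

```python
import statistics

BUCKETS = ("t_queue", "t_first_token", "t_stream", "t_post", "t_retry")

def dominant_bucket_p95(runs: list[dict[str, float]]) -> str:
    """Return the bucket with the largest p95 across runs.

    `runs` holds one bucket dict per run; needs at least two runs
    for statistics.quantiles().
    """
    p95 = {
        name: statistics.quantiles([r[name] for r in runs], n=20)[18]
        for name in BUCKETS
    }
    return max(p95, key=p95.get)
```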
If you pay for time-billed inference capacity, the cost impact is arithmetic:
- `cost_per_success = C_hour / R_success`
- Any increase in "time not streaming useful tokens" reduces `R_success` and increases cost per successful completion (a worked example follows).
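A worked example with illustrative numbers, showing how expanded non-streaming time moves cost per success even though the hourly rate is unchanged:

```python
# Illustrative numbers only: a $4.00/hour time-billed endpoint.
C_hour = 4.00            # cost of the capacity per hour
R_success_healthy = 120  # successful completions/hour when buckets are tight
R_success_degraded = 80  # completions/hour once queueing and retries expand

print(C_hour / R_success_healthy)   # ~0.033 dollars per successful completion
print(C_hour / R_success_degraded)  # 0.050 dollars per successful completion
```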
Common failure modes
- Optimizing the model first: changing model size or quantization without measuring whether `t_stream` is the dominant bucket.
- No IDs, no attribution: without `run_id`/`span_id`, you cannot connect retries and tool calls to specific slowdowns.
- Average-only reporting: averages hide the tail; p95/p99 drive perceived slowness and incident frequency.
- Retry amplification: timeouts and retries add load, which increases queueing, which creates more timeouts (quantified in the sketch after this list).
- Hidden tool bottlenecks: slow retrieval or external APIs make the inference stage idle while it waits, unless you design separate caps and backpressure.
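The retry-amplification loop can be quantified with a simple offered-load calculation; this is a sketch under the simplifying assumption that each attempt times out independently with probability `p_timeout`:

```python
def offered_load_multiplier(p_timeout: float, max_attempts: int) -> float:
    """Expected attempts per request when each attempt times out with
    probability p_timeout and the client makes up to max_attempts attempts.

    Finite geometric series: 1 + p + p^2 + ... (max_attempts terms).
    """
    return sum(p_timeout ** k for k in range(max_attempts))

# Example: a 30% timeout rate with 3 attempts pushes ~39% extra load onto
# the same queue, which lengthens queueing and raises the timeout rate further.
print(offered_load_multiplier(0.30, 3))  # 1 + 0.3 + 0.09 = 1.39
```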
How to verify
Minimum instrumentation:
- Per span: `queued_at`, `admitted_at`, `first_token_at`, `last_token_at`, `released_at`.
- Token counts: input/output tokens, plus "tokens wasted on failed attempts" if you can attribute them.
- Retry telemetry: retry cause, attempt count, and whether a retry reused a partial result or restarted from scratch.
- Scheduler telemetry: `cap`, `in_flight`, and queue depth over time (a record-layout sketch follows this list).
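A minimal record layout for this telemetry, assuming everything is keyed to `run_id`/`span_id`; field names are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class SpanRecord:
    run_id: str
    span_id: str
    queued_at: float
    admitted_at: float
    first_token_at: float
    last_token_at: float
    released_at: float
    input_tokens: int
    output_tokens: int
    wasted_tokens: int = 0   # tokens spent on failed attempts, if attributable

@dataclass
class RetryRecord:
    run_id: str
    span_id: str
    attempt: int
    cause: str               # e.g. "timeout", "rate_limit", "upstream_5xx"
    resumed_partial: bool    # True if the retry reused a partial result

@dataclass
class SchedulerSample:
    sampled_at: float
    cap: int
    in_flight: int
    queue_depth: int
```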
Analysis steps:
- Plot the distribution of each time bucket (p50/p95) before changing anything.
- Compute `gap_fraction` where backlog exists but `in_flight < cap`.
- Compare `t_first_token` variability with connection churn metrics (new connections per success).
- Confirm that success rate is stable; throughput that increases failures is a reliability regression (a first-pass summary sketch follows).
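Tying the steps together, a first-pass summary over exported runs; it assumes the per-run bucket dicts from the earlier sketches, and `new_connections` is whatever counter your HTTP client exposes:

```python
import statistics

BUCKET_NAMES = ("t_queue", "t_first_token", "t_stream", "t_post", "t_retry")

def bucket_summary(runs: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Step 1: (p50, p95) per bucket before changing anything."""
    return {
        name: (
            statistics.median([r[name] for r in runs]),
            statistics.quantiles([r[name] for r in runs], n=20)[18],
        )
        for name in BUCKET_NAMES
    }

def connections_per_success(new_connections: int, successes: int) -> float:
    """Step 3: connection churn; near zero with reuse, near or above 1.0 without."""
    return new_connections / max(successes, 1)
```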
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: Bounded concurrency limits (in-flight caps)
Scope boundary
This note covers: causal decomposition of “slow inference” into time buckets and verification signals for control-plane vs generation bottlenecks.
This note excludes: model benchmarking, training/fine-tuning, and any claims about specific provider performance that are not derived from your own measurements.