
Why AI/LLM Inference Is Slow

A time-bucket method to diagnose why AI/LLM inference is slow in production and which controls fix which bottlenecks.

Published: 2026-01-01 · Last updated: 2026-01-01

Production inference is slow when non-generation time dominates: queueing, connection setup, blocked orchestration, tool latency, and retries. Throughput collapses when those buckets expand under load while generation time stays roughly constant. The only reliable way to prove the cause is to decompose each run into measurable time buckets (queued → admitted → first token → last token → released) and attribute failures and retries to the same run_id/span_id graph.

Mechanism

A fast model can still produce a slow system because end-to-end completion time is the sum of multiple buckets (a sketch for computing them follows the list):

  • t_queue: waiting to be admitted (backlog, caps)
  • t_conn: connection setup (if not reused)
  • t_first_token: admitted → first token (includes upstream scheduling and provider-side load)
  • t_stream: first token → last token (generation)
  • t_post: parsing, validation, tool routing, persistence
  • t_retry: additional time from retries, restarts, and duplicated work
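
The decomposition is mechanical once per-span timestamps exist. A minimal sketch, assuming the field names from the instrumentation list later in this note plus illustrative conn_setup_s and retry_s fields:

    # Minimal sketch: derive the buckets above from one span's timestamps.
    # Field names mirror the instrumentation list later in this note; the
    # conn_setup_s and retry_s fields are illustrative additions.
    from dataclasses import dataclass

    @dataclass
    class SpanTimestamps:
        queued_at: float            # epoch seconds
        admitted_at: float
        first_token_at: float
        last_token_at: float
        released_at: float
        conn_setup_s: float = 0.0   # 0.0 when a pooled connection was reused
        retry_s: float = 0.0        # extra wall time attributed to retries

    def time_buckets(s: SpanTimestamps) -> dict[str, float]:
        """Decompose one run into the buckets listed above."""
        return {
            "t_queue": s.admitted_at - s.queued_at,
            "t_conn": s.conn_setup_s,
            "t_first_token": s.first_token_at - s.admitted_at,
            "t_stream": s.last_token_at - s.first_token_at,
            "t_post": s.released_at - s.last_token_at,
            "t_retry": s.retry_s,
        }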

When engineers say “inference is slow,” they often look only at t_stream (generation). In production, t_queue + t_first_token + t_post + t_retry frequently dominates, especially under bursty load where tail behavior controls the user experience.

Two signatures identify control-plane bottlenecks (a detection sketch follows the list):

  1. High variance in time to first token: p95/p99 t_first_token moves dramatically while t_stream is stable. This often indicates connection churn, queuing at the provider, or local admission gaps.

  2. Idle gaps under backlog: queue depth is non-zero, but in_flight is below the configured cap. This indicates the scheduler, networking, or CPU parsing path is stalling.
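
A sketch of the first signature, assuming per-run bucket samples are available in memory; the 3x threshold is an illustrative starting point, not a universal constant:

    # Signature 1: the first-token tail inflates while generation stays stable.
    from statistics import quantiles

    def tail_ratio(samples: list[float]) -> float:
        """p99 / p50 for one time bucket, in seconds."""
        qs = quantiles(samples, n=100)        # qs[49] ~ p50, qs[98] ~ p99
        return qs[98] / max(qs[49], 1e-9)     # guard against a degenerate p50

    def first_token_tail_inflated(t_first_token: list[float],
                                  t_stream: list[float],
                                  threshold: float = 3.0) -> bool:
        return tail_ratio(t_first_token) > threshold * tail_ratio(t_stream)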

Use bucket dominance to triage (a lookup sketch follows the list):

  • If t_queue dominates, you have a backlog/cap or an upstream arrival spike; verify whether in_flight stays near cap when queue>0.
  • If t_first_token dominates, investigate connection reuse, upstream/provider queueing, and admission gaps.
  • If t_post dominates, treat parsing/validation/persistence as a throughput limiter; move heavy work off the socket-read executor and measure CPU time explicitly.
  • If t_retry dominates, fix retry budgets and timeouts before adjusting concurrency.
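
As a sketch, this triage can be written as a lookup over the dominant bucket (buckets as produced by the earlier time_buckets example); the suggested checks paraphrase the list above and add t_stream and t_conn for completeness:

    # Map the dominant bucket to the first thing to check.
    TRIAGE = {
        "t_queue": "check backlog/caps; verify in_flight stays near cap while queue > 0",
        "t_first_token": "check connection reuse, provider-side queueing, admission gaps",
        "t_stream": "generation dominates; model/prompt-level changes are in scope",
        "t_post": "move parsing/validation/persistence off the socket-read path",
        "t_retry": "fix retry budgets and timeouts before touching concurrency",
        "t_conn": "reuse connections; keep pools warm",
    }

    def triage(buckets: dict[str, float]) -> str:
        dominant = max(buckets, key=buckets.get)
        return f"{dominant} dominates: {TRIAGE[dominant]}"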

If you pay for time-billed inference capacity, the cost impact is arithmetic (worked example after the list):

  • cost_per_success = C_hour / R_success
  • Any increase in “time not streaming useful tokens” reduces R_success and increases cost per successful completion.
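
A worked example with made-up numbers, only to show the shape of the arithmetic:

    # C_hour: hourly price of time-billed capacity; R_success: successes/hour.
    C_hour = 4.00             # $/hour (illustrative)
    R_success_healthy = 900   # successes/hour when buckets are healthy
    R_success_degraded = 600  # successes/hour after queueing and retries expand

    cost_healthy = C_hour / R_success_healthy    # ~$0.0044 per success
    cost_degraded = C_hour / R_success_degraded  # ~$0.0067 per success (+50%)
    print(f"{cost_healthy:.4f} -> {cost_degraded:.4f} $/successful completion")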

Common failure modes

  • Optimizing the model first: changing model size or quantization without measuring whether t_stream is the dominant bucket.
  • No IDs, no attribution: without run_id/span_id, you cannot connect retries and tool calls to specific slowdowns.
  • Average-only reporting: averages hide the tail; p95/p99 drive perceived slowness and incident frequency.
  • Retry amplification: timeouts and retries add load, which increases queueing, which creates more timeouts (a bounded retry budget is sketched after this list).
  • Hidden tool bottlenecks: slow retrieval or external APIs make the inference stage idle while it waits, unless you design separate caps and backpressure.
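
One mitigation for retry amplification is a retry budget that bounds both attempts and total wall time. A minimal sketch, assuming a caller-supplied attempt() callable that raises TimeoutError on timeout:

    import random
    import time

    def call_with_budget(attempt, max_attempts: int = 3, deadline_s: float = 30.0):
        """Retry with jittered backoff, but stop adding load once the budget is spent."""
        start = time.monotonic()
        for n in range(1, max_attempts + 1):
            try:
                return attempt()
            except TimeoutError:
                out_of_time = time.monotonic() - start >= deadline_s
                if n == max_attempts or out_of_time:
                    raise  # budget exhausted: surface the failure instead of piling on
                # jitter spreads retries so synchronized timeouts do not spike the queue
                time.sleep(min(2 ** n, 8) * random.uniform(0.5, 1.5))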

How to verify

Minimum instrumentation (an example per-span record follows the list):

  • Per span: queued_at, admitted_at, first_token_at, last_token_at, released_at.
  • Token counts: input/output tokens, plus “tokens wasted on failed attempts” if you can attribute them.
  • Retry telemetry: retry cause, attempt count, and whether a retry reused a partial result or restarted from scratch.
  • Scheduler telemetry: cap, in_flight, queue depth over time.
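
A sketch of the per-span record this list implies, emitted as one JSON line when the span is released so retries and tool calls can later be joined on run_id/span_id; field names and values are illustrative, not a standard schema:

    import json

    span_record = {
        "run_id": "run_7f3a", "span_id": "span_02", "parent_span_id": "span_01",
        "queued_at": 1767225600.12, "admitted_at": 1767225600.48,
        "first_token_at": 1767225601.91, "last_token_at": 1767225604.25,
        "released_at": 1767225604.30,
        "input_tokens": 812, "output_tokens": 240, "wasted_tokens": 95,
        "retry": {"cause": "timeout", "attempt": 2, "restarted_from_scratch": True},
        "scheduler": {"cap": 32, "in_flight": 31, "queue_depth": 7},
    }
    print(json.dumps(span_record))  # one line per released span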

Analysis steps:

  • Plot the distribution of each time bucket (p50/p95) before changing anything.
  • Compute gap_fraction where backlog exists but in_flight < cap (see the sketch after this list).
  • Compare t_first_token variability with connection churn metrics (new connections per success).
  • Confirm that the success rate is stable; a throughput gain that increases failures is a reliability regression.
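
A minimal sketch of the gap_fraction computation, assuming periodic scheduler samples of (timestamp, queue_depth, in_flight, cap) taken at a fixed interval:

    def gap_fraction(samples: list[tuple[float, int, int, int]]) -> float:
        """Fraction of backlogged samples where capacity sat idle anyway."""
        backlogged = [s for s in samples if s[1] > 0]   # queue_depth > 0
        if not backlogged:
            return 0.0
        gaps = [s for s in backlogged if s[2] < s[3]]   # in_flight < cap
        return len(gaps) / len(backlogged)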

Scope boundary

This note covers: causal decomposition of “slow inference” into time buckets and verification signals for control-plane vs generation bottlenecks.

This note excludes: model benchmarking, training/fine-tuning, and any claims about specific provider performance that are not derived from your own measurements.