
Tokens Per Second (LLM Inference)

What tokens per second measures, how it differs from end-to-end throughput, and how to use it to detect idle time in LLM inference pipelines.

Published: 2026-01-01 · Last updated: 2026-01-01

Tokens per second is a throughput measure: how fast a model or pipeline produces useful tokens. It is only actionable when you also measure how long the system goes without producing tokens despite having work (queue_depth > 0) and when you can separate generation time from connection, queueing, and orchestration overhead.

Mechanism

There are at least three “tokens per second” metrics; if you mix them, you will optimize the wrong thing (a worked example follows the list):

  • Stream TPS (per request): output_tokens / t_stream, where t_stream = first_token → last_token. This mostly reflects model + server-side generation speed once the stream starts.
  • End-to-end TPS (per request): output_tokens / t_total, where t_total = admitted → released (or user-visible start → completion). This includes time to first token, stalls, post-processing, and any idle gaps caused by orchestration.
  • System TPS (aggregate): sum(output_tokens) / window_seconds across all successful requests. This is the metric that drives time-billed cost per completion because it correlates with completions/hour when success rate is stable.
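
As a hypothetical illustration of how these diverge: a request emits 600 output tokens and streams them in 10 seconds, so stream TPS is 60 tok/s; but it also spends 20 seconds queued and waiting for its first token, so t_total is 30 seconds and end-to-end TPS is 20 tok/s. The first number says the model is fast; the second says two thirds of the request's wall-clock time was spent outside generation.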

The failure mode you are trying to detect is idle time: the system has a backlog, but it is not keeping in-flight streams at the cap. Define an explicit gap metric:

  • Let gap_time be total time in a window where queue_depth > 0 and in_flight < cap.
  • Let gap_fraction = gap_time / window_time.

If gap_fraction is non-trivial, your throughput problem is likely not “the model is slow.” It is admission, connection churn, blocked orchestration, or downstream stalls that prevent work-conserving scheduling.
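
A minimal sketch of that computation, assuming queue_depth and in_flight are sampled together on a fixed cadence and the samples are ordered by time; the names and shapes here are illustrative, not any particular scheduler's API:

  from dataclasses import dataclass
  from typing import List

  @dataclass
  class SchedulerSample:
      t: float            # sample timestamp, seconds
      queue_depth: int    # requests waiting for admission at time t
      in_flight: int      # streams currently running at time t

  def gap_fraction(samples: List[SchedulerSample], cap: int) -> float:
      """Fraction of the window with backlog present but the in-flight cap unused."""
      if len(samples) < 2:
          return 0.0
      window_time = samples[-1].t - samples[0].t
      gap_time = 0.0
      for prev, cur in zip(samples, samples[1:]):
          # Attribute each interval to a gap if, at its start, there was
          # backlog (queue_depth > 0) and spare capacity (in_flight < cap).
          if prev.queue_depth > 0 and prev.in_flight < cap:
              gap_time += cur.t - prev.t
      return gap_time / window_time if window_time > 0 else 0.0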

Common failure modes

  • Counting the wrong tokens: provider “token usage” often includes input tokens; “tokens per second” usually refers to output token emission rate. Make it explicit whether you mean output, input, or total.
  • Tokenizer/model mismatch: token counts differ across models and tokenizers; do not compare TPS across providers/models unless you normalize tokenization or use the same accounting source consistently.
  • Measuring only during streaming: a high stream TPS can coexist with poor end-to-end throughput if time-to-first-token is high or if the pipeline has idle gaps between streams.
  • Ignoring failures and retries: if retries duplicate work, “TPS on successful requests” can look good while system-level TPS is poor because capacity is consumed by failed attempts (see the sketch after this list).
  • Sampling bias: measuring only short requests hides tail behavior; long requests tend to surface queueing, stalls, and disconnect handling problems.
  • No stable admission timestamps: if you do not log queued_at, admitted_at, first_token_at, last_token_at, and released_at, you cannot separate queueing from streaming and cannot compute gap_fraction.
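
A sketch of retry-aware accounting, assuming one record per attempt with an output token count and a success flag (both field names are assumptions); it separates what users received from what capacity was spent producing:

  def throughput_with_retries(attempts: list, window_seconds: float) -> dict:
      """attempts: one dict per attempt with 'output_tokens' and 'succeeded'."""
      if window_seconds <= 0:
          raise ValueError("window_seconds must be positive")
      delivered = sum(a["output_tokens"] for a in attempts if a["succeeded"])
      consumed = sum(a["output_tokens"] for a in attempts)
      return {
          "delivered_tps": delivered / window_seconds,  # tokens users actually received
          "consumed_tps": consumed / window_seconds,    # tokens capacity actually produced
          # A large gap means capacity is burned on failed or duplicated attempts.
          "waste_ratio": (consumed - delivered) / consumed if consumed else 0.0,
      }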

How to verify

Minimum logging (per request/span); a record sketch follows the list:

  • Identifiers: run_id, span_id, and a stable request identifier.
  • Timestamps: queued_at, admitted_at, first_token_at, last_token_at, released_at.
  • Counts: output_tokens, input_tokens (if available), retry_count, and retry causes.
  • Scheduler state: cap (configured) and instantaneous in_flight sampled on a fixed cadence.
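
A sketch of that record as a Python dataclass, assuming timestamps are stored as epoch-second floats; the field names mirror this note, and everything else (types, defaults) is an assumption:

  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class RequestRecord:
      run_id: str
      span_id: str
      request_id: str                          # stable across retries of the same request
      queued_at: float                         # epoch seconds
      admitted_at: float
      released_at: float
      first_token_at: Optional[float] = None   # None if the stream never started
      last_token_at: Optional[float] = None
      output_tokens: int = 0
      input_tokens: Optional[int] = None       # if the provider reports it
      retry_count: int = 0
      retry_causes: List[str] = field(default_factory=list)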

Derived metrics to compute (a sketch in code follows the list):

  • t_queue = admitted_at - queued_at
  • t_first_token = first_token_at - admitted_at
  • t_stream = last_token_at - first_token_at
  • t_in_flight = released_at - admitted_at
  • stream_tps = output_tokens / t_stream
  • end_to_end_tps = output_tokens / t_in_flight
  • gap_fraction from (queue_depth > 0) && (in_flight < cap)
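
A sketch of these derivations, plus the aggregate system TPS from the Mechanism section; it takes raw timestamps so it stands alone, and gap_fraction still comes from the sampled scheduler state shown earlier rather than from any single request:

  def derive_request_metrics(queued_at: float, admitted_at: float,
                             first_token_at: float, last_token_at: float,
                             released_at: float, output_tokens: int) -> dict:
      """Per-request timing and throughput metrics from raw timestamps (seconds)."""
      t_stream = last_token_at - first_token_at
      t_in_flight = released_at - admitted_at
      return {
          "t_queue": admitted_at - queued_at,
          "t_first_token": first_token_at - admitted_at,
          "t_stream": t_stream,
          "t_in_flight": t_in_flight,
          # Guard against zero-length streams (e.g. single-token outputs).
          "stream_tps": output_tokens / t_stream if t_stream > 0 else 0.0,
          "end_to_end_tps": output_tokens / t_in_flight if t_in_flight > 0 else 0.0,
      }

  def system_tps(output_token_counts: list, window_seconds: float) -> float:
      """Aggregate throughput: successful output tokens over a fixed wall-clock window."""
      return sum(output_token_counts) / window_seconds if window_seconds > 0 else 0.0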

Acceptance criteria (do not skip; a check sketch follows the list):

  • System-level throughput improves without reducing success rate.
  • gap_fraction decreases when backlog exists (proof that you removed idle time, not just shifted it).
  • Retry rate does not increase; if it does, throughput gains are likely unstable.
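
A sketch of these criteria as a comparison between a baseline window and a candidate window; the summary keys (system_tps, success_rate, gap_fraction, retry_rate) are illustrative names for metrics defined above:

  def acceptance_check(baseline: dict, candidate: dict) -> dict:
      """Every entry must be True before declaring a throughput win."""
      return {
          "system_tps_improved": candidate["system_tps"] > baseline["system_tps"],
          "success_rate_held": candidate["success_rate"] >= baseline["success_rate"],
          # Only meaningful for windows in which backlog actually existed.
          "gap_fraction_reduced": candidate["gap_fraction"] < baseline["gap_fraction"],
          "retry_rate_held": candidate["retry_rate"] <= baseline["retry_rate"],
      }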

Scope boundary

This note covers: definitions and measurement of tokens per second, and how to interpret TPS alongside queueing and in-flight utilization to detect idle time.

This note excludes: prompt/content changes, model selection, training/fine-tuning, benchmark claims, and any “best TPS” numbers not derived from your own logs.