Tokens Per Second (LLM Inference)
What tokens per second measures, how it differs from end-to-end throughput, and how to use it to detect idle time in LLM inference pipelines.
Published: 2026-01-01 · Last updated: 2026-01-01
Tokens per second is a throughput measure: how fast a model or pipeline produces useful tokens. It is only actionable when you also measure when the system is not producing tokens despite having work (queue_depth > 0), and when you can separate generation time from connection, queueing, and orchestration overhead.
Mechanism
There are at least three “tokens per second” metrics; if you mix them, you will optimize the wrong thing:
- Stream TPS (per request): output_tokens / t_stream, where t_stream = first_token → last_token. This mostly reflects model + server-side generation speed once the stream starts.
- End-to-end TPS (per request): output_tokens / t_total, where t_total = admitted → released (or user-visible start → completion). This includes time to first token, stalls, post-processing, and any idle gaps caused by orchestration.
- System TPS (aggregate): sum(output_tokens) / window_seconds across all successful requests. This is the metric that drives time-billed cost per completion, because it correlates with completions/hour when success rate is stable.
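A minimal sketch of the three definitions above, assuming output token counts and epoch-second timestamps are already available (the function names are illustrative, not a standard API):

```python
def stream_tps(output_tokens: int, first_token_at: float, last_token_at: float) -> float:
    # Per-request emission rate once the stream has started (output tokens only).
    return output_tokens / (last_token_at - first_token_at)

def end_to_end_tps(output_tokens: int, admitted_at: float, released_at: float) -> float:
    # Per-request rate including time to first token, stalls, and post-processing.
    return output_tokens / (released_at - admitted_at)

def system_tps(output_token_counts: list[int], window_seconds: float) -> float:
    # Aggregate rate across all successful requests in a fixed window.
    # Note: this is not the mean of per-request stream TPS; idle gaps between
    # streams lower the aggregate without lowering any per-request number.
    return sum(output_token_counts) / window_seconds
```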
The failure mode you are trying to detect is idle time: the system has a backlog, but it is not keeping in-flight streams at the cap. Define an explicit gap metric:
- Let gap_time be the total time in a window where queue_depth > 0 and in_flight < cap.
- Let gap_fraction = gap_time / window_time.
If gap_fraction is non-trivial, your throughput problem is likely not “the model is slow.” It is admission, connection churn, blocked orchestration, or downstream stalls that prevent work-conserving scheduling.
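A sketch of the gap metric, assuming scheduler state is sampled on a fixed cadence as described under "How to verify" below; the samples are hypothetical (queue_depth, in_flight) pairs:

```python
def gap_fraction(samples: list[tuple[int, int]], cap: int, sample_interval_s: float) -> float:
    # samples: (queue_depth, in_flight) pairs taken at a fixed cadence.
    # A sample counts toward gap_time when there is backlog but the
    # scheduler is not keeping in-flight streams at the cap.
    if not samples:
        return 0.0
    gap_samples = sum(
        1 for queue_depth, in_flight in samples
        if queue_depth > 0 and in_flight < cap
    )
    gap_time = gap_samples * sample_interval_s
    window_time = len(samples) * sample_interval_s
    return gap_time / window_time
```

With a 1-second cadence, gap_fraction([(5, 2), (5, 4), (0, 4)], cap=4, sample_interval_s=1.0) returns 1/3: only the first sample had backlog while in_flight was below the cap.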
Common failure modes
- Counting the wrong tokens: provider “token usage” often includes input tokens; “tokens per second” usually refers to output token emission rate. Make it explicit whether you mean output, input, or total.
- Tokenizer/model mismatch: token counts differ across models and tokenizers; do not compare TPS across providers/models unless you normalize tokenization or use the same accounting source consistently.
- Measuring only during streaming: a high stream TPS can coexist with poor end-to-end throughput if time-to-first-token is high or if the pipeline has idle gaps between streams.
- Ignoring failures and retries: if retries duplicate work, “TPS on successful requests” can look good while system-level TPS is poor because capacity is consumed by failed attempts.
- Sampling bias: measuring only short requests hides tail behavior; long requests tend to surface queueing, stalls, and disconnect handling problems.
- No stable admission timestamps: if you do not log queued_at, admitted_at, first_token_at, last_token_at, and released_at, you cannot separate queueing from streaming and cannot compute gap_fraction.
How to verify
Minimum logging (per request/span):
- Identifiers: run_id, span_id, and a stable request identifier.
- Timestamps: queued_at, admitted_at, first_token_at, last_token_at, released_at.
- Counts: output_tokens, input_tokens (if available), retry_count, and retry causes.
- Scheduler state: cap (configured) and instantaneous in_flight sampled on a fixed cadence.
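A minimal sketch of this logging schema, assuming JSON-lines output and epoch-second timestamps; the class and field names mirror the list above but are illustrative, not a standard:

```python
from dataclasses import dataclass, asdict, field
import json

@dataclass
class RequestSpan:
    # Identifiers
    run_id: str
    span_id: str
    request_id: str
    # Timestamps (epoch seconds)
    queued_at: float
    admitted_at: float
    first_token_at: float
    last_token_at: float
    released_at: float
    # Counts
    output_tokens: int
    input_tokens: int | None = None
    retry_count: int = 0
    retry_causes: list[str] = field(default_factory=list)

    def to_log_line(self) -> str:
        # One JSON object per line keeps the records easy to aggregate later.
        return json.dumps(asdict(self))

@dataclass
class SchedulerSample:
    # Sampled on a fixed cadence, independently of any single request.
    sampled_at: float
    queue_depth: int
    in_flight: int
    cap: int
```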
Derived metrics to compute:
- t_queue = admitted_at - queued_at
- t_first_token = first_token_at - admitted_at
- t_stream = last_token_at - first_token_at
- t_in_flight = released_at - admitted_at
- stream_tps = output_tokens / t_stream
- end_to_end_tps = output_tokens / t_in_flight
- gap_fraction from (queue_depth > 0) && (in_flight < cap)
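These reduce to a few subtractions and divisions per request; a sketch, assuming epoch-second timestamps and guarding against zero-length intervals (gap_fraction comes from sampled scheduler state, not from a single request):

```python
def derived_metrics(queued_at: float, admitted_at: float, first_token_at: float,
                    last_token_at: float, released_at: float,
                    output_tokens: int) -> dict[str, float]:
    # Per-request derived metrics from the logged timestamps and counts.
    t_queue = admitted_at - queued_at
    t_first_token = first_token_at - admitted_at
    t_stream = last_token_at - first_token_at
    t_in_flight = released_at - admitted_at
    return {
        "t_queue": t_queue,
        "t_first_token": t_first_token,
        "t_stream": t_stream,
        "t_in_flight": t_in_flight,
        "stream_tps": output_tokens / t_stream if t_stream > 0 else 0.0,
        "end_to_end_tps": output_tokens / t_in_flight if t_in_flight > 0 else 0.0,
    }
```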
Acceptance criteria (do not skip):
- System-level throughput improves without reducing success rate.
- gap_fraction decreases when backlog exists (proof that you removed idle time, not just shifted it).
- Retry rate does not increase; if it does, throughput gains are likely unstable.
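One way to mechanize the check, sketched under the assumption that each measurement window is summarized into system_tps, success_rate, gap_fraction, and retry_rate (the key names are illustrative):

```python
def acceptance_check(before: dict[str, float], after: dict[str, float]) -> list[str]:
    # Returns the list of violated criteria; an empty list means the change passes.
    failures = []
    if after["system_tps"] <= before["system_tps"]:
        failures.append("system-level TPS did not improve")
    if after["success_rate"] < before["success_rate"]:
        failures.append("success rate regressed")
    if after["gap_fraction"] >= before["gap_fraction"]:
        failures.append("gap_fraction did not decrease while backlog existed")
    if after["retry_rate"] > before["retry_rate"]:
        failures.append("retry rate increased; throughput gains are likely unstable")
    return failures
```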
Internal links
- Up: AI inference cost optimization guide
- Index: /office
- Sibling: Why AI/LLM inference is slow in production
Scope boundary
This note covers: definitions and measurement of tokens per second, and how to interpret TPS alongside queueing and in-flight utilization to detect idle time.
This note excludes: prompt/content changes, model selection, training/fine-tuning, benchmark claims, and any “best TPS” numbers not derived from your own logs.