HTTP Connection Reuse for LLM Inference

How HTTP connection reuse affects LLM inference throughput and time to first token, plus verification steps and failure modes.

Published: 2026-01-01 · Last updated: 2026-01-01

HTTP connection reuse reduces avoidable connection setup overhead (TCP/TLS handshakes, slow-start effects, and tail variance) and stabilizes time-to-first-token. It matters for throughput because concurrency controls only work when requests can be admitted and start streaming without repeatedly paying a high-variance setup cost.

Mechanism

Connection setup is not “free,” and it is not constant:

  • TCP establishment and TLS negotiation add latency and variability (packet loss, retransmits, and path changes show up as tail latency).
  • New connections often start with conservative congestion windows; short bursts can be slower than steady-state flows.
  • Load balancers and NATs impose idle timeouts; if you churn connections, you amplify these effects.

For streaming LLM inference (SSE), the control-plane interval that usually benefits is t_first_token:

  • t_first_token = first_token_at - admitted_at
  • When you reuse a warm connection, the portion of t_first_token attributable to connection setup (t_conn) should shrink; a measurement sketch follows this list.
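
A minimal sketch of that measurement, assuming the Python httpx library and an SSE endpoint at a placeholder path; the client is the long-lived one sketched under "correct reuse" below:

  import time

  import httpx

  def measure_ttft(client: httpx.Client, payload: dict) -> float:
      """Time from admission to the first SSE data line, in seconds."""
      admitted_at = time.monotonic()
      first_token_at = None
      with client.stream("POST", "/v1/completions", json=payload) as response:
          response.raise_for_status()
          for line in response.iter_lines():
              if first_token_at is None and line.startswith("data: "):
                  first_token_at = time.monotonic()
              # Keep draining the stream so the connection can return to the pool.
      return (first_token_at or time.monotonic()) - admitted_at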

“Correct reuse” is operational, not conceptual (a minimal client sketch follows this list):

  • Single long-lived client per process (or per worker), not instantiated per request.
  • Pooling enabled and sized: max idle connections, max total connections, and per-host limits.
  • Clean stream closure: when the SSE stream ends, the response body must be fully closed so the connection returns to the pool.
  • Stable host selection: if you spray requests across many hostnames or disable keep-alive at the proxy, reuse rate will be low regardless of client code.
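
A minimal sketch of those four points, assuming the Python httpx library and an OpenAI-style SSE endpoint; the hostname, path, and payload shape are placeholders, not any specific provider's API:

  import httpx

  # One client per process (or per worker): pooling only helps if the client
  # outlives individual requests.
  client = httpx.Client(
      base_url="https://inference.example.com",  # placeholder endpoint
      limits=httpx.Limits(max_connections=100, max_keepalive_connections=20),
      # Explicit timeouts: a generous read timeout for long streams, tighter
      # connect/write/pool timeouts.
      timeout=httpx.Timeout(connect=5.0, read=120.0, write=10.0, pool=5.0),
  )

  def stream_completion(payload: dict) -> list[str]:
      chunks = []
      # The context manager closes the response body even on errors, so the
      # connection returns to the pool instead of being stranded.
      with client.stream("POST", "/v1/completions", json=payload) as response:
          response.raise_for_status()
          for line in response.iter_lines():
              if line.startswith("data: "):
                  data = line[len("data: "):]
                  if data == "[DONE]":
                      break
                  chunks.append(data)
      return chunks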

Configuration is part of correctness. Keep-alive idle timeouts, per-host pool limits, and connect/read/total timeouts must be aligned with intermediaries (LBs, NATs, corporate proxies). Misalignment typically shows up as reuse that “works in dev” but collapses under real load into reconnect churn and TTFT variance.
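
A sketch of that alignment, again assuming httpx; the load-balancer idle timeout is a placeholder you would read from your own infrastructure:

  import httpx

  LB_IDLE_TIMEOUT_S = 60.0  # hypothetical intermediary idle timeout (LB/NAT/proxy)

  limits = httpx.Limits(
      max_connections=100,
      max_keepalive_connections=20,
      # Expire idle sockets before the intermediary does, so the pool never
      # hands out a connection the other side has already torn down.
      keepalive_expiry=LB_IDLE_TIMEOUT_S * 0.8,
  )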

HTTP/2 is not a magic flag. It helps when:

  • you maintain a small number of long-lived connections
  • you actually have concurrent streams to multiplex

If you enable HTTP/2 but still create a new client/connection per request, you will see the costs of churn without the benefits of multiplexing.
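
A sketch of the combination that does help, assuming httpx with its optional HTTP/2 extra installed (pip install "httpx[http2]"); the endpoint and payloads are placeholders:

  import asyncio

  import httpx

  async def main() -> None:
      # One long-lived client; http2=True pays off only because the requests
      # below run concurrently and can be multiplexed over few connections.
      async with httpx.AsyncClient(
          base_url="https://inference.example.com",  # placeholder endpoint
          http2=True,
      ) as client:
          payloads = [{"prompt": f"request {i}"} for i in range(8)]
          responses = await asyncio.gather(
              *(client.post("/v1/completions", json=p) for p in payloads)
          )
          for r in responses:
              r.raise_for_status()

  asyncio.run(main())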

Common failure modes

  • Client instantiated per call: new session objects disable pooling by construction.
  • Leaked responses: failing to close the SSE stream keeps sockets occupied; the pool cannot reuse them and will open new ones under load.
  • Keep-alive defeated by proxies: some reverse proxies close upstream connections aggressively; the client thinks it is reusing but the upstream is not.
  • Idle timeouts and reconnect storms: long-running streams can be cut by intermediaries; if reconnect logic has no backoff or budget, the resulting storms surface as “random” TTFT spikes (a backoff sketch follows this list).
  • Mixed DNS and hostnames: different hostnames (or frequently changing endpoints) fragment the pool and reduce reuse.
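
The backoff sketch referenced above, assuming httpx and the hypothetical stream_completion helper from the earlier client sketch:

  import random
  import time

  import httpx

  MAX_ATTEMPTS = 4      # reconnect budget per logical request
  BASE_BACKOFF_S = 0.5

  def stream_with_backoff(payload: dict) -> list[str]:
      for attempt in range(MAX_ATTEMPTS):
          try:
              return stream_completion(payload)
          except httpx.TransportError:
              if attempt == MAX_ATTEMPTS - 1:
                  raise
              # Exponential backoff with jitter keeps a fleet of clients from
              # reconnecting in lockstep after an intermediary cuts streams.
              time.sleep(BASE_BACKOFF_S * (2 ** attempt) * (0.5 + random.random()))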

How to verify

Minimum measurements (an aggregation sketch follows this list):

  • new_connections_per_success: count new TCP/TLS sessions opened per successful completion.
  • p95(t_first_token) and p95(t_conn) (if you can instrument connection setup time separately).
  • Pool stats, if available (active/idle connections; reuse rate; per-host connection counts).
  • When possible, measure at both the client process and the proxy/LB layer; one layer can hide churn in another.
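
An aggregation sketch over per-request records; the field names (status, new_connection, admitted_at, first_token_at) are assumptions about your own logging schema:

  from statistics import quantiles

  def summarize(records: list[dict]) -> dict:
      successes = [r for r in records if r["status"] == "ok"]
      new_conns = sum(r.get("new_connection", 0) for r in records)
      ttfts = sorted(r["first_token_at"] - r["admitted_at"] for r in successes)
      return {
          "new_connections_per_success": new_conns / max(len(successes), 1),
          # quantiles(..., n=20) returns the 5th..95th percentile cut points;
          # the last one is p95 (needs at least two samples).
          "p95_t_first_token": quantiles(ttfts, n=20)[-1] if len(ttfts) >= 2 else None,
      }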

Practical verification steps:

  • Run a controlled load with a fixed backlog and compare “new connections per success” before/after.
  • Confirm the client closes streams on [DONE] (or equivalent) and on cancellation/timeouts.
  • Inspect system-level socket telemetry (e.g., established connections, SYN rate) to confirm connection churn actually decreased; a socket-count sketch follows this list.
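
A socket-count sketch, assuming the third-party psutil package; the IP and port are placeholders for your resolved inference endpoint:

  import psutil

  HOST = "203.0.113.10"  # hypothetical resolved IP of the inference endpoint
  PORT = 443

  # System-wide connection listing may require elevated privileges on some platforms.
  established = [
      c for c in psutil.net_connections(kind="tcp")
      if c.status == psutil.CONN_ESTABLISHED
      and c.raddr and c.raddr.ip == HOST and c.raddr.port == PORT
  ]
  print(f"established connections to {HOST}:{PORT}: {len(established)}")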

Acceptance criteria (a check sketch follows this list):

  • New connection rate drops after warm-up without increasing error rate.
  • p95(t_first_token) improves or becomes less variable; if it worsens, you may be leaking connections or fighting proxy timeouts.
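
A check sketch comparing a baseline run against a candidate run, using the hypothetical summarize() output from the verification section above:

  def accept(baseline: dict, candidate: dict,
             error_rate_before: float, error_rate_after: float) -> bool:
      fewer_new_conns = (
          candidate["new_connections_per_success"]
          < baseline["new_connections_per_success"]
      )
      ttft_not_worse = candidate["p95_t_first_token"] <= baseline["p95_t_first_token"]
      errors_not_worse = error_rate_after <= error_rate_before
      return fewer_new_conns and ttft_not_worse and errors_not_worse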

Scope boundary

This note covers: connection reuse mechanics for HTTP keep-alive/HTTP2 and SSE streaming in LLM inference clients, and how to verify reuse from telemetry.

This note excludes: provider-side infrastructure claims, kernel-level tuning, and any benchmark numbers not produced by a published harness and raw measurements.