Office of AI Throughput & Cost Control
Institutional notes on removing avoidable idle time in AI/LLM inference pipelines to reduce cost per successful completion.
Published: 2026-01-01 · Last updated: 2026-01-04
Office notes
- Tokens per second (throughput vs idle time)
- HTTP connection reuse for LLM inference
- Bounded concurrency limits (in-flight caps; sketched together with connection reuse after this list)
- Why AI/LLM inference is slow in production
- How a phone terminal survives reconnects
- Terminal session streaming: minimal message shape
- Output buffering and replay for long-running sessions
- Session IDs and admin controls for browser terminals
- Mobile terminal rotation: resize without corruption
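Two of the pipeline notes above, HTTP connection reuse and in-flight caps, compose naturally, and a short sketch shows the shape of that composition. This is illustrative only: it assumes an OpenAI-style /v1/completions endpoint on localhost:8000, and MAX_IN_FLIGHT, the timeout, and the response shape are placeholder assumptions, not measured recommendations.

```python
# Illustrative sketch: combines connection reuse with a bounded in-flight cap.
# Endpoint, payload shape, and limits are hypothetical placeholders.
import asyncio
import httpx

MAX_IN_FLIGHT = 8  # assumed cap; tune against observed queue depth

async def run_batch(prompts: list[str]) -> list[str]:
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    # One client = one connection pool; reusing it avoids a fresh TCP/TLS
    # handshake per request (a common source of avoidable idle time).
    async with httpx.AsyncClient(base_url="http://localhost:8000",
                                 timeout=60.0) as client:
        async def one(prompt: str) -> str:
            async with limiter:  # cap concurrent in-flight requests
                resp = await client.post("/v1/completions",
                                         json={"prompt": prompt})
                resp.raise_for_status()
                return resp.json()["choices"][0]["text"]
        return await asyncio.gather(*(one(p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(run_batch(["hello", "world"])))
```

Holding one client open amortizes connection setup across requests, while the semaphore keeps queue depth bounded so slow completions exert backpressure instead of producing unbounded fan-out.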
Office Briefings (stub)
Office Briefings are paid, evidence-backed decision aids for specific throughput and cost-control questions. They are scoped, non-promissory, and designed to be reviewed by a senior engineer before adoption.
Scope boundary
This index covers Office materials on throughput and cost control via idle-time elimination. That includes both:
- inference pipeline controls (connections, concurrency, backpressure)
- operator workstation surfaces that reduce recovery friction (persistent sessions, reconnect, replay; a replay sketch follows this section)
It excludes: training and fine-tuning, prompt/content strategy, vendor pricing claims, and benchmark numbers that are not backed by a reproducible harness and raw measurements.
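To make the in-scope workstation surface concrete, here is a minimal replay-buffer sketch for the "persistent sessions, reconnect, replay" pattern named above. The class name, the 1000-chunk retention window, and the gap-handling convention are assumptions made for illustration, not the Office's published design; a production terminal would also bound retention by bytes and expire idle sessions.

```python
# Minimal replay-buffer sketch for reconnecting terminal sessions.
# SessionBuffer and MAX_CHUNKS are hypothetical names, not a real API.
from collections import deque

MAX_CHUNKS = 1000  # assumed retention window, not a measured figure

class SessionBuffer:
    """Keeps the last MAX_CHUNKS output chunks, each tagged with a
    monotonically increasing sequence number, so a reconnecting client
    can ask for 'everything after seq N' instead of a full restart."""

    def __init__(self) -> None:
        self._chunks: deque[tuple[int, bytes]] = deque(maxlen=MAX_CHUNKS)
        self._next_seq = 0

    def append(self, data: bytes) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._chunks.append((seq, data))
        return seq

    def replay_after(self, last_seen: int) -> list[tuple[int, bytes]]:
        # Chunks older than the retention window are gone; callers should
        # treat a gap (first seq > last_seen + 1) as "full resync required".
        return [(s, d) for s, d in self._chunks if s > last_seen]

buf = SessionBuffer()
for line in (b"build started\n", b"tests passed\n"):
    buf.append(line)
assert buf.replay_after(0) == [(1, b"tests passed\n")]
```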