How do you optimize latency and streaming for real-time LLMs?

By: Ari Heljakka

Short answer

A real-time LLM application is any system where time-to-first-token (TTFT) and inter-token latency (ITL) bound the user experience: chat assistants, voice agents, copilots, live translation, RAG search. Five primitives carry most of the latency improvement: continuous batching with PagedAttention, speculative decoding, semantic caching, weight and KV quantization, and tensor parallelism. None of them is free; each shifts latency at the cost of memory, accuracy, infrastructure complexity, or throughput on cold paths. The defensible practice is to measure TTFT, ITL, and quality on a calibrated workload after every change, treat latency budgets as first-class operational objectives, and gate releases on both the latency targets and the quality scores so that one does not silently degrade the other.

Key facts

  • Definition: A set of techniques (batching, decoding, caching, quantization, parallelism) that reduce TTFT and ITL for LLM inference without unacceptable quality loss.
  • When to use: Any user-facing LLM application where perceived response time is part of the product experience.
  • Limitations: Every technique trades latency for some other dimension (memory, accuracy, cost, complexity); no technique is dominant on all of them.
  • Example: Adding continuous batching plus speculative decoding to a chat backend can lift throughput substantially under mixed-length load. vLLM/PagedAttention reports roughly 2 to 4 times higher throughput than state-of-the-art serving systems, while a roughly 40 times gain belongs to a specific static-batching comparison, not a universal expectation.

Key takeaways

  • TTFT and ITL are independent metrics; optimize them with their own techniques.
  • Continuous batching is the highest-leverage single change for most multi-tenant servers.
  • Quantization is free until it is not; calibrate against quality before shipping.
  • Semantic caching cuts latency on the common path and adds only a small lookup cost to the long tail.
  • Every latency improvement needs a quality regression gate, or the wins are imaginary.

Definition

Real-time LLM inference is the operating regime where the user perceives the time between request and first response token, and between successive tokens. Two metrics define it.

  • Time-to-first-token (TTFT). The time from request submission to the first streamed token. Dominated by prefill (encoding the prompt) and queueing.
  • Inter-token latency (ITL), also called time per output token (TPOT). The time between successive output tokens. Dominated by decode and memory bandwidth.
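Both metrics can be read off any token stream by timestamping arrivals. The following is a minimal sketch, assuming the stream is an ordinary Python iterable; `measure_stream` and `percentile` are illustrative names, not part of any serving library, and the clock starts when the function is called rather than when a real client would send the request.

```python
import math
import time
from typing import Callable, Iterable, List, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Return (ttft_seconds, inter_token_gaps_seconds) for one streamed response.

    Assumption: calling this function stands in for request submission.
    """
    start = clock()
    arrivals: List[float] = []
    for _ in tokens:
        arrivals.append(clock())  # timestamp each token as it arrives
    if not arrivals:
        raise ValueError("stream produced no tokens")
    ttft = arrivals[0] - start
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return ttft, gaps


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting `ttft` and the gaps across many requests and reporting `percentile(..., 95)` gives the p95 figures that latency targets are usually written against.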

Useful targets vary by application: voice agents and conversational copilots often target TTFT below 500 ms and ITL below 200 ms; autocomplete, inline copilots, and other extreme interactive surfaces may target TTFT and ITL below 50 ms. Full RAG search and live translation can have tight budgets too, but 50 ms should be treated as an aggressive product-specific target rather than the typical requirement. Batch and offline jobs do not belong in this regime; their constraint is throughput per dollar, not perceived latency.

When this matters

  • Conversational interfaces. Anything where a user is waiting; the perceived quality of the system is bounded by TTFT.
  • Voice agents. End-to-end latency from speech to speech includes ASR, LLM, TTS; the LLM share is typically the largest single contributor.
  • Streaming retrieval-augmented generation. RAG inflates TTFT through retrieval; the budget for the LLM is what is left after the retriever.
  • Copilots and inline assistance. Tab-completion-style features have an ITL budget measured in tens of milliseconds.
  • Multi-tenant production. Throughput under concurrent load matters as much as single-request latency; without batching, the server falls over at modest concurrency.

How it works

Five primitives carry most of the wins. Each has a known mechanism and a known cost.

Continuous batching with PagedAttention

The default batching strategy in many serving stacks waits for a fixed batch to fill before processing, which blocks new requests behind in-flight ones. Continuous batching lets new requests join the in-flight batch as soon as previous sequences finish, keeping the GPU busy across a stream of variable-length requests.
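The scheduling difference can be illustrated with a toy simulation. This is a sketch under strong assumptions: every decode step costs the same regardless of batch occupancy, prefill is ignored, and all requests are already queued. The function names and the request lengths are hypothetical.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(lengths)
    running: List[int] = []  # remaining decode steps per in-flight sequence
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        # Sequences with one step left finish now and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 50, 3, 4, 60, 2, 3, 5]  # hypothetical output lengths
    print(static_batching_steps(lengths, slots=4))      # 110
    print(continuous_batching_steps(lengths, slots=4))  # 62
```

In this toy, short requests no longer wait for the longest member of their batch, which is where the saving comes from; real gains depend on the traffic mix and the serving stack.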

PagedAttention manages the key-value cache in fixed blocks (commonly 16 tokens) across non-contiguous GPU memory, eliminating the over-reservation and fragmentation that contiguous allocation traditionally incurs, reported at 60 to 80 percent of KV-cache memory on variable-length workloads. With paging, reported waste drops below 4 percent.
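The arithmetic behind the waste figures can be sketched by comparing the two reservation policies. The sequence lengths and the 2,048-token maximum below are hypothetical, and the model ignores allocator metadata and block sharing.

```python
import math
from typing import List


def contiguous_waste(seq_lens: List[int], max_len: int) -> float:
    """Fraction of reserved token slots unused when every sequence
    reserves a contiguous region sized for max_len tokens."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens: List[int], block_size: int = 16) -> float:
    """Fraction unused when slots are handed out in fixed-size blocks;
    only the last block of each sequence can be partially empty."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved


if __name__ == "__main__":
    seq_lens = [120, 700, 45, 1900, 310]  # hypothetical actual lengths
    print(f"contiguous: {contiguous_waste(seq_lens, max_len=2048):.1%}")
    print(f"paged:      {paged_waste(seq_lens):.1%}")
```

With block allocation the waste per sequence is bounded by one block minus one token, so it shrinks as sequences get longer.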

The combination delivers large throughput gains under realistic mixed-length traffic. The PagedAttention paper reports roughly 2 to 4 times higher throughput than prior state-of-the-art serving systems. It also includes benchmark-specific comparisons where static batching collapses under high sequence variance; the jump from roughly 81 to 3,480 tokens per second (about 40 times) belongs to that static-batching comparison and should not be quoted as a general PagedAttention headline.

Cost. None on quality. Modest implementation complexity if you are not already on a serving stack that supports it.

Speculative decoding

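A small draft model proposes several tokens ahead, one after another; the large target model then scores all of them in a single forward pass and rejects the proposals from the first mismatch onward. When the draft is accurate, multiple tokens commit per target-model forward pass; when it is wrong, the round still commits the target model's own token at the mismatch, so decoding falls back to roughly the standard rate plus the cost of the wasted draft.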
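The draft-then-verify loop can be sketched for the greedy-decoding case. This is a toy under stated assumptions: models are plain functions from a context to the next token id, `target` and `draft` below are made-up stand-ins, and the verify loop calls the target once per position where a real server would score all positions in one forward pass.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round; returns the tokens committed in it (between 1 and k + 1)."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # Verify phase: compare each proposal with the target's own choice.
    committed: List[int] = []
    for guess in proposed:
        expected = target(context + committed)
        committed.append(expected)  # always commit the target's token
        if guess != expected:
            return committed  # first mismatch ends the round
    committed.append(target(context + committed))  # all accepted: bonus token
    return committed


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (prompt plus n_tokens generated tokens, number of rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        rounds += 1
    return out[:len(prompt) + n_tokens], rounds


def target(ctx: Sequence[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: Sequence[int]) -> int:
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Because only the target's tokens are ever committed, the output matches plain greedy decoding with the target; the draft only changes how many target passes are needed. Sampling-based decoding needs a rejection-sampling acceptance rule instead of the equality check used here.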

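Reported speedups: 1.5 times on general chat traffic, 2.8 times on summarization workloads using prompt lookup decoding, a variant that drafts by matching n-grams from the prompt instead of running a separate draft model. The acceptance rate is the lever; it depends on draft-target similarity.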
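Why the acceptance rate is the lever can be seen from a standard back-of-envelope estimate. The sketch below assumes each drafted token is accepted independently with the same probability `alpha`, which real traffic only approximates; the function name is illustrative.

```python
def expected_tokens_per_target_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass when k tokens are
    drafted and each is accepted independently with probability alpha.

    Sums 1 + alpha + alpha**2 + ... + alpha**k (geometric series).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


if __name__ == "__main__":
    for alpha in (0.3, 0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_target_pass(alpha, k=4), 2))
```

This counts tokens per target pass, not wall-clock speedup: the time spent running the draft model has to be subtracted before the number means anything for ITL.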

Cost. Draft model must share vocabulary and tokenizer. Memory overhead for the draft model. Quality is preserved by construction (target model verifies every committed token).

Semantic caching

Embed incoming requests; look up nearest neighbors in a vector index; if similarity exceeds a threshold (commonly around 0.90 to 0.95 cosine for conservative production use), return the cached response.

Adds 5 to 20 ms for the embedding and lookup; saves 1 to 5 seconds on a cache hit. Production reports describe cache-hit rates that meaningfully reduce both LLM cost and tail latency on conversational workloads; the win concentrates on the head of the distribution.
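The trade can be written as simple arithmetic. This is a sketch assuming every request pays the lookup, a hit returns immediately after it, and a miss pays the full generation time on top; the function names and the example numbers are illustrative.

```python
def mean_latency_ms(hit_rate: float, lookup_ms: float, generate_ms: float) -> float:
    """Mean latency with a cache in front of the model."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    return lookup_ms + (1 - hit_rate) * generate_ms


def break_even_hit_rate(lookup_ms: float, generate_ms: float) -> float:
    """Hit rate above which the cache lowers mean latency."""
    return lookup_ms / generate_ms


if __name__ == "__main__":
    # 20 ms lookup, 1 s generation, 28 percent hit rate (illustrative values)
    print(mean_latency_ms(0.28, lookup_ms=20, generate_ms=1000))
    print(break_even_hit_rate(lookup_ms=20, generate_ms=1000))
```

Mean latency understates the effect on hits and overstates it on misses, so percentile targets should still be checked on the measured distribution rather than derived from this formula.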

Cost. Stale responses on dynamic content; mis-hits when the threshold is too loose. Requires a TTL strategy (5 to 15 minutes for dynamic content, hours for static). Needs an eval that scores cached responses against fresh ones on a sample of traffic; otherwise the cache silently degrades quality.
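The lookup, threshold, and TTL described above fit in a few lines. This is a minimal sketch, not a production design: a linear scan stands in for the vector index, embeddings are supplied by the caller (the embedding model is outside the sketch), and `SemanticCache` is a hypothetical class, not a library API.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Entry:
    embedding: List[float]
    response: str
    expires_at: float


class SemanticCache:
    def __init__(self, threshold: float = 0.92, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries: List[Entry] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        expires = self.clock() + self.ttl_seconds
        self.entries.append(Entry(list(embedding), response, expires))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        now = self.clock()
        # Drop expired entries first so stale answers are never served.
        self.entries = [e for e in self.entries if e.expires_at > now]
        best_score, best = -1.0, None
        for entry in self.entries:  # linear scan; a real index would be ANN
            score = cosine(entry.embedding, embedding)
            if score > best_score:
                best_score, best = score, entry
        if best is not None and best_score >= self.threshold:
            return best.response
        return None
```

A miss returns `None`, at which point the caller generates a fresh response and calls `put`. The sampled audit of cached against fresh responses sits outside this class and is what catches a threshold that is too loose.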

Quantization

Convert weights and activations from 16-bit floats to lower precision (INT8, FP8, INT4). A 7 B parameter model drops from 14 GB at FP16 to roughly 4 GB at INT4. Newer accelerator generations natively support FP8 paths.
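The memory figures follow from bits per weight, and the source of quality loss is visible in a round trip. The sketch below uses per-tensor symmetric INT8 purely for illustration; it is an assumption, not the scheme used by any system cited here, and the function names are hypothetical.

```python
from typing import List, Tuple


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10**9 bytes), excluding scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127 if peak > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    return [q * scale for q in quantized]


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(bits, weight_memory_gb(7e9, bits))  # 14.0, 7.0, 3.5

    weights = [0.5, -1.27, 0.003, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print(max(abs(a - b) for a, b in zip(weights, restored)))
```

The raw INT4 figure is 3.5 GB; per-group scales and any layers kept at higher precision add to that in real checkpoints. The rounding error per weight is bounded by half the scale, so a single large outlier coarsens the grid for every other value in the tensor.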

Databricks reports FP8 serving for Llama2-70B on NVIDIA H100 with 2.2 times throughput improvement, around 30 percent TTFT improvement, and 50 percent model size reduction versus FP16. Treat those numbers as model-, hardware-, and stack-specific. SmoothQuant supports a different claim: W8A8 INT8 quantization with up to 1.56 times speedup and 2 times memory reduction. KV-cache quantization (INT8 or FP8) can enable larger batch sizes on the same hardware, but the exact multiplier depends on context length, runtime, and GPU memory.

Cost. Quality regression on some tasks is real; needs to be measured per workload. INT4 is more aggressive and more likely to dip on math, code, and long-context tasks. Activation quantization (versus weight-only) is more aggressive still.

Tensor parallelism

Split model weights across multiple GPUs, with a synchronized AllReduce in every layer. Weight memory per GPU shrinks roughly in proportion to the tensor-parallel degree (reported reductions of up to 80 percent at TP=8); throughput increases on models that exceed single-GPU memory or single-GPU compute.
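The split-then-reduce pattern can be shown on a single matrix-vector product. This is a pure-Python sketch of a row-parallel layer, with list slices standing in for GPUs and an elementwise sum standing in for AllReduce; a real deployment would use a collective library, and the function names are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # shape: rows x cols


def matvec(x: Vector, w: Matrix) -> Vector:
    """y = x @ w, for x of length rows and w of shape rows x cols."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(cols)]


def row_parallel_matvec(x: Vector, w: Matrix, degree: int) -> Vector:
    """Shard the rows of w (and the matching slice of x) across `degree`
    workers, compute partial outputs, then sum them (the AllReduce step)."""
    rows = len(w)
    if rows % degree != 0:
        raise ValueError("rows must divide evenly across workers")
    shard = rows // degree
    partials = [
        matvec(x[r * shard:(r + 1) * shard], w[r * shard:(r + 1) * shard])
        for r in range(degree)
    ]
    # Each worker holds a full-width partial result; summing recovers y.
    return [sum(p[j] for p in partials) for j in range(len(w[0]))]
```

Each worker stores only `rows / degree` rows of the weight matrix, which is the memory saving; the sum over partial outputs is the communication that every such layer pays on every token.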

Interconnect matters: high-bandwidth GPU-to-GPU links deliver materially higher throughput than commodity bus interconnects (reported 220 vs 120 tokens per second on a TP=4 configuration). AllReduce adds 10 to 30 percent overhead per token; KV-cache duplication adds roughly 20 percent.

Cost. Operational complexity (multi-GPU placement, NCCL tuning). Not free on small models; pays off when the model does not fit a single GPU or single-GPU throughput is the bottleneck.

Example

The following is an illustrative example. A team operates a customer-support copilot. SLO: TTFT below 800 ms at p95, ITL below 150 ms at p95, accuracy above 0.88 on a 200-example labeled set. Current setup: 13 B parameter model, single high-end GPU, static batching.

Baseline: TTFT p95 1.4 s, ITL p95 220 ms, accuracy 0.89. The latency SLO is failing; quality is fine.

Change 1: continuous batching with PagedAttention. TTFT p95 drops to 880 ms; ITL p95 drops to 170 ms. Throughput rises substantially under load. Quality unchanged. Still missing TTFT SLO at the long tail.

Change 2: semantic cache with 0.92 threshold and 10-minute TTL. Hit rate on the support workload: 28 percent. Effective TTFT p95 drops to 410 ms (cached) or 880 ms (miss); blended p95 below SLO. Quality on a 100-trace sampled audit: 0.90 (cached responses scored against fresh).

Change 3: speculative decoding with a 1 B draft model. ITL p95 drops to 105 ms on cache misses; quality unchanged because the target model verifies every token.

Change 4: FP8 quantization on weights and KV. Throughput rises 2 times; cost per request drops 45 percent. Quality dips from 0.89 to 0.86 on the labeled set, which is below the 0.88 floor in the SLO. Decision: keep quantization for the high-volume cached path; revert for the high-stakes path where accuracy is non-negotiable; gate releases on the labeled set's quality score so a future quantization decision cannot regress accuracy below the floor.

Each change ships behind a quality gate: TTFT and ITL targets on a representative workload, accuracy on the labeled set. A regression on any dimension blocks the merge.
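A gate of this kind can be a small function in the release pipeline. The sketch below hard-codes the thresholds from this example's SLO as defaults; `Measurement`, `Slo`, and `gate_failures` are hypothetical names, and how the measurements are produced is outside the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Measurement:
    ttft_p95_ms: float
    itl_p95_ms: float
    accuracy: float


@dataclass(frozen=True)
class Slo:
    max_ttft_p95_ms: float = 800.0
    max_itl_p95_ms: float = 150.0
    min_accuracy: float = 0.88


def gate_failures(m: Measurement, slo: Slo = Slo()) -> List[str]:
    """Return one message per violated target; an empty list means ship."""
    failures: List[str] = []
    if m.ttft_p95_ms >= slo.max_ttft_p95_ms:
        failures.append(f"TTFT p95 {m.ttft_p95_ms} ms is not below {slo.max_ttft_p95_ms} ms")
    if m.itl_p95_ms >= slo.max_itl_p95_ms:
        failures.append(f"ITL p95 {m.itl_p95_ms} ms is not below {slo.max_itl_p95_ms} ms")
    if m.accuracy <= slo.min_accuracy:
        failures.append(f"accuracy {m.accuracy} is not above {slo.min_accuracy}")
    return failures


if __name__ == "__main__":
    baseline = Measurement(ttft_p95_ms=1400, itl_p95_ms=220, accuracy=0.89)
    for message in gate_failures(baseline):
        print("BLOCKED:", message)
```

Run against the baseline above, the gate reports the two latency failures and passes accuracy; a candidate with accuracy 0.86 would be blocked on quality even if both latency targets were met.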

Limitations

  • TTFT and ITL are different problems. Prefill optimizations (batching, prompt caching) help TTFT; decode optimizations (speculative decoding, quantization, parallelism) help ITL. Mixing them up wastes engineering time.
  • Single-request benchmarks lie. What matters is latency under realistic concurrent load with realistic input length distribution.
  • Quantization quality loss is silent. Without a calibrated evaluator, the regression appears as a vague "vibes downgrade" weeks later. Score every release against a versioned ground truth set.
  • Caching introduces a new failure mode. Stale or mis-hit responses are hard to detect from latency alone; pair caching with a sampled freshness audit and a quality judge.
  • Tensor parallelism does not always pay off. On models that fit a single GPU, the AllReduce overhead can exceed the per-GPU compute savings.
  • Streaming hides the cost of long outputs. ITL looks fine; total response time may still be excessive if output length is not capped or guided.
  • Hardware drives the ceiling. A poorly matched stack on an older GPU has a quality-and-latency frontier that the techniques cannot extend; upgrades are sometimes the cheaper lever.

FAQ

TTFT or ITL: which do I optimize first? Whichever is failing the SLO. For chat, users complain about TTFT first; for streaming, ITL dominates. Measure both before deciding.

When does quantization hurt quality too much? INT4 is the most aggressive; INT8 and FP8 are usually safe on most tasks but should still be scored on your labeled set. Math, code, and long-context tasks are the most sensitive.

Is semantic caching worth it for low-traffic apps? Usually not. The infrastructure cost (vector store, embedding model, eviction policy) only amortizes when hit rates exceed roughly 10 percent.

Does speculative decoding always help? No. It helps only when the draft model's predictions are accepted often (commonly above 50 percent token acceptance). Test the draft model against your workload before adopting.

Do I need tensor parallelism for a 13 B model? On modern accelerators with 80 GB of memory, usually no. Single-GPU is simpler and the AllReduce overhead is non-trivial. Reach for tensor parallelism when the model does not fit or single-GPU throughput is the bottleneck.

How do I keep latency wins from silently regressing quality? Pair every latency change with a release gate on a labeled quality set. The release fails if either the latency target or the quality score regresses.
