Updated: 2026-02-20 By: Ari Heljakka
Short answer
Standard load balancers were designed for short, uniform HTTP requests. LLM requests are neither. A simple prompt may take 50 milliseconds; a long-context generation may take 45 seconds. Round-robin and least-connections heuristics treat both as equal units of work, which leads to cascading queue buildup, GPU memory thrash, and spikes in time-to-first-token (TTFT). Reliable LLM serving requires routers that understand token cost, key-value (KV) cache locality, predicted latency, and per-provider availability. The load balancer is itself one implementation of a serving objective (latency, throughput, reliability), so it must be measured against a versioned ground truth dataset like any other deployment change.
Key facts
- Definition: An LLM load balancer routes inference requests across a pool of workers (and often across providers) using policies aware of token cost, KV cache state, predicted latency, and provider health, rather than connection count alone.
- When to use: Any production deployment with more than one worker or more than one provider, where tail latency, GPU utilization, or provider availability matter.
- Limitations: Better routing does not fix a slow model, a wrong prompt, or a missing eval suite. It addresses the serving layer; the model layer needs its own measurement loop.
- Example: A team running a chat product moves from round-robin across four GPU nodes to a token-aware router with KV-cache stickiness and multi-provider failover. p99 TTFT drops materially; provider outages stop showing up as user-facing errors.
Key takeaways
- Treating LLM inference as standard HTTP traffic is the most common cause of avoidable tail-latency incidents.
- Token-aware routing, predicted-latency scheduling, and KV-cache stickiness are independent dimensions; the right policy composes them rather than picking one.
- Multi-provider failover converts provider-side incidents from outages into degraded modes, but only if the routing policy and the evaluation criteria stay model-agnostic.
- A load balancer is itself a versioned implementation of a serving objective. Evaluate it against a ground truth dataset on every change.
- Score routing decisions continuously: per-route latency, per-route error rate, per-route quality. Composition is additive; each dimension is its own 0 to 1 metric.
Definition
An LLM load balancer sits between the API surface and the pool of workers. Its job is to pick a worker for each request such that the system meets its serving objective: target latency at target throughput at target cost with target reliability. The objective is independent of the routing policy that implements it; the routing policy is independent of which model or provider is behind any given worker.
Typical components:
- Pool of workers. This can be a fleet of self-hosted GPU pods, a set of provider endpoints, or both.
- Health signals. Per-worker latency histograms, GPU memory, KV cache utilization, queue depth, recent error rate.
- Routing policy. A function from (request, pool state) to a chosen worker.
- Failover logic. A multi-stage retry policy for when the chosen worker fails or times out.
- Observability. Per-route metrics that feed both autoscaling and continuous evaluation.
When this matters
Three pressures force production teams off naive load balancers:
- Bimodal request cost. A request mix that includes both 200-token chats and 8,000-token RAG generations cannot be routed by connection count alone. The cheap requests queue behind the expensive ones, and tail latency explodes.
- KV cache locality. Multi-turn conversations are dramatically cheaper if subsequent turns land on the worker that still holds the prior turn's KV cache. Random routing throws that locality away.
- Provider availability. Single-provider deployments inherit single-provider outages. Multi-provider routing converts an outage into a degraded mode, but only with explicit failover policy and reliable health signals.
How it works
Primitive 1: token-aware routing
Track each worker's estimated in-flight token load (prefill plus decode), not its connection count. Route to the worker with the lowest token-weighted backlog. Two effects compound: small requests stop queuing behind large ones, and the variance of per-request latency drops sharply.
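As a minimal sketch of this primitive, the Python below tracks a token-weighted backlog per worker and routes to the smallest one. The `Worker` class, the `assign` helper, and the `decode_weight` value are illustrative assumptions, not part of any particular router.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    prefill_tokens: int = 0  # prompt tokens queued or being prefilled
    decode_tokens: int = 0   # output tokens still expected from in-flight requests

    def backlog(self, decode_weight: float = 2.0) -> float:
        # Token-weighted backlog. decode_weight is an assumed tuning knob that
        # reflects decode tokens costing more wall-clock time than prefill tokens.
        return self.prefill_tokens + decode_weight * self.decode_tokens


def pick_worker(workers: list[Worker]) -> Worker:
    # Choose the smallest token-weighted backlog, ignoring connection count.
    return min(workers, key=lambda w: w.backlog())


def assign(worker: Worker, prompt_tokens: int, expected_output_tokens: int) -> None:
    # Record the new request's cost so later routing decisions can see it.
    worker.prefill_tokens += prompt_tokens
    worker.decode_tokens += expected_output_tokens


workers = [Worker("gpu-0"), Worker("gpu-1")]
assign(workers[0], prompt_tokens=8000, expected_output_tokens=500)  # long RAG request
assign(workers[1], prompt_tokens=200, expected_output_tokens=100)   # short chat
chosen = pick_worker(workers)  # both hold one connection; gpu-1 holds far fewer tokens
```

Least-connections would see the two workers as tied here. A real router would also decrement the counters as tokens are processed; that bookkeeping is left out of the sketch.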
Primitive 2: KV cache stickiness
When a request belongs to an existing conversation (or shares a long prefix with earlier requests), route it to the worker that holds the cached state. Implementations use consistent hashing on a conversation ID or a hash of the shared prefix. Stickiness matters most for multi-turn chat; it matters little for one-shot completions.
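One way to sketch consistent hashing for stickiness is a hash ring with virtual nodes, as below. The `HashRing` class, the `sticky_key` helper, and the character-based prefix length are assumptions made for illustration; production systems typically hash token blocks rather than raw characters.

```python
import bisect
import hashlib
from typing import Optional


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, workers: list[str], vnodes: int = 64):
        # Each worker gets several virtual nodes so keys spread more evenly.
        self._ring = sorted(
            (_hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def worker_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def sticky_key(conversation_id: Optional[str], prompt: str,
               prefix_chars: int = 256) -> str:
    # Prefer the conversation ID; otherwise hash a fixed-length prompt prefix so
    # requests that share it (for example a long system prompt) land together.
    if conversation_id is not None:
        return f"conv:{conversation_id}"
    return "prefix:" + hashlib.sha256(prompt[:prefix_chars].encode()).hexdigest()


ring = HashRing(["gpu-0", "gpu-1", "gpu-2", "gpu-3"])
owner = ring.worker_for(sticky_key("conv-42", prompt=""))
```

The property that matters is that removing a worker only remaps the keys that worker owned; conversations pinned to other workers keep their cache.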
Primitive 3: predicted-latency scheduling
Train a small model (gradient-boosted trees are common) to predict time-to-first-token and time-per-output-token for a given (worker, request) pair from features like prompt length, current queue, KV cache state, and recent worker latency. Route to the worker with the lowest predicted total latency. This is where the largest measured TTFT improvements come from; reported reductions in the range of 70 percent on p50 are not unusual after switching from connection-count routing.
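The routing step can be sketched independently of how the predictor is trained. In the Python below, a hand-written linear formula stands in for the trained model; its coefficients, the `WorkerState` fields, and the cache discount are invented for illustration and would be replaced by a fitted regressor in practice.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    name: str
    queued_tokens: int     # tokens waiting ahead of this request
    cache_hit: bool        # whether the worker holds a usable KV cache prefix
    recent_tpot_ms: float  # recently observed time per output token


def predict_ttft_ms(w: WorkerState, prompt_tokens: int) -> float:
    # Stand-in for a trained regressor. All coefficients are made up.
    effective_prompt = prompt_tokens * (0.2 if w.cache_hit else 1.0)
    return 20.0 + 0.05 * w.queued_tokens + 0.1 * effective_prompt


def predict_total_ms(w: WorkerState, prompt_tokens: int,
                     expected_output_tokens: int) -> float:
    # Total latency = time to first token + per-token decode time.
    return predict_ttft_ms(w, prompt_tokens) + w.recent_tpot_ms * expected_output_tokens


def pick_worker(workers: list[WorkerState], prompt_tokens: int,
                expected_output_tokens: int) -> WorkerState:
    # Route to the lowest predicted total latency for this specific request.
    return min(
        workers,
        key=lambda w: predict_total_ms(w, prompt_tokens, expected_output_tokens),
    )


workers = [
    WorkerState("busy-with-cache", queued_tokens=4000, cache_hit=True, recent_tpot_ms=30.0),
    WorkerState("idle-no-cache", queued_tokens=0, cache_hit=False, recent_tpot_ms=30.0),
]
best = pick_worker(workers, prompt_tokens=2000, expected_output_tokens=100)
```

With these made-up numbers the idle worker narrowly wins despite the cache miss, which is the kind of trade-off a per-request prediction can capture and a fixed heuristic cannot.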
Primitive 4: separated prefill and decode
For long-context workloads, prefill (the one-shot compute on the input) and decode (token-by-token generation) have different bottlenecks: prefill is typically compute-bound, while decode is typically limited by memory bandwidth. Routing them to different worker pools lets each pool be tuned for its own compute pattern. Combined with chunked prefill, this approach can substantially reduce p95 TTFT on long-prompt traffic.
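A rough sketch of routing across separated pools follows. It assumes each pool is balanced on a different signal (queued prompt tokens for prefill, sequence-slot utilization for decode); the class names and fields are hypothetical, and the handover of the KV cache from the prefill worker to the decode worker is not modelled.

```python
from dataclasses import dataclass


@dataclass
class PrefillWorker:
    name: str
    queued_prompt_tokens: int  # prefill load is driven by prompt tokens to process


@dataclass
class DecodeWorker:
    name: str
    active_sequences: int      # decode load is driven by concurrent sequences
    max_sequences: int


def route_disaggregated(
    prefill_pool: list[PrefillWorker], decode_pool: list[DecodeWorker]
) -> tuple[PrefillWorker, DecodeWorker]:
    # Prefill: pick the worker with the fewest prompt tokens queued.
    prefill = min(prefill_pool, key=lambda w: w.queued_prompt_tokens)
    # Decode: pick the least utilized worker that still has a free sequence slot.
    candidates = [w for w in decode_pool if w.active_sequences < w.max_sequences]
    if not candidates:
        raise RuntimeError("decode pool is saturated")
    decode = min(candidates, key=lambda w: w.active_sequences / w.max_sequences)
    return prefill, decode


prefill_pool = [PrefillWorker("p-0", 12000), PrefillWorker("p-1", 3000)]
decode_pool = [DecodeWorker("d-0", 64, 64), DecodeWorker("d-1", 40, 64)]
prefill_worker, decode_worker = route_disaggregated(prefill_pool, decode_pool)
```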
Primitive 5: multi-provider failover
Treat each provider endpoint as one more worker in the pool, with its own health signal. A typical scoring rubric weighs uptime, throughput, price, and latency (commonly 50 / 20 / 20 / 10 percent, respectively). Failover is multi-stage:
- Retry the same provider with backoff for transient errors.
- Retry a different region of the same provider for regional outages.
- Retry a different provider entirely for provider-wide outages.
- Surface the failure with a clear error and route the trace to evaluation review.
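The four stages above can be sketched as an ordered list of attempts, each retried with exponential backoff before moving on. `TransientError`, `call_with_failover`, and the retry parameters are hypothetical names and values chosen for illustration; routing the failed trace to evaluation review is left to the caller.

```python
import time
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical error type for retryable failures (timeouts, rate limits, 5xx)."""


Attempt = Callable[[], str]  # zero-argument call that returns the completion text


def call_with_failover(
    stages: list[tuple[str, Attempt]],
    retries_per_stage: int = 2,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    # Stages are ordered: same provider, then another region, then another provider.
    last_error: Optional[Exception] = None
    for stage_name, attempt in stages:
        for i in range(retries_per_stage):
            try:
                return stage_name, attempt()
            except TransientError as err:
                last_error = err
                if i < retries_per_stage - 1:
                    sleep(base_delay_s * 2 ** i)  # exponential backoff within a stage
    # Final stage: surface a clear failure instead of retrying forever.
    raise RuntimeError("all failover stages exhausted") from last_error
```

A caller would pass something like `[("primary", call_primary), ("other-region", call_region), ("fallback-provider", call_fallback)]`, where each callable wraps the real client; those callables are not shown here.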
The router must remain model-agnostic: the serving objective and the evaluation criteria must give consistent scores whether the request landed on provider A's flagship model or on a self-hosted fallback. Otherwise multi-provider routing silently changes the product behind users' backs.
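The scoring rubric mentioned earlier in this section can be written as a weighted sum. The sketch below assumes each signal has already been normalized to a 0 to 1 scale where higher is better (so a cheap provider has a high `price` score); the provider names and signal values are invented.

```python
# Weights from the rubric: uptime, throughput, price, latency = 50 / 20 / 20 / 10.
WEIGHTS = {"uptime": 0.5, "throughput": 0.2, "price": 0.2, "latency": 0.1}


def provider_score(signals: dict[str, float]) -> float:
    # Each signal is assumed to be pre-normalized to 0..1, higher is better.
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


def rank_providers(providers: dict[str, dict[str, float]]) -> list[str]:
    # Best-scoring provider first; this order can drive the failover sequence.
    return sorted(providers, key=lambda p: provider_score(providers[p]), reverse=True)


providers = {
    "provider-a": {"uptime": 1.0, "throughput": 0.5, "price": 0.5, "latency": 0.5},
    "provider-b": {"uptime": 0.9, "throughput": 1.0, "price": 1.0, "latency": 1.0},
}
ranking = rank_providers(providers)
```

Because uptime carries half the weight, a provider needs a clear lead on the other three signals to outrank a more available one, as `provider-b` does in this made-up example.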
Primitive 6: continuous evaluation of routing
The routing policy itself is one implementation of the serving objective. Score it. Compose independent dimensions:
| Dimension | Measurement | 0 to 1 mapping |
|---|---|---|
| Latency | p99 TTFT vs target | 1 at target, 0 at 4x target |
| Throughput | sustained tokens/sec vs target | 1 at target, 0 at half target |
| Reliability | success rate over window | 1 at four nines, 0 at two nines |
| Quality consistency | evaluator score variance across workers | 1 at zero variance, 0 at policy-violating divergence |
Each dimension is a separate metric. A new routing policy goes through the same kind of CI gate as a model change: it must clear all dimensions on a versioned scenario set before promotion.
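The table gives only the endpoints of each mapping. As one possible reading, the sketch below interpolates linearly between them and measures reliability in "nines"; the linear shape, the `max_quality_variance` parameter, and the per-dimension thresholds are assumptions, not something the table specifies.

```python
import math


def linear_score(value: float, best: float, worst: float) -> float:
    # 1.0 at `best`, 0.0 at `worst`, linear in between, clamped outside the range.
    t = (value - worst) / (best - worst)
    return max(0.0, min(1.0, t))


def nines(success_rate: float) -> float:
    # 0.99 -> 2 nines, 0.9999 -> 4 nines.
    if success_rate >= 1.0:
        return math.inf
    return -math.log10(1.0 - success_rate)


def score_policy(p99_ttft_ms: float, target_ttft_ms: float,
                 tokens_per_s: float, target_tokens_per_s: float,
                 success_rate: float,
                 quality_variance: float, max_quality_variance: float) -> dict[str, float]:
    return {
        "latency": linear_score(p99_ttft_ms, best=target_ttft_ms,
                                worst=4 * target_ttft_ms),
        "throughput": linear_score(tokens_per_s, best=target_tokens_per_s,
                                   worst=0.5 * target_tokens_per_s),
        "reliability": linear_score(nines(success_rate), best=4.0, worst=2.0),
        "quality_consistency": linear_score(quality_variance, best=0.0,
                                            worst=max_quality_variance),
    }


def passes_gate(scores: dict[str, float], thresholds: dict[str, float]) -> bool:
    # Every dimension must clear its own threshold; a strong score in one
    # dimension cannot compensate for a regression in another.
    return all(scores[dim] >= thresholds[dim] for dim in thresholds)
```

For instance, a p99 TTFT of 1,250 ms against a 500 ms target scores 0.5 on latency under this mapping, since it sits halfway between the target and four times the target.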
Example
A team running a chat product on four self-hosted GPU nodes plus a managed provider as failover.
Before. Standard round-robin at the cloud load balancer. p99 TTFT spikes above five seconds during traffic bursts. KV cache hit rate sits near zero because conversations bounce across nodes. A provider-side incident manifests as a five-minute outage to users.
Change. The team introduces a token-aware router with three policies layered:
- Hash the conversation ID and prefer the worker holding the prior KV cache.
- Within the candidates, pick the lowest token-weighted backlog.
- Fall back to a predicted-latency scheduler when no cache match exists.
Provider failover is configured as a separate pool with a multi-stage retry policy.
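The layering of the three policies can be sketched as a single routing function. This is a simplified illustration: it looks up cache ownership in a per-worker set rather than hashing the conversation ID, and the `Worker` fields and the `predict_latency_ms` callback are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Worker:
    name: str
    backlog_tokens: int
    cached_conversations: set[str] = field(default_factory=set)


def route(
    conversation_id: Optional[str],
    workers: list[Worker],
    predict_latency_ms: Callable[[Worker], float],
) -> Worker:
    # Layer 1: prefer workers that still hold this conversation's KV cache.
    candidates = [w for w in workers if conversation_id in w.cached_conversations]
    if candidates:
        # Layer 2: among cache holders, pick the lowest token-weighted backlog.
        return min(candidates, key=lambda w: w.backlog_tokens)
    # Layer 3: no cache match, so fall back to predicted latency.
    return min(workers, key=predict_latency_ms)


workers = [
    Worker("gpu-0", backlog_tokens=100),
    Worker("gpu-1", backlog_tokens=5000, cached_conversations={"conv-7"}),
    Worker("gpu-2", backlog_tokens=900, cached_conversations={"conv-7"}),
]
# Toy predictor for the fallback path: latency grows with backlog.
toy_predictor = lambda w: 20.0 + 0.05 * w.backlog_tokens
```

Here `route("conv-7", workers, toy_predictor)` stays on a cache holder even though `gpu-0` is less loaded, while an unknown conversation goes wherever the predictor points.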
After. p99 TTFT drops markedly on multi-turn traffic; the KV cache hit rate climbs above 80 percent on chat. The next provider-side incident degrades to a slower response (provider fallback kicks in) rather than an outage.
Evaluation gate. The new routing policy was scored against a versioned scenario set: 200 chat sessions, 100 long-prompt completions, three simulated provider outages. The policy had to clear all four dimensions (latency, throughput, reliability, quality consistency) before promotion. A regression in any dimension blocks the rollout.
Limitations
- Routing does not fix model quality. A faster route to a regressing model is still a regression. Model-quality evaluators must run independently.
- KV cache stickiness is fragile under scale-down. When a worker is removed from the pool, its cached state goes with it. Plan for cache miss storms during scale-down.
- Predicted-latency models drift. They are themselves models; their accuracy decays as traffic patterns or worker hardware change. Treat the prediction model as a managed component with its own calibration loop.
- Multi-provider failover hides behavior shifts. If different providers serve different models, the user experience changes silently during failover. Evaluators must score outputs consistently across providers; otherwise failover masks quality regressions.
- Cost of complexity. A routing layer is one more piece of infrastructure to operate. The benefits are real only when traffic patterns or provider risk justify the complexity.
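For the predictor-drift limitation above, a calibration loop can start as simply as tracking rolling prediction error. The sketch below uses mean absolute percentage error over a sliding window; the `CalibrationMonitor` class, the window size, and the 25 percent threshold are assumptions for illustration.

```python
from collections import deque


class CalibrationMonitor:
    def __init__(self, window: int = 500, max_mape: float = 0.25):
        # Keep only the most recent `window` relative errors.
        self._errors: deque[float] = deque(maxlen=window)
        self.max_mape = max_mape

    def record(self, predicted_ms: float, observed_ms: float) -> None:
        # Relative error of one latency prediction; skip degenerate observations.
        if observed_ms > 0:
            self._errors.append(abs(predicted_ms - observed_ms) / observed_ms)

    def mape(self) -> float:
        # Mean absolute percentage error over the window (0.0 when empty).
        return sum(self._errors) / len(self._errors) if self._errors else 0.0

    def needs_recalibration(self) -> bool:
        # Flag the predictor once its rolling error exceeds the tolerance.
        return self.mape() > self.max_mape
```

When the flag trips, a reasonable response is to retrain the predictor on recent traffic and, in the meantime, fall back to token-aware routing, which does not depend on the prediction.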
Evidence and sources
- vLLM project documentation, https://docs.vllm.ai/, for paged attention, continuous batching, and the inference-server metrics that production routing policies depend on.
- "The Tail at Scale" by Jeffrey Dean and Luiz André Barroso, https://research.google/pubs/the-tail-at-scale/, for the foundational treatment of why latency-aware routing matters for systems with bimodal request cost.
- The SkyWalker scheduling paper, https://arxiv.org/abs/2505.24095, for one published treatment of predicted-latency routing in LLM serving.
FAQ
Is a hosted LLM router worth it over self-implemented routing?
Depends on traffic and risk. Self-implemented routing is fine when you have one provider and predictable traffic; a more sophisticated router earns its complexity when you cross multiple providers, run mixed workloads, or need explicit SLOs.
How big a latency win should I expect from token-aware routing?
For bursty, mixed-cost traffic, double-digit percent p99 TTFT improvement is common. The exact win depends on how bimodal your traffic is; uniform traffic sees less benefit.
Does KV cache stickiness help one-shot completions?
Not much. Stickiness is a multi-turn optimization. For one-shot completions, predicted-latency or token-aware policies dominate.
How do I keep multi-provider failover from silently changing model behavior?
Score outputs from every provider against the same evaluator suite. If the suite's per-provider scores diverge above your tolerance, the failover policy is changing your product, not just your latency. Either equalize behavior with prompts or constrain the fallback to a closer match.
Should the routing policy be in CI?
Yes. Treat routing changes the same as model changes: a versioned scenario set, all dimensions scored, regressions block promotion.