Updated: 2026-02-03 By: Ari Heljakka
Short answer
Dockerizing an LLM is straightforward; keeping it alive under production traffic is not. A reliable image pins every version, mounts the model cache, sets shared memory and GPU memory utilization explicitly, runs as a non-root user with build secrets injected at build time rather than baked into layers, exposes LLM-specific metrics (time-to-first-token, queue depth, KV cache use), and is gated by automated evaluation against a versioned ground truth dataset in CI/CD. Treat the container as one implementation of a serving objective whose quality must be continuously measured, not as a one-off artifact.
Key facts
- Definition: A Dockerized LLM workload is a containerized inference service (often built on vLLM, TGI, or a custom FastAPI stack) that runs an LLM behind a stable interface, with GPU access, persisted weights, and observability wired in.
- When to use: Any production deployment of a self-hosted model where reproducibility, multi-environment parity, and orchestration on Kubernetes or similar are requirements.
- Limitations: Containers do not solve model-quality drift, prompt regressions, or evaluator calibration. They give you a stable implementation; the serving objective and its quality gates still have to be defined and measured separately.
- Example: A team serving a 70B model behind an OpenAI-compatible API uses a pinned base image, a multi-stage build, a mounted weight cache, KEDA autoscaling on queue depth, and a CI pipeline that runs a versioned evaluator suite against every image tag before promotion.
Key takeaways
- Pin every version. Pin the base image, the CUDA toolkit, the driver expectation, and every Python dependency. A floating tag such as `latest` is a guarantee of silent breakage.
- Configure shared memory, GPU utilization, and tensor parallelism explicitly. The defaults are tuned for non-LLM workloads.
- Expose LLM-specific metrics from day one: TTFT, TPOT, KV cache utilization, queue depth. Generic CPU and RAM panels will mislead you on autoscaling decisions.
- Bake security into the build (non-root user, no embedded secrets, scanned dependencies). Most insecure container deployments originate at build time, not at runtime.
- Treat the evaluator suite as a deployment gate alongside the container. A passing image with a regressing model is still a regression.
Definition
A Dockerized LLM workload bundles an inference engine, a model (or a deterministic loader for one), and an API surface into a reproducible container image. The container is one implementation of a serving objective: deliver responses at a target latency, throughput, and quality. The objective is independent of which inference engine you pick, which quantization you apply, or which orchestration system schedules the pod. Holding that separation explicitly is what lets you swap implementations without rewriting your evaluation criteria.
When this matters
Three pressures push teams from bare-metal serving to containerized LLM workloads:
- Environment parity. Driver, CUDA, and Python version mismatches dominate "works on my machine" incidents in LLM serving. A pinned image collapses the surface.
- Orchestration. Kubernetes, KServe, and similar platforms assume containers. Once you scale past a single GPU node, container hygiene stops being optional.
- Compliance and reproducibility. Audited deployments need a versioned, scannable artifact. A signed, pinned image is the unit auditors can reason about.
How it works
Stage 1: Environment prerequisites
- Driver and CUDA alignment. Verify the host NVIDIA driver supports the CUDA version in your base image (for example, CUDA 12.1 requires driver 525 or later). Mismatch is the most common cause of "container starts, GPU invisible."
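- NVIDIA Container Toolkit installed and configured. Run the toolkit's runtime configuration command (for Docker, `nvidia-ctk runtime configure --runtime=docker`) once per host, then validate by running `nvidia-smi` inside a GPU-enabled test container.
- Sized hardware. A 7B model needs roughly 16 GB of VRAM at FP16; a 70B model needs roughly 140 GB at FP16 for the weights alone, which means multiple GPUs, or about 40 GB with INT4. Plan for at least 64 GB of system RAM and NVMe-class disk for weight loading.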
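The sizing figures above follow from simple arithmetic: parameter count times bytes per parameter, plus some runtime overhead. The sketch below makes that explicit. The 15 percent overhead factor is an illustrative assumption, not a measured value, and the estimate ignores KV cache growth under load.

```python
# Rough VRAM estimate for model weights. KV cache and activations are NOT included.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, precision: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1e9


def estimate_with_overhead(params_billions: float, precision: str,
                           overhead: float = 0.15) -> float:
    """Weights plus a hypothetical fractional overhead for runtime buffers."""
    return weight_vram_gb(params_billions, precision) * (1.0 + overhead)
```

With these assumptions, `estimate_with_overhead(7, "fp16")` lands near 16 GB and `estimate_with_overhead(70, "int4")` near 40 GB, while `weight_vram_gb(70, "fp16")` is 140 GB before any overhead.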
Stage 2: Optimized image construction
- Pick a base image deliberately. A paged-attention batching engine image (such as a pinned vLLM tag) for high-throughput OpenAI-compatible serving, a text-generation-inference image for long-prompt workloads, or a plain CUDA runtime image for custom stacks. The image is one implementation of the serving objective; do not couple your evaluation criteria to which one you pick.
- Pin the tag. Never use `latest`. Pin to a specific semver tag and update through a deliberate version bump.
- Multi-stage build. Separate build dependencies from runtime. This can cut image size by 60 to 90 percent and removes build-time secrets and tools from the final image.
- Pin Python dependencies. Use exact versions in `requirements.txt` and run pip with `--no-cache-dir`. Add a `.dockerignore`. Chain cleanup steps after installation to keep layers small.
- Create a non-root user. Add a `USER` directive. Never run inference as root.
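Pinning rules like these are easy to enforce mechanically in CI. The following is a minimal sketch of such a check, assuming a plain requirements file with one `name==version` entry per line; it deliberately treats environment markers, hashes, and URLs as unpinned, which a real linter would handle more carefully. The function names are hypothetical.

```python
import re

# Matches "name==version" with optional extras, e.g. "uvicorn[standard]==0.30.1".
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[A-Za-z0-9_.+!\-]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to an exact version."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            bad.append(line)
    return bad


def has_floating_tag(image_ref: str) -> bool:
    """True if the image reference has no tag or digest, or uses 'latest'."""
    if "@sha256:" in image_ref:
        return False
    last = image_ref.rsplit("/", 1)[-1]  # ignore a registry port such as :5000
    if ":" not in last:
        return True
    return last.rsplit(":", 1)[1] == "latest"
```

A CI step could fail the build whenever either function reports a finding.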
Stage 3: GPU and memory configuration
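- Shared memory. Pass an explicit shared memory size (`--shm-size` in Docker) to the container, with a larger value for multi-GPU than for a single GPU. The default 64 MB starves NCCL.
- GPU memory utilization. Set the engine's GPU memory utilization fraction explicitly and leave headroom. Higher values risk CUDA OOM under bursty traffic.
- Tensor parallelism. Match the tensor-parallel size to the number of GPUs in the pod. Combine it with the inter-GPU communication settings appropriate to your hardware for faster GPU-to-GPU communication.
- Weight cache volume. Mount the model cache directory as a persistent volume to avoid re-downloading weights on every pod restart. For air-gapped environments, bake weights into the image instead.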
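To make the configuration above concrete, here is a sketch that assembles a `docker run` argument list from an explicit config object. It assumes a vLLM-style engine whose entrypoint accepts `--tensor-parallel-size` and `--gpu-memory-utilization`; the image name, paths, and all values are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class GpuRunConfig:
    image: str                     # pinned image reference (placeholder)
    shm_size: str                  # passed to docker's --shm-size
    cache_host_dir: str            # host directory holding downloaded weights
    cache_container_dir: str       # where the engine looks for its cache
    gpu_count: int                 # also used as the tensor-parallel size
    gpu_memory_utilization: float  # fraction of VRAM the engine may claim


def docker_run_args(cfg: GpuRunConfig) -> list[str]:
    """Build the argument list; nothing is executed here."""
    if not 0.0 < cfg.gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    if cfg.gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return [
        "docker", "run", "--rm",
        "--gpus", "all",
        "--shm-size", cfg.shm_size,
        "-v", f"{cfg.cache_host_dir}:{cfg.cache_container_dir}",
        cfg.image,
        # Engine flags below assume a vLLM-style CLI.
        "--tensor-parallel-size", str(cfg.gpu_count),
        "--gpu-memory-utilization", str(cfg.gpu_memory_utilization),
    ]
```

Keeping every value in one typed object means nothing falls back to a default silently, which is the point of configuring these settings explicitly.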
Stage 4: Health checks and startup
- Startup grace period. Set the health check start period to 120 to 300 seconds as a baseline, and extend it for large models: loading a 70B model into VRAM can take 5 to 20 minutes.
- Liveness vs readiness. A liveness probe should not depend on a successful inference; a readiness probe should. Conflating them causes pods to be killed mid-warmup.
- Ingress timeouts. Set the proxy read timeout to 300 to 600 seconds so long generations are not cut off at the proxy.
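The liveness and readiness split can be sketched in a few lines. This is an illustrative model of the probe logic only, with a hypothetical `ProbeState` class and an injected `infer` callable standing in for the real engine; wiring it to HTTP endpoints is left out.

```python
from typing import Callable


class ProbeState:
    """Tracks what each probe is allowed to depend on."""

    def __init__(self, infer: Callable[[str], str]) -> None:
        self._infer = infer
        self.model_loaded = False  # flipped to True once weights are in VRAM

    def liveness(self) -> bool:
        # Only confirms the process can respond; never touches the model,
        # so a slow warmup cannot get the pod killed.
        return True

    def readiness(self) -> bool:
        # Traffic is accepted only after a tiny inference succeeds.
        if not self.model_loaded:
            return False
        try:
            return len(self._infer("ping")) > 0
        except Exception:
            return False
```

During warmup this reports live but not ready, so the orchestrator holds traffic back without restarting the pod.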
Stage 5: Performance metrics
The container must expose LLM-specific metrics. CPU and memory dashboards alone will mislead your autoscaler.
| Metric | Target | Why it matters |
|---|---|---|
| Time to first token (TTFT) p99 | under 500 ms for chat, under 5 s for RAG | The dominant component of perceived latency |
| Time per output token (TPOT) | under 50 ms | Determines streaming smoothness |
| KV cache utilization | under 85 percent sustained | Headroom for new requests |
| GPU cache usage percent | alert above 95 percent | Imminent OOM risk |
| Queue depth (requests waiting) | under 100 | Primary autoscaling signal |
Frame each of these as a normalized signal (target met = 1, severe breach = 0) so they compose into a single serving-quality scorecard alongside model-quality metrics.
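One way to produce such a normalized signal is linear interpolation between the target and a severe-breach threshold. The sketch below assumes lower-is-better metrics; the severe thresholds are hypothetical values you would choose per metric, since the table above only defines targets.

```python
def normalized_signal(value: float, target: float, severe: float) -> float:
    """Map a lower-is-better metric to [0, 1].

    Returns 1.0 at or below target, 0.0 at or beyond the severe threshold,
    and interpolates linearly in between.
    """
    if severe <= target:
        raise ValueError("severe must be greater than target")
    if value <= target:
        return 1.0
    if value >= severe:
        return 0.0
    return (severe - value) / (severe - target)


def serving_signals(ttft_p99_ms: float, tpot_ms: float, queue_depth: float) -> dict[str, float]:
    """Per-metric signals; the severe thresholds here are illustrative assumptions."""
    return {
        "ttft": normalized_signal(ttft_p99_ms, target=500.0, severe=2000.0),
        "tpot": normalized_signal(tpot_ms, target=50.0, severe=200.0),
        "queue": normalized_signal(queue_depth, target=100.0, severe=400.0),
    }
```

Keeping the signals as separate entries rather than averaging them preserves visibility into which metric is breaching.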
Stage 6: Production deployment
- Orchestration. Kubernetes is the default once you cross two nodes or three models. KServe is worth the complexity past ten models or multi-team usage.
- Autoscaling. Use event-driven autoscaling against queue depth, not standard CPU-based horizontal pod autoscaling. Keep the minimum replica count at one or higher to avoid cold-start storms. A 300 second cooldown prevents thrashing.
- GPU scheduling. Install the vendor's GPU device plugin. Multi-Instance GPU partitioning is useful for multi-tenant isolation but does not always cooperate with cloud autoscalers.
- Rollout strategy. Rolling updates or canary deployments, gated by an evaluator suite that scores the new image against a versioned ground truth dataset before traffic shifts.
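The autoscaling behaviour described above can be modelled with two small functions. This is a simplified sketch of the decision logic, not the algorithm any particular autoscaler implements; `target_per_replica` is a hypothetical tuning parameter for how many waiting requests one pod should absorb.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count implied by queue depth, clamped to the allowed range."""
    raw = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


def next_replicas(current: int, desired: int,
                  seconds_since_last_change: float,
                  cooldown_s: float = 300.0) -> int:
    """Scale up immediately; scale down only after the cooldown has elapsed."""
    if desired > current:
        return desired
    if desired < current and seconds_since_last_change >= cooldown_s:
        return desired
    return current
```

A minimum of one or more replicas means an empty queue never drives the deployment to zero, and the cooldown keeps a brief lull from triggering a scale-down that would have to be undone minutes later.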
Stage 7: Security hardening
- Non-root user. Always. Place the `USER` directive near the end of the Dockerfile.
- No embedded secrets. Use BuildKit secret mounts (`RUN --mount=type=secret`) at build time and Kubernetes Secrets at runtime. Never bake an API key into a layer.
- Read-only root filesystem. Set `readOnlyRootFilesystem: true` in the pod spec, with explicit writable mounts for cache and tmp.
- Capability drop. Drop all Linux capabilities and add back only what you need.
- Image scanning in CI. A container vulnerability scanner and a Dockerfile linter in the pipeline. Block deployment on critical findings.
- Pin and sign. Pin to image digests (not just tags) and sign images for supply-chain integrity.
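Several of these hardening rules can be checked against a container spec before deployment. The sketch below assumes the spec has already been parsed into a dictionary shaped like a Kubernetes container entry (`image`, `securityContext`); the function name and finding messages are hypothetical, and this is not a substitute for a real policy engine or scanner.

```python
import re

# An image reference pinned by digest: "<name>@sha256:<64 hex chars>".
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def container_findings(container: dict) -> list[str]:
    """Return a list of hardening violations for one container spec."""
    findings = []
    ctx = container.get("securityContext") or {}
    caps = ctx.get("capabilities") or {}
    if not DIGEST_REF.match(container.get("image", "")):
        findings.append("image is not pinned by digest")
    if ctx.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not true")
    if ctx.get("readOnlyRootFilesystem") is not True:
        findings.append("readOnlyRootFilesystem is not true")
    if "ALL" not in (caps.get("drop") or []):
        findings.append("capabilities.drop does not include ALL")
    return findings
```

An empty list means the spec passes these four checks; a pipeline could block promotion on any non-empty result.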
Stage 8: Evaluation gates
This is the step most container checklists omit. A deployable image is not a deployable model.
- Versioned ground truth dataset. A labeled set of prompts and reference outputs lives in the repo, alongside the Dockerfile. Each image tag is scored against it.
- Dimensional decomposition. Score independent dimensions (faithfulness, answer relevance, latency adherence, refusal correctness) as separate 0 to 1 metrics. Composite scores hide regressions in single dimensions.
- Calibration. Managed evaluators (LLM judges or learned classifiers) are themselves calibrated against human labels and recalibrated whenever a base model changes.
- Deployment gate. A new image tag does not promote unless all dimensions clear their thresholds. Evaluation is operational infrastructure, not a post-deployment activity.
- Model-agnostic criteria. The evaluator suite must give consistent scores whether you swap from a 13B to a 70B model, or from FP16 to INT4. That is what lets you test optimization tradeoffs without the result being confounded by judge behaviour.
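A per-dimension gate is short to express in code. The sketch below uses made-up dimension scores and thresholds to show why a composite hides a single-dimension regression; none of the numbers come from a real evaluation.

```python
def gate(scores: dict[str, float],
         thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Pass only if every dimension meets its threshold.

    A dimension missing from `scores` counts as 0.0 and therefore fails.
    """
    failing = sorted(d for d, t in thresholds.items() if scores.get(d, 0.0) < t)
    return (not failing, failing)


# Illustrative values only.
thresholds = {
    "faithfulness": 0.85,
    "answer_relevance": 0.85,
    "latency_adherence": 0.85,
    "refusal_correctness": 0.80,
}
scores = {
    "faithfulness": 0.95,
    "answer_relevance": 0.93,
    "latency_adherence": 0.97,
    "refusal_correctness": 0.60,
}

passed, failing = gate(scores, thresholds)
composite = sum(scores.values()) / len(scores)
```

Here the mean score clears 0.85 while `refusal_correctness` sits well below its threshold, so a composite-only gate would promote an image that the dimensional gate blocks.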
Example
A team serving a fine-tuned 70B model behind an OpenAI-compatible endpoint:
- Image. A pinned batching-engine base image, multi-stage build, weights baked in for an air-gapped customer, non-root user, dependencies pinned, image digest-pinned in the deployment manifest.
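- GPU config. Eight high-end GPUs per pod, with shared memory size, GPU memory utilization, tensor parallelism, and inter-GPU communication settings all set explicitly.
- Health checks. 240 second startup grace period, readiness probe that issues a one-token completion, liveness probe that pings the engine's health endpoint.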
. - Autoscaling. Event-driven autoscaling between two and twelve pods on the request-waiting metric, 300 second cooldown.
- Observability. A time-series metrics backend scrapes the inference engine's metrics; dashboards show TTFT p99, TPOT, queue depth, and KV cache utilization side by side with a model-quality score panel fed by the evaluator suite.
- Evaluation gate. Every image tag runs through a versioned suite of about 800 labeled examples covering five independent dimensions. A new tag does not promote unless all five hold within tolerance of the previous tag.
The result is an image whose performance, security, and quality are all measurable and gated, not assumed.
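The "within tolerance of the previous tag" rule in the evaluation gate is a relative check rather than an absolute threshold. A minimal sketch, assuming both tags were scored on the same dimensions and using a hypothetical tolerance of 0.02:

```python
def regressions(previous: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> dict[str, float]:
    """Return the dimensions where the candidate dropped by more than `tolerance`.

    Values are the size of the drop. A dimension missing from the candidate
    is treated as a score of 0.0.
    """
    out = {}
    for dim, prev in previous.items():
        drop = prev - candidate.get(dim, 0.0)
        if drop > tolerance:
            out[dim] = round(drop, 4)
    return out


def may_promote(previous: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> bool:
    """Promote only when no dimension regressed beyond tolerance."""
    return not regressions(previous, candidate, tolerance)
```

Improvements and small fluctuations pass; any single dimension that falls further than the tolerance blocks promotion and names itself in the result.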
Limitations
- Container hygiene does not fix model quality. A flawless image with a regressing model is still a regression. Evaluation must run alongside.
- Weight baking trades portability for startup time. Baked weights mean faster cold starts and worse image churn. The right answer depends on your update cadence.
- Throughput optimizations affect output distribution. Continuous batching, paged attention, and quantization can shift token distributions in subtle ways. Re-run the evaluator suite whenever you change any of them.
- Driver and toolkit upgrades are coupled. A host driver upgrade can break every pinned image on the cluster. Coordinate with capacity planning.
- Cost compounds. GPU pods are expensive; idle pods are very expensive. Autoscaling on queue depth, not on CPU, is essential.
Evidence and sources
- NVIDIA Container Toolkit documentation, https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/overview.html, for driver and runtime requirements.
- vLLM project, https://github.com/vllm-project/vllm, for paged attention, continuous batching, and metric names referenced in this post.
- KEDA, https://keda.sh, for event-driven autoscaling on inference queue depth.
FAQ
Should I bake model weights into the image or mount them? Bake them for air-gapped deployments and where startup time matters more than image size. Mount them in active development and where many models share a cache. Both are valid; the decision is operational, not architectural.
Is a high-throughput batching engine always the right choice over a chunked-prefill engine? No. Paged-attention batching engines tend to win on throughput-oriented chat workloads; chunked-prefill engines can win on long-prompt RAG. Score both implementations against the same evaluator suite on your traffic before deciding. The serving objective (latency and quality at target cost) is independent of which engine implements it.
How do I autoscale a GPU workload that takes minutes to start? Keep the minimum replica count high enough to absorb peak traffic, scale ahead of demand based on queue depth, and avoid scaling to zero. Cold starts of five to twenty minutes make zero-replica strategies impractical for most production LLM workloads.
Where does evaluation fit in this pipeline? Inside CI/CD, as a promotion gate. Build the image, run the evaluator suite against a versioned ground truth dataset, promote on pass, alert on regression. Evaluation belongs at the same point as integration tests, not after deployment.
What is the most common security hole in LLM containers? Embedded credentials and root-by-default. Both are build-time problems. Use BuildKit secret mounts, scan every image, run as a non-root user.