By: Ari Heljakka
Short answer
Eight-bit quantization typically halves weight memory and is often close to the 16-bit baseline. Four-bit quantization can cut raw weight memory by roughly three quarters, but its quality depends much more on the model, task, calibration path, runtime, and context length. Below four bits, regressions become much harder to predict. The decision is not just "should we quantize" but "at what precision, on what workload, on what hardware, with which runtime, validated against what versioned evaluation set." The savings are real only when the scorecard and deployment math say they are.
Key facts
- Definition: Quantization reduces the numerical precision of a model's weights (and sometimes activations), trading a small amount of accuracy for large reductions in memory, bandwidth, and compute cost.
- When to use: Whenever the deployment is bound by memory, latency, or per-token cost, and a held-out evaluation set can quantify the accuracy impact at the target precision.
- Limitations: Aggregate accuracy hides per-dimension and per-slice regressions; precision below 4-bit is unstable on many tasks; tooling quality varies across methods, formats, kernels, and hardware.
- Example: A 70B-parameter model at 4-bit precision has a raw lower-bound weight footprint of about 32.6 GiB before scales, metadata, activations, KV cache, and runtime overhead. Practical single-GPU deployment usually means a 40-48 GB class GPU or an explicit offload strategy, not a generic consumer GPU.
Key takeaways
- 8-bit and 4-bit quantization can carry small accuracy costs for large memory savings, but the safe envelope is workload-specific.
- Per-task accuracy varies; coding and math benchmarks tend to degrade more than open-ended generation.
- Aggregate benchmark scores are not a deployment gate; per-dimension and per-slice evaluation is.
- Sub-4-bit quantization (2-bit, 3-bit) regresses sharply on most current architectures.
- Method, precision, and format choices interact with the inference stack; the right choice is the one the runtime supports cleanly and the evaluation set accepts.
- The evaluation framework should outlive the precision choice; the same evaluators must score full-precision and quantized variants.
Definition
Quantization replaces high-precision floating-point weights (FP16, BF16, FP32) with lower-precision integer or low-bit floating-point representations (INT8, INT4, FP8). The model's parameter count is unchanged; the bits per parameter shrinks. Memory footprint, memory bandwidth, and matrix-multiply cost all drop proportionally, often with disproportionately small accuracy losses when the quantization scheme is well chosen.
The terms are easy to mix up, but they are not interchangeable:
- Precisions: INT8, INT4, FP8, FP16, BF16, and FP32 describe numerical representation.
- Quantization methods: GPTQ, AWQ, AutoRound, SmoothQuant, bitsandbytes, HQQ, and similar approaches describe how a model is quantized or loaded.
- File formats and runtime ecosystems: GGUF is a model file format and inference ecosystem used heavily with GGML, llama.cpp, LM Studio, GPT4All, and Ollama.
- Training approaches: QLoRA is primarily a fine-tuning approach that trains adapters over a quantized base, not a direct substitute for inference-time post-training quantization.
INT8 is broadly supported and often close to the 16-bit baseline. INT4 can be excellent, especially with methods such as GPTQ, AWQ, or AutoRound, but the error profile is more task-sensitive. FP8 is hardware-native on newer accelerators and can be attractive when the serving stack supports it cleanly. Two-bit and three-bit paths remain research or highly specialized choices for most production systems.
The choice is per workload, not per model.
When this matters
Quantization is worth the operational complexity when one or more of these holds:
- Memory or VRAM constrains which model can be served at all.
- Per-token cost is a binding constraint on margin.
- Latency targets cannot be met at full precision on the available hardware.
- The deployment fleet includes commodity or edge hardware where higher precision is not an option.
If the model already fits comfortably and per-token cost is not a binding constraint, the operational cost of running and validating a quantized variant may exceed the savings.
How it works
Why quantization works at all
Neural network weights are often robust to small perturbations. Reducing precision adds quantization noise, but the noise can be small relative to the signal the model has learned. Modern calibration-based methods such as GPTQ, AWQ, and AutoRound use representative samples to align quantization choices with real activations or weight distributions. Other inference paths, including some bitsandbytes, HQQ, and GGUF variants, can be loaded without a separate calibration job. The result is that lower precision is often recoverable without retraining, but the recovery mechanism depends on the method.
Where it stops working
Quantization noise interacts badly with tasks that depend on precise numerical reasoning, long-horizon coherence, rare-token generation, or exact output shape. Coding tasks, multi-step math, long-context workloads, and structured output formats can show larger regressions than open-ended generation at the same bit width. Below four bits, the noise floor approaches the signal on more layers, and per-task regressions become hard to predict from aggregate benchmarks.
Format choice as an operational concern
The method that scores best on a benchmark is not always the path the inference runtime handles cleanly. TGI, vLLM, TensorRT-LLM, llama.cpp, and other runtimes do not support every method, precision, kernel, and hardware target equally. A 4-bit AWQ variant may be a better fit than GPTQ in one stack and worse in another. Method and format choice is a deployment decision, not only a research decision, and the evaluation framework should be format-agnostic so any variant can be scored against the same rubrics.
Validation against versioned evaluators
The right way to validate a quantized model is the same way every other model change is validated: run the held-out evaluation set against the new variant using the existing managed evaluators, compare scorecard to scorecard, and gate on per-dimension and per-slice regressions, not on an aggregate. A quantized variant that loses 0.4 points of MMLU but loses 4 points on a customer-critical slice is not a win even if the aggregate looks fine.
Calibration data for the quantizer
GPTQ, AWQ, and AutoRound all need calibration data: a sample of representative inputs used to align the quantization grid. The closer that sample is to production distribution, the better the recovered accuracy. A model quantized with generic calibration data may underperform the same model quantized with domain-specific calibration on the same evaluation set. This is not true of every 4-bit loading path, so do not treat "4-bit" as synonymous with "calibration-based."
Example
A team operates a 70B-parameter assistant on multi-GPU servers at full precision. The cost per session is constrained by GPU rental, and the latency budget is tight. The numbers below are illustrative; a real benchmark should publish the exact model checkpoint, quantization library, calibration data, runtime, GPU SKU, driver and CUDA versions, batch size, prompt and output lengths, pricing snapshot, and repeated-run variance.
The team evaluates three variants against its existing scorecard (five dimensions, held-out set of 1,200 examples, four production slices):
- 8-bit (INT8): Roughly half the memory, no measurable per-dimension regression, throughput improvement around 1.8x. Wins on every slice.
- 4-bit (AWQ): Roughly a quarter of the raw weight memory. A 70B model still needs about 32.6 GiB for raw INT4 weights before overhead, so the practical target is a 40-48 GB class GPU, a multi-GPU setup, or an offload-aware local runtime. Aggregate accuracy is down by 1.6 points; one slice (technical troubleshooting) regresses by 4.2 points, the others by less than 1.
- 4-bit (GPTQ): Similar memory and throughput. Aggregate within 0.4 points of AWQ but regresses 5.8 points on the technical slice.
The team ships 8-bit as the default and 4-bit AWQ behind a flag for cost-bounded surfaces where the technical-troubleshooting slice does not apply. The decision is documented in the scorecard, not the chat thread. Six months later, when a new quantization format ships, the same scorecard runs against a new variant in an afternoon, because the evaluators and the dataset outlive the format choice.
Limitations
- Aggregate benchmarks lie. A model that scores within 1% on MMLU can lose 5% on the workload that pays the bills. Per-slice evaluation is not optional.
- Hardware fit is not raw weight fit. Raw parameter storage is a lower bound. KV cache, activations, quantization metadata, runtime allocations, and context length all affect whether the model actually fits on a device.
- Sub-4-bit is experimental. 2-bit and 3-bit results vary widely by architecture, format, and calibration; they belong in research workflows, not production defaults.
- Latency improvements depend on the runtime. Quantized weights only help latency if the runtime supports them natively; otherwise dequantization at load time can erase the savings.
- Cost claims are stale fast. Hardware pricing, cloud GPU availability, and per-token rates move quickly; treat cost numbers in any single source as a snapshot, not a forecast.
- Quantization noise compounds with other lossy techniques. Stacking aggressive quantization with aggressive context compression or speculative decoding can produce regressions that none of the techniques produce in isolation.
- Fine-tuning at low precision (QLoRA and friends) is a different decision. Training-time quantization has different stability characteristics than inference-time quantization; do not assume scoreboard results transfer.
Evidence and sources
- "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers," Frantar et al., 2023, https://arxiv.org/abs/2210.17323, for the foundational post-training quantization method at 4-bit.
- "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," Lin et al., 2023, https://arxiv.org/abs/2306.00978, for the activation-aware approach that often outperforms naive GPTQ on aligned tasks.
- "QLoRA: Efficient Finetuning of Quantized LLMs," Dettmers et al., 2023, https://arxiv.org/abs/2305.14314, for the training-time quantization approach combining 4-bit weights with higher-precision adapters.
- Hugging Face Transformers quantization documentation, https://huggingface.co/docs/transformers/quantization/overview, for the distinction between quantization methods and their deployment constraints.
- vLLM quantization documentation, https://docs.vllm.ai/en/latest/features/quantization/, for runtime and hardware support differences across quantization paths.
- NVIDIA TensorRT-LLM quantization documentation, https://nvidia.github.io/TensorRT-LLM/features/quantization.html, for hardware-aware FP8, INT8, and INT4 serving considerations.
- Hugging Face GGUF documentation, https://huggingface.co/docs/hub/gguf, for GGUF as a file format and local inference ecosystem rather than a quantization algorithm.
- Meta Llama 3.1 70B model card, https://huggingface.co/meta-llama/Meta-Llama-3.1-70B, and GPU vendor specifications for memory-capacity checks when estimating whether a 70B 4-bit model fits on one device.
FAQ
Is 8-bit always safe? Often, but not always. Aggregate regressions are frequently small, but per-slice evaluation is still the gate because exceptions exist.
Is 4-bit safe for production? Sometimes yes, with the right method, runtime, hardware, and validation against per-dimension and per-slice scores. Coding, long-context, and multi-step reasoning workloads can regress more than open-ended generation and may not tolerate the cut.
GPTQ, AWQ, or GGUF: which format wins? None universally, and the names do not describe the same kind of thing. GPTQ and AWQ are quantization methods or model families; GGUF is a file format and local inference ecosystem. The right choice is the one the deployment runtime handles cleanly and the evaluation set says is acceptable.
Does quantization affect tool calling and structured output? It can. Schema adherence is sensitive to rare-token probabilities and exact output shape, but this should be treated as a workload hypothesis until tested. Run tool-calling and JSON-mode workloads explicitly against the evaluator set rather than assuming aggregate scores transfer.
Can we quantize and then fine-tune? QLoRA-style fine-tuning of a 4-bit base with higher-precision adapters works well for training cost reduction. The resulting model behaves differently from post-training quantization of an already-tuned full-precision model; validate accordingly.
How often does the cost-versus-accuracy tradeoff need to be reevaluated? Whenever the model changes, the runtime changes, or the workload distribution changes. The evaluation framework is what makes that re-validation cheap; if every revisit requires hand-curating a new dataset, the savings disappear.