Discretizing and quantizing neural nets
2025-07-25 — 2026-06-13
In Which Neural Network Quantization Is Surveyed From Its Vector-Quantization Roots to Binary Extremes, With the Role of Calibration Data in Distributing Precision Across Weights Examined.
Quantization, in the general sense, maps a continuous or large set of values to a smaller, discrete set. The notion has roots in signal processing and information theory; Vector Quantization (VQ) emerged in the late 1970s and early 1980s, e.g. the Linde-Buzo-Gray algorithm (Linde, Buzo, and Gray 1980). VQ represents vectors from a continuous space using a finite set of prototype vectors from a “codebook,” typically designed by clustering a large sample of data. The objective is to minimise representation error (distortion) for a given codebook size (rate) — the rate-distortion trade-off at the centre of information theory.
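A minimal sketch of codebook design by clustering, to make the rate-distortion trade-off tangible. This uses plain Lloyd iterations with random initialisation, not the splitting procedure of the original Linde-Buzo-Gray paper; the function names (`fit_codebook`, `encode`, `distortion`) and the Gaussian toy data are made up for illustration.

```python
import numpy as np

def fit_codebook(data, k, iters=20, seed=0):
    """Lloyd-style codebook design: alternate nearest-codeword assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise codewords from k distinct data points (an arbitrary choice).
    codebook = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every vector to every codeword: shape (n, k).
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        codes = d2.argmin(axis=1)
        for j in range(k):
            members = data[codes == j]
            if len(members) > 0:  # leave codewords with no members where they are
                codebook[j] = members.mean(axis=0)
    return codebook

def encode(data, codebook):
    """Index of the nearest codeword for each vector."""
    d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def distortion(data, codebook):
    """Mean squared error between vectors and their assigned codewords."""
    codes = encode(data, codebook)
    return float(((data - codebook[codes]) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 2))
small = fit_codebook(data, k=4)    # 2 bits per vector
large = fit_codebook(data, k=32)   # 5 bits per vector
d_small = distortion(data, small)
d_large = distortion(data, large)
print(d_small, d_large)
```

A larger codebook costs more bits per vector and buys lower distortion, which is the trade-off in miniature.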
The way we quantize NNs has its own bloody-minded reinventions, as usual. Instead of arbitrary codebooks, we typically use structured ones. Converting float32 weights and activations to int8 (or to binary values) is scalar quantization: we partition a continuous interval of real numbers into a finite set of ordered bins. The motivations are speed (specialised hardware) and storage size (see also compression), so that we can run on the edge.
Making neural networks small and fast, which is the point of this technique, is a high-value problem, so lots of money is pouring into it. I am almost certainly missing important recent developments.
1 Uniform vs logarithmic
Scalar quantization comes in two main families.
- Uniform quantization divides the data interval into bins of equal width. The affine mapping below is the canonical example. Simple, computationally cheap, the default.
- Logarithmic quantization uses bins that are narrower near zero and wider further out. Useful when the values themselves are dense near zero and sparse further out, as with NN weights that cluster around zero or audio signals that span many orders of magnitude. We rarely write down “logarithmic quantization” explicitly in NN papers, but it effectively happens in low-bit float types like bfloat16 and fp8.
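A rough comparison of the two families at the same level budget. Everything here is an illustrative assumption: the Laplace-distributed toy data standing in for zero-heavy weights, the power-of-two spacing for the logarithmic levels, and the helper names.

```python
import numpy as np

def uniform_levels(max_abs, n_levels):
    """Equally spaced reconstruction levels on [-max_abs, max_abs]."""
    return np.linspace(-max_abs, max_abs, n_levels)

def log_levels(max_abs, n_per_sign, ratio=2.0):
    """Geometrically spaced levels: +/-max_abs, +/-max_abs/ratio, ..., plus zero."""
    mags = max_abs / ratio ** np.arange(n_per_sign)
    return np.sort(np.concatenate([-mags, [0.0], mags]))

def quantize_to_levels(x, levels):
    """Snap each value to its nearest level."""
    idx = np.abs(x[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

rng = np.random.default_rng(0)
x = rng.laplace(scale=0.05, size=10_000)  # zero-heavy toy data
x = np.clip(x, -1.0, 1.0)

uni = uniform_levels(1.0, 15)
log = log_levels(1.0, 7)  # 7 + 1 + 7 = 15 levels, same budget as above

mse_uni = float(np.mean((x - quantize_to_levels(x, uni)) ** 2))
mse_log = float(np.mean((x - quantize_to_levels(x, log)) ** 2))
print(mse_uni, mse_log)
```

On data this concentrated around zero, the logarithmic grid should come out ahead; on data spread evenly across the range the ranking would be expected to flip.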
2 Affine mapping
The default in NN quantization. Map a floating-point range \([r_{min}, r_{max}]\) to a range of \(B\)-bit integers via a scale factor \(S\) (a float) and a zero-point \(Z\) (an integer). \(S\) is the step size; \(Z\) ensures the real value zero maps exactly to an integer.
Quantize: \[ q = \text{round}(r/S + Z) \]
Dequantize: \[ r' = S(q - Z) \]
\(r'\) approximates the original \(r\); the quantization error is \(r - r'\).
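The two formulas translate almost directly into NumPy. This is a sketch assuming unsigned \(B\)-bit integers and a range widened to include zero; the clipping step and the function names are additions for illustration, not part of the formulas above.

```python
import numpy as np

def affine_params(r_min, r_max, bits=8):
    """Scale S and zero-point Z mapping [r_min, r_max] onto unsigned B-bit integers."""
    q_min, q_max = 0, 2 ** bits - 1
    # Widen the range to include 0 so that real zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    S = (r_max - r_min) / (q_max - q_min)  # assumes r_max > r_min
    Z = int(round(q_min - r_min / S))
    return S, Z, q_min, q_max

def quantize(r, S, Z, q_min, q_max):
    """q = round(r/S + Z), clipped to the representable integer range."""
    q = np.round(r / S + Z)
    return np.clip(q, q_min, q_max).astype(np.int32)

def dequantize(q, S, Z):
    """r' = S (q - Z)."""
    return S * (q.astype(np.float64) - Z)

r = np.array([-1.0, -0.3, 0.0, 0.4, 2.0])
S, Z, q_min, q_max = affine_params(r.min(), r.max())
q = quantize(r, S, Z, q_min, q_max)
r_hat = dequantize(q, S, Z)
print(q, r - r_hat)
```

For values inside the range, the error per element is at most half a step, \(S/2\), and real zero survives the round trip exactly.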
For a typical linear layer we want \(Y = WX + b\), where \(W\) is the weight matrix, \(X\) are the input activations, and \(b\) is the bias. With weights and activations quantized to integers (\(W_q\), \(X_q\)): \[ Y_q = W_q X_q + b_q \] This runs in integer arithmetic, which is fast on most processors. The output’s scale factor is computed from the inputs’, and we convert back to float if a downstream layer needs that. Many corner cases — overflow, requantization, asymmetric ranges — I’m eliding here; see Jacob et al. (2017) for gory OCD details.
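A toy version of the integer linear layer, under a simplifying assumption: symmetric quantization (\(Z = 0\)), which sidesteps the zero-point cross terms among the corner cases mentioned above. The bias is quantized at scale \(S_W S_X\) so it can be added straight into the integer accumulator. Names and shapes are illustrative.

```python
import numpy as np

def symmetric_quantize(r, bits=8):
    """Symmetric special case of the affine map: Z = 0, range [-max|r|, max|r|]."""
    q_max = 2 ** (bits - 1) - 1
    S = float(np.max(np.abs(r))) / q_max
    q = np.clip(np.round(r / S), -q_max, q_max).astype(np.int8)
    return q, S

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))
X = rng.normal(size=(16, 3))
b = rng.normal(size=(4, 1))

W_q, S_w = symmetric_quantize(W)
X_q, S_x = symmetric_quantize(X)

# The integer product W_q X_q carries scale S_w * S_x, so the bias uses that scale too.
S_y = S_w * S_x
b_q = np.round(b / S_y).astype(np.int32)

# Sums of int8 products overflow int8, so accumulate in int32.
Y_q = W_q.astype(np.int32) @ X_q.astype(np.int32) + b_q

Y_hat = S_y * Y_q          # convert back to float for a float downstream layer
Y_ref = W @ X + b
rel_err = float(np.linalg.norm(Y_hat - Y_ref) / np.linalg.norm(Y_ref))
print(rel_err)
```

A real kernel would requantize `Y_q` back to 8 bits for the next integer layer rather than going straight to float; that step is left out here.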
3 What to quantize
Now, which calculations do we quantize?
- Weight-only quantization shrinks the model on disk and in memory. Activations stay in floating point. AFAICT this means that the matrix multiplication still pays a per-call dequantisation cost (cast int weights → float, then do a float-float matmul).
- Weight and activation quantization is full-stack. With both as integers, the matmul itself runs entirely in integer arithmetic, which is faster and more energy-efficient on compatible hardware.
For LLMs, whose output is a discrete token distribution anyway, full-stack feels well-matched.
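To illustrate the weight-only path described above, here is a sketch with symmetric int8 weights and one scale per output row (both choices are assumptions for the example, as is the function name `weight_only_matvec`). The weights are stored as integers but cast back to float at call time, so the matmul itself is still float arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128)).astype(np.float32)
x = rng.normal(size=(128,)).astype(np.float32)

# One scale per output row (per-channel), symmetric int8.
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

def weight_only_matvec(W_q, scales, x):
    # Dequantize at call time, then do an ordinary float matmul.
    W_deq = W_q.astype(np.float32) * scales
    return W_deq @ x

y_ref = W @ x
y_wo = weight_only_matvec(W_q, scales, x)

stored_bytes = W_q.nbytes + scales.nbytes   # what we keep on disk / in memory
rel_err = float(np.linalg.norm(y_wo - y_ref) / np.linalg.norm(y_ref))
print(W.nbytes, stored_bytes, rel_err)
```

The storage saving is real, but the arithmetic saving is not: the float matmul is unchanged and the cast is extra work per call.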
4 When to quantize
Do we quantize a fully-trained float model, or train with quantization in the loop? Terms of art: post-training (PTQ) vs quantization-aware training (QAT).
Jacob et al. (2017) distinguish:
- Post-training quantization (PTQ) — convert a trained float model with minimal effort. Cheap, sometimes lossy.
- Quantization-aware training (QAT) — fine-tune (or train from scratch) with simulated quantization in the forward pass, so the network learns weights that round nicely. More effort, usually higher accuracy at low bit-widths.
If we go all the way and train directly in quantized weights, we might start working with discrete gradients.
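The “simulated quantization in the forward pass” of QAT can be sketched on a toy linear regression. The assumptions are loud ones: a hand-written gradient, 4-bit symmetric fake quantization, and a straight-through estimator (treating the rounding step as the identity in the backward pass), which is one common trick and not necessarily what any given framework does. It shows the mechanics only and makes no claim about beating PTQ.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Quantize then immediately dequantize: values stay float but sit on the integer grid."""
    q_max = 2 ** (bits - 1) - 1
    S = np.max(np.abs(w)) / q_max
    return S * np.clip(np.round(w / S), -q_max, q_max)

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
w_true = rng.normal(size=8)
y = X @ w_true

# Latent full-precision weights; small nonzero init so the scale S is never zero.
w = 0.01 * rng.normal(size=8)
lr = 0.05
for step in range(300):
    w_fq = fake_quantize(w)                 # forward pass sees quantized weights
    residual = X @ w_fq - y
    grad_w_fq = X.T @ residual / len(X)     # gradient w.r.t. the quantized weights
    # Straight-through estimator: pretend d(w_fq)/d(w) is the identity.
    w -= lr * grad_w_fq

loss_initial = float(np.mean(y ** 2))       # loss of the all-zero predictor
loss_qat = float(np.mean((X @ fake_quantize(w) - y) ** 2))
print(loss_initial, loss_qat)
```

The latent weights `w` stay in float throughout; only the copy used in the forward pass is rounded, which is what lets ordinary gradient descent keep working.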
5 Calibration data
Not every weight deserves the same number of bits. The weights that are most important (in the sense that they move the output most) should be quantized most carefully. To find out which ones those are, we can run the full-resolution float model over a sample of text and watch which activations light up. This is called a calibration pass. The statistics we collect tell the quantizer where to spend its limited precision budget.
Terms of art: the importance matrix (imatrix) feeding llama.cpp’s K-quant and IQ-quant formats; activation quantization (AQ) in the GPU world; activation-aware salient-channel scaling in AWQ (Lin et al. 2023); and second-order Hessian information in GPTQ (Frantar et al. 2023). I don’t know how all of these work, but I have messed around with imatrix.
The imatrix is computed by running the float model over a special calibration corpus chosen to exercise all the capabilities we care about. A layer’s weights form a matrix mapping the input vector to the output vector, with one column per entry of the input vector: column \(j\) holds the weights that input activation \(j\) gets multiplied by, so a large, frequently active input makes rounding errors in its column matter more. During that inference pass we accumulate, per weight column, a sum of squared activations, which is a cheap proxy for “how much does perturbing this column hurt the output?”. The K-quant and IQ-quant routines then hand more bits (or smaller rounding error) to the salient columns. There is no runtime cost: the imatrix is consumed at quantization time and thrown away.
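A toy rendering of that accumulate-then-weight idea. This is not llama.cpp’s code: the synthetic calibration activations, the 4-bit symmetric grid, and the grid search over scales in `best_scale` are all assumptions chosen to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_tokens = 32, 8, 512

W = rng.normal(size=(n_out, n_in))
# Hypothetical calibration activations: a few input channels are driven much harder.
channel_gain = np.ones(n_in)
channel_gain[:4] = 10.0
calib = rng.normal(size=(n_tokens, n_in)) * channel_gain

# Importance: one running sum of squared activations per input channel (weight column).
importance = np.zeros(n_in)
for x in calib:
    importance += x ** 2

def quantize_row(w, scale, q_max=7):
    return scale * np.clip(np.round(w / scale), -q_max, q_max)

def best_scale(w, weights, q_max=7, n_grid=64):
    """Grid-search the scale minimising the importance-weighted squared rounding error."""
    base = np.max(np.abs(w)) / q_max
    candidates = base * np.linspace(0.5, 1.2, n_grid)
    errs = [np.sum(weights * (w - quantize_row(w, s, q_max)) ** 2) for s in candidates]
    return candidates[int(np.argmin(errs))]

def quantize_matrix(W, weights):
    return np.stack([quantize_row(w, best_scale(w, weights)) for w in W])

W_plain = quantize_matrix(W, np.ones(n_in))   # ignores the calibration data
W_imat = quantize_matrix(W, importance)       # favours the salient columns

def weighted_error(W_hat):
    """The proxy objective: rounding error weighted by column importance."""
    return float(np.sum(importance * (W - W_hat) ** 2))

def output_error(W_hat):
    """Actual mean squared change in layer output on the calibration activations."""
    return float(np.mean((calib @ (W - W_hat).T) ** 2))

print(weighted_error(W_plain), weighted_error(W_imat))
print(output_error(W_plain), output_error(W_imat))
```

By construction the importance-weighted search can do no worse than the unweighted one on the proxy objective; whether that carries over to the actual output error depends on how well the per-column proxy matches the real activation statistics.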
The imatrix, AWQ and GPTQ are the same idea at different resolutions (activation quantization is the odd one out — it’s about the precision of the activations themselves, not which weights to protect). That per-column sum of squared activations is the diagonal of the activations’ second-moment matrix \(\mathbb{E}[xx^\top]\) — one number per input channel, how hard it gets driven. AWQ reads off per-channel magnitudes to find and scale its salient channels; GPTQ keeps the whole matrix, off-diagonal correlations and all, as the Hessian it inverts to shunt rounding error onto the weights that can absorb it. Keeping more of the matrix trades compute for a sharper error model.
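To spell out the “different resolutions” claim, here is a short derivation under a simplifying assumption: we look at a single output row, with rounding error \(\delta w\) (a symbol introduced here for illustration) and input \(x\). The expected squared change in that output is
\[ \mathbb{E}\big[(\delta w^\top x)^2\big] = \delta w^\top H\, \delta w = \sum_j H_{jj}\,\delta w_j^2 + \sum_{j \neq k} H_{jk}\,\delta w_j\,\delta w_k, \qquad H = \mathbb{E}[xx^\top]. \]
Keeping only the first sum gives the diagonal, per-column weighting; keeping both sums is the full quadratic form that a Hessian-based method works with, up to constant factors in how the Hessian is defined.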
Williams and Aletras (2024) ran an experiment and found that the choice of calibration set has a measurable, sometimes large, effect on downstream accuracy after PTQ and pruning, so take care.
The especially well-regarded calibration file bartowski1182, curated by bartowski, was the de-facto standard imatrix corpus at time of writing. It is hand-assembled to hit the “hard” modes — code, maths problems, multilingual snippets, chat — rather than sampling random lines from a web dump, on the theory that a tougher, more diverse calibration workload yields activation statistics closer to how the model gets used in anger. Downstream datasets like lemon07r/pile-calibration package it up with Pile-10k as a plug-and-play corpus for quantization scripts.
There are ongoing arguments (e.g. r/LocalLLaMA, llama.cpp devs) about whether imatrix quants overfit to the calibration distribution. Lore is that a good imatrix helps at low bit-widths and that the marginal differences between reasonable calibration sets are small and hard to measure.
6 Bit-budget regimes
6.1 Low-bit integers
Very popular for offline consumer hardware: 8-bit, 4-bit, sometimes 2-bit integers.
Ternary quantization (three levels, typically \(\{-1, 0, +1\}\)) sounds handy. PrismML’s Bonsai is the first one of these I saw and they make impressive claims about training using this ternary layout.
6.2 Binary networks
The limit case: one bit per weight (or per activation).
- BinaryConnect (Courbariaux, Bengio, and David 2016) quantizes weights to \(\{-1, +1\}\) during forward and backward passes. Deep networks can apparently still train under this extreme constraint (!).
- DoReFa-Net (Zhou et al. 2018) extends binarization to weights, activations and gradients at low bit-widths, so the entire forward and backward pass can run on bitwise operations.
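One way to see why binary networks map onto bitwise operations: a dot product of two \(\{-1, +1\}\) vectors is the number of sign agreements minus the number of disagreements, which is an XNOR followed by a popcount. The sketch below is a generic illustration of that identity, not code from either paper; the packing into a Python integer and the helper names are assumptions.

```python
import numpy as np

def binarize(w):
    """Deterministic sign binarization to {-1, +1} (zero is sent to +1)."""
    return np.where(w >= 0, 1, -1).astype(np.int8)

def pack_bits(b):
    """Encode a {-1, +1} vector as a Python int, with bit value 1 meaning +1."""
    value = 0
    for i, v in enumerate(b):
        if v > 0:
            value |= 1 << i
    return value

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches minus mismatches

rng = np.random.default_rng(0)
n = 64
w = binarize(rng.normal(size=n))
x = binarize(rng.normal(size=n))

ref = int(w.astype(np.int32) @ x.astype(np.int32))
fast = binary_dot(pack_bits(w), pack_bits(x), n)
print(ref, fast)
```

On real hardware the packing is done once per weight tensor and the XNOR and popcount are single instructions over machine words, which is where the speed comes from.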
