Discretizing and quantizing neural nets
2025-07-25 — 2025-07-25
1 Origin story
Quantization, in a general sense, is the process of mapping a continuous or large set of values to a smaller, discrete set. This concept has roots in signal processing and information theory; see Vector Quantization (VQ), which emerged in the late 1970s and early 1980s. Think things like the Linde-Buzo-Gray (LBG) algorithm (Linde, Buzo, and Gray 1980). VQ represents vectors from a continuous space using a finite set of prototype vectors from a “codebook,” often designed by clustering a large sample of data. The objective is to minimize the representation error (distortion) for a given codebook size (rate), a problem central to information theory’s rate-distortion trade-off.
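To make the codebook idea concrete, here is a minimal Lloyd/LBG-flavoured vector quantizer in numpy; the codebook size and toy data are my own illustrative choices, not from the paper:

```python
# A toy Lloyd / LBG-style vector quantizer: design a codebook by clustering,
# then represent each vector by the index of its nearest codeword.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 2))   # vectors to quantize
K = 16                                # codebook size (controls the "rate")

# Initialize the codebook with random data points, then alternate
# assignment and centroid steps.
codebook = data[rng.choice(len(data), size=K, replace=False)]
for _ in range(20):
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    for k in range(K):
        if (assign == k).any():
            codebook[k] = data[assign == k].mean(axis=0)

# Quantize: each vector becomes a codeword index; distortion is the MSE
# between the data and its codeword reconstruction.
codes = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1).argmin(axis=1)
distortion = ((data - codebook[codes]) ** 2).sum(axis=1).mean()
print(f"{K} codewords, mean distortion {distortion:.4f}")
```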
The practical way we quantize things in NNs is… well, NNs have their own bloody-minded trade-offs and reinventions as usual. Instead of arbitrary codebooks, we typically use structured ones. The common practice of converting 32-bit floating-point (`float32`) numbers to 8-bit integers (`int8`) or even binary values is a form of scalar quantization in which we assume the bins are ordered: we partition a continuous interval of real numbers into a finite set of ordered bins. The idea is that these should enable specialized hardware implementations, for speed, and also just be smaller to store, for compression.
AFAICT there are two major types of quantization in practice:
- Uniform Quantization: This is the most common approach, where the interval is divided into bins of equal width. The affine mapping method is a form of uniform quantization. It is simple and computationally efficient, making it a default for many applications.
- Logarithmic Quantization: In some cases, the distribution of values (like neural network weights, which are often clustered around zero) is highly non-uniform. Here, non-uniform binning can provide a better trade-off between precision and range. For instance, logarithmic quantization, which uses bins that are narrower near zero and wider further away, can represent small values with high accuracy while still covering a large overall range. You see this a lot in audio codecs, and it has some nice properties if you want to do “multiplicative things” with your signal. You don’t see this explicitly so often in NNs, but it is effectively what happens in low-bit float types like `bfloat16` and `fp8` (see the sketch after this list).
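To see why non-uniform bins help for values clustered around zero, here is a toy comparison of uniform binning against a mu-law-style logarithmic quantizer; the bit-width and mu are arbitrary choices of mine:

```python
# Toy comparison of uniform vs. logarithmic (mu-law-style) scalar
# quantization for values clustered near zero.
import numpy as np

def quantize_uniform(x, bits=8, x_max=1.0):
    """Equal-width bins over [-x_max, x_max]."""
    levels = 2 ** bits - 1
    step = 2 * x_max / levels
    return np.round(np.clip(x, -x_max, x_max) / step) * step

def quantize_log(x, bits=8, x_max=1.0, mu=255.0):
    """mu-law companding: narrow bins near zero, wide bins near +/-x_max."""
    x = np.clip(x, -x_max, x_max)
    y = np.sign(x) * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu)
    y_q = quantize_uniform(y, bits=bits, x_max=1.0)   # uniform in the warped domain
    return np.sign(y_q) * x_max * np.expm1(np.abs(y_q) * np.log1p(mu)) / mu

x = np.random.default_rng(0).normal(scale=0.05, size=100_000)  # clustered near 0
for quantizer in (quantize_uniform, quantize_log):
    mse = np.mean((x - quantizer(x, bits=4)) ** 2)
    print(f"{quantizer.__name__}: MSE = {mse:.2e}")
```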
The default form of quantization in neural networks is the affine mapping method. This uniform technique maps a floating-point range $[r_{\min}, r_{\max}]$ onto an integer range $[q_{\min}, q_{\max}]$ (e.g. $[-128, 127]$ for `int8`).

The mapping from a float $r$ to an integer $q$ is parameterized by a real-valued scale $S$ and an integer zero-point $Z$,

$$S = \frac{r_{\max} - r_{\min}}{q_{\max} - q_{\min}}, \qquad Z = \operatorname{round}\!\left(q_{\min} - \frac{r_{\min}}{S}\right).$$

The quantization process is given by:

$$q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{r}{S}\right) + Z,\; q_{\min},\; q_{\max}\right)$$

And the de-quantization back to a real value is:

$$\hat{r} = S\,(q - Z)$$

Here, $S$ sets the width of the (equal-sized) bins and $Z$ is the integer that real zero maps to, so zero is represented exactly.

For a typical layer in a neural network, such as a linear layer, the computation is a matrix multiplication followed by an addition:

$$y = W x + b$$

When we quantize the weights and activations to integer-only representations ($q_W$, $q_x$ with scales and zero-points $S_W, Z_W, S_x, Z_x$), substituting the de-quantization formula gives

$$y \approx S_W S_x\,(q_W - Z_W)(q_x - Z_x) + b,$$

so the expensive inner products can be accumulated entirely in integer arithmetic, with the floating-point scales applied once at the end (and the bias folded into the same rescaling).
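Here is a minimal numpy sketch of this scheme; the shapes, ranges, and names are toy choices of mine:

```python
# Minimal numpy sketch of the affine scheme above: per-tensor scale and
# zero-point, int8 storage, and a matmul whose inner products are
# accumulated in integers.
import numpy as np

def quant_params(r_min, r_max, q_min=-128, q_max=127):
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z

def quantize(r, S, Z, q_min=-128, q_max=127):
    return np.clip(np.round(r / S) + Z, q_min, q_max).astype(np.int8)

def dequantize(q, S, Z):
    return S * (q.astype(np.int32) - Z)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(8,)).astype(np.float32)

S_w, Z_w = quant_params(W.min(), W.max())
S_x, Z_x = quant_params(x.min(), x.max())
q_w, q_x = quantize(W, S_w, Z_w), quantize(x, S_x, Z_x)

# All the multiply-accumulates happen on zero-point-shifted int32 values;
# the float scales are applied exactly once at the end.
acc = (q_w.astype(np.int32) - Z_w) @ (q_x.astype(np.int32) - Z_x)
y_int = S_w * S_x * acc

print(np.abs(x - dequantize(q_x, S_x, Z_x)).max())  # per-element quantization error
print(np.abs(W @ x - y_int).max())                  # error of the integer-accumulated matmul
```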
In NNs we need to decide whether to quantize from the start of training or only after training. If we quantize from the start, we also need to think about how to propagate gradients through discrete operations.
2 Binary networks
The most aggressive form of quantization: binarization.
- BinaryConnect (Courbariaux, Bengio, and David 2016) focused on quantizing only the network weights to binary values (+1 and -1) during the forward and backward passes. It demonstrated that deep neural networks could still be trained effectively despite this extreme weight quantization, which acts as a form of regularization. The major theoretical tool was the Straight-Through Estimator (STE). Since the sign function used for binarization has zero gradient almost everywhere, it prevents learning via backpropagation. The STE bypasses this by simply passing the gradient from the output of the sign function directly to its input during the backward pass, treating the function as an identity for the purpose of gradient calculation (a minimal sketch appears after this list).
- DoReFa-Net (Zhou et al. 2018) extended these ideas by proposing methods to quantize not only the weights but also the activations and gradients to low bit-widths. This allowed the entire forward and backward pass to be accelerated using efficient bitwise operations, further increasing computational efficiency.
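Here is a toy PyTorch-style sketch of the STE trick for sign binarization; it is my own version, not the original BinaryConnect code:

```python
# Toy sketch of the straight-through estimator (STE) for sign() binarization.
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)                 # forward: hard +/-1 weights

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Straight-through: pass the gradient as if sign() were the identity.
        # A common variant also cancels the gradient where |w| > 1.
        return grad_output * (w.abs() <= 1).to(grad_output.dtype)

# Keep real-valued "latent" weights; binarize them in each forward pass.
w_real = torch.randn(64, 32, requires_grad=True)
x = torch.randn(16, 64)
y = x @ BinarizeSTE.apply(w_real)
y.sum().backward()                           # gradients flow into w_real
print(w_real.grad.abs().sum())
```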
3 Low-bit integers
Sometimes you want a couple more steps than “on” and “off”.
Jacob et al. (2017) from Google seems to have introduced many of the things I see used in practice. This work detailed methods for both post-training quantization (PTQ), where a fully trained floating-point model is converted to an integer-based one without any retraining, and quantization-aware training (QAT), where the model is fine-tuned or trained from scratch while simulating the effects of quantization, often leading to higher accuracy.
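Concretely, QAT is usually implemented via “fake quantization”: the forward pass snaps values to the integer grid while the backward pass treats the rounding as the identity (STE again). A toy sketch, not the actual framework API:

```python
# Toy "fake quantization" for QAT: quantize-dequantize in the forward pass,
# identity gradient in the backward pass.
import torch

def fake_quantize(x, num_bits=8):
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (q_max - q_min)
    zero_point = torch.round(q_min - x.min() / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, q_min, q_max)
    x_hat = scale * (q - zero_point)
    # Forward returns the quantized value, backward sees the identity.
    return x + (x_hat - x).detach()

w = torch.randn(32, 32, requires_grad=True)
loss = (fake_quantize(w) ** 2).sum()
loss.backward()          # gradients flow as if quantization were a no-op
```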
3.1 Quantizing Weights, Activations, or Both?
We didn’t say what was quantized!
- Weight-only quantization primarily reduces the model’s storage size. Since the activations remain in floating point, the computational speedup is limited: the integer weights have to be de-quantized back to floats (or otherwise mixed with float activations) at compute time, so the matrix multiplications themselves still run in floating-point arithmetic. See the sketch after this list.
- Weight and activation quantization provides the most significant benefits for inference latency. When both weights and activations are represented as integers, the core matrix multiplication operations can be performed entirely using integer arithmetic, which is much faster and more energy-efficient on compatible hardware. This is the approach detailed by Jacob et al. (2017) and seems to be the standard for achieving maximum inference performance in industry these days.
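For contrast with the integer-only pipeline earlier, here is roughly what the weight-only path looks like: int8 weights plus a per-channel float scale, de-quantized on the fly into an ordinary float matmul. My own toy sketch, with symmetric per-channel scales:

```python
# Rough sketch of the weight-only path: int8 weights + per-output-channel
# float scale, de-quantized on the fly into a plain float matmul.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(8,)).astype(np.float32)   # activations stay float

# Symmetric per-channel quantization of the weights only.
scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
q_w = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Storage win: int8 weights + one float per row instead of float32 weights.
# Compute: de-quantize, then an ordinary float matmul -- no integer arithmetic.
W_hat = q_w.astype(np.float32) * scale
y = W_hat @ x
print(np.abs(W @ x - y).max())
```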