1 Origin story


Quantization, in a general sense, is the process of mapping a continuous or large set of values to a smaller, discrete set. This concept has roots in signal processing and information theory; search for Vector Quantization (VQ), which emerged in the late 1970s and early 1980s. Think things like the Linde-Buzo-Gray (LBG) algorithm (Linde, Buzo, and Gray 1980). VQ represents vectors from a continuous space using a finite set of prototype vectors from a “codebook,” often designed by clustering a large sample of data. The objective is to minimize the representation error (distortion) for a given codebook size (rate), a problem central to information theory’s rate-distortion trade-off.

The practical way we quantize things in NNs is… well, NNs have their own bloody-minded trade-offs and reinventions as usual. Instead of arbitrary codebooks, we typically use structured ones. The common practice of converting 32-bit floating-point (float32) numbers to 8-bit integers (int8) or even binary values is a form of scalar quantization, in which we assume the bins are ordered: we partition a continuous interval of real numbers into a finite set of ordered bins. The idea is that these enable specialized hardware implementations, for speed, and are also just smaller to store, for compression.

AFAICT there are two major types of quantization in practice:

  • Uniform Quantization: This is the most common approach, where the interval is divided into bins of equal width. The affine mapping method is a form of uniform quantization. It is simple and computationally efficient, making it a default for many applications.
  • Logarithmic Quantization: In some cases, the distribution of values (like neural network weights, which are often clustered around zero) is highly non-uniform. Here, non-uniform binning can provide a better trade-off between precision and range. For instance, logarithmic quantization, which uses bins that are narrower near zero and wider further away, can represent small values with high accuracy while still covering a large overall range. You see this a lot in audio codecs, and it has some nice properties if you want to do “multiplicative things” with your signal. You don’t see this explicitly so often, but it is effectively what happens in low-bit float types like bfloat16 and fp8 (see the sketch after this list).
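
To make the uniform-vs-logarithmic distinction concrete, here is a small numpy sketch (the function names and parameters are mine, not from any library) that quantizes a handful of values both ways. Note how the equal-width bins collapse all the small values together, while the log-spaced bins keep their relative precision.

```python
import numpy as np

# Sketch comparing equal-width (uniform) bins with log-spaced bins.
# Function names and parameters here are illustrative, not from a library.

def uniform_quantize(x, n_bins=16, x_max=1.0):
    """Equal-width bins covering [-x_max, x_max]."""
    step = 2 * x_max / n_bins
    return np.clip(np.round(x / step), -n_bins // 2, n_bins // 2 - 1) * step

def log_quantize(x, n_bins=16, x_max=1.0, eps=1e-3):
    """Bins equally spaced in log-magnitude: narrow near zero, wide far out."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), eps, x_max)
    t = np.log(mag / eps) / np.log(x_max / eps)        # map magnitude to [0, 1]
    t_q = np.round(t * (n_bins - 1)) / (n_bins - 1)    # snap to a log-spaced grid
    return sign * eps * (x_max / eps) ** t_q

x = np.array([0.002, 0.01, 0.05, 0.3, 0.9])
print(uniform_quantize(x))  # the three small values all collapse into one bin
print(log_quantize(x))      # small values keep roughly constant *relative* error
```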

The default form of quantization in neural networks is the affine mapping method. This uniform technique maps a floating-point range $[r_{\min}, r_{\max}]$ to a range of $B$-bit integers.

The mapping from a float $r$ to its quantized integer representation $q$ is defined by a scale factor $S$ (a float) and a zero-point $Z$ (an integer). The scale factor defines the step size of the quantization, and the zero-point ensures that the real value zero can be mapped exactly to an integer.

The quantization process is given by:
$$q = \operatorname{round}\!\left(\frac{r}{S} + Z\right)$$

And the de-quantization back to a real value is:
$$\hat{r} = S(q - Z)$$

Here, $\hat{r}$ is the approximation of the original real value $r$. The difference $r - \hat{r}$ is the quantization error.
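
Here is a minimal numpy sketch of that affine mapping, assuming signed 8-bit integers and a quantization range taken directly from the data; the helper names are mine, not from any particular framework.

```python
import numpy as np

# Minimal sketch of affine (uniform) quantization to signed int8.

def affine_params(r_min, r_max, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    S = (r_max - r_min) / (qmax - qmin)     # scale: step size of the grid
    Z = int(round(qmin - r_min / S))        # zero-point: real 0 maps exactly to Z
    return S, Z, qmin, qmax

def quantize(r, S, Z, qmin, qmax):
    return np.clip(np.round(r / S + Z), qmin, qmax).astype(np.int8)

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

r = np.random.randn(5).astype(np.float32)
S, Z, qmin, qmax = affine_params(r.min(), r.max())
q = quantize(r, S, Z, qmin, qmax)
print(r)
print(dequantize(q, S, Z))   # per-element error is at most about S / 2
```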

For a typical layer in a neural network, such as a linear layer, the computation is a matrix multiplication followed by an addition: $Y = WX + b$, where $W$ are the weights, $X$ are the input activations, and $b$ is the bias.

When we quantize the weights and activations to integer-only representations ($W_q$, $X_q$), the computation becomes $Y_q = W_q X_q + b_q$. This operation can be executed using integer arithmetic, which is efficient on many processors. You can probably see many horrible corner cases that could arise though, right? The scale factors of the inputs are used to calculate the scale factor of the output, so the result can be converted back to the floating-point domain if needed.
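
A rough sketch of that integer-only linear layer follows. To keep the corner cases at bay, it assumes symmetric quantization (zero-point of zero) for both weights and activations, accumulates the int8 products in int32, and rescales back to float with the product of the two scale factors; the helpers are illustrative, not from any particular framework.

```python
import numpy as np

# Rough sketch of an integer-only linear layer with symmetric quantization.

def sym_quantize(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

W = np.random.randn(4, 8).astype(np.float32)
X = np.random.randn(8, 3).astype(np.float32)
b = np.random.randn(4, 1).astype(np.float32)

Wq, s_w = sym_quantize(W)
Xq, s_x = sym_quantize(X)

# Accumulate the int8 products in int32 so they cannot overflow, then
# rescale back to the floating-point domain with the product of the scales.
acc = Wq.astype(np.int32) @ Xq.astype(np.int32)
Y_approx = s_w * s_x * acc + b

print(np.abs(Y_approx - (W @ X + b)).max())   # small quantization error
```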

In NNs we want to think about whether to quantize from the start of training, or only after training. If we quantize from the start, we also need to think about how to take gradients through discrete operations.

2 Binary networks

The most aggressive form of quantization: binarization.

  • BinaryConnect (Courbariaux, Bengio, and David 2016) focused on quantizing only the network weights to binary values (+1 and -1) during the forward and backward passes. It demonstrated that deep neural networks could still be trained effectively despite this extreme weight quantization, which acts as a form of regularization. The major theoretical tool was the Straight-Through Estimator (STE). Since the sign function used for binarization has zero gradients almost everywhere, it would prevent learning via backpropagation. The STE bypasses this by simply passing the gradient from the output of the sign function directly to its input during the backward pass, treating the function as an identity for the purpose of gradient calculation (a sketch follows this list).
  • DoReFa-Net (Zhou, Wu, Ni, et al. 2018) extended these ideas by proposing methods to quantize not only the weights but also the activations and gradients to low bit-widths. This allowed the entire forward and backward pass to be accelerated using efficient bitwise operations, further increasing computational efficiency.
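
Here is a hand-rolled numpy sketch of the STE trick on a toy binarized linear model, with the backward pass written out explicitly rather than relying on an autograd framework. The whole setup (shapes, learning rate, target) is my own toy construction; the point is the single line where the gradient with respect to the binary weights is applied directly to the latent real-valued weights, as if the sign function were the identity.

```python
import numpy as np

# Toy binarized linear model trained with the straight-through estimator (STE);
# a mechanism sketch, not a faithful BinaryConnect reproduction.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)       # latent real-valued weights
X = rng.normal(size=(8, 32)).astype(np.float32)
Y_target = np.sign(rng.normal(size=(4, 8))) @ X      # a target a binary net can fit
lr = 0.05

for step in range(300):
    Wb = np.sign(W)                           # forward: binarize to {-1, +1}
    Y = Wb @ X
    grad_Y = 2 * (Y - Y_target) / Y.size      # d(MSE)/dY
    grad_Wb = grad_Y @ X.T                    # gradient w.r.t. the *binary* weights
    # STE: pretend sign() was the identity, so the latent weights receive
    # grad_Wb unchanged; clip them to [-1, 1] as BinaryConnect does.
    W = np.clip(W - lr * grad_Wb, -1.0, 1.0)

# Loss should end up small as the latent weights flip toward the right signs.
print(np.mean((np.sign(W) @ X - Y_target) ** 2))
```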

3 Low-bit integers

Sometimes you want a couple more steps than “on” and “off”.

Jacob et al. (2017) from Google seem to have introduced many of the things I see used in practice. This work detailed methods for both post-training quantization (PTQ), where a fully trained floating-point model is converted to an integer-based one with minimal effort, and quantization-aware training (QAT), where the model is fine-tuned or trained from scratch to simulate the effects of quantization, often leading to higher accuracy.
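
The QAT half of that recipe hinges on “fake quantization”: during training the tensors stay in floating point, but the forward pass snaps them to the integer grid they will occupy at inference time, so the loss sees the rounding error. Here is a minimal per-tensor, symmetric version (names of my own invention, not a library API); in a full QAT setup the gradient is passed through the rounding via the STE, as in the binary sketch above.

```python
import numpy as np

# "Fake quantization" forward pass: quantize then immediately dequantize,
# so the tensor stays float32 but lives on the int8 grid.

def fake_quantize(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax          # per-tensor scale, zero-point 0
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

w = np.random.randn(256).astype(np.float32)
w_fq = fake_quantize(w)
print(np.abs(w - w_fq).max())   # bounded by roughly scale / 2
```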

3.1 Quantizing Weights, Activations, or Both?

We didn’t say what was quantized!

  • Weight-only quantization primarily reduces the model’s storage size. Since the activations remain in floating point, the computational speedup is limited: the weights typically have to be converted back to floating point before the matrix multiplications, so the arithmetic itself still runs in float (a sketch follows this list).
  • Weight and activation quantization provides the most significant benefits for inference latency. When both weights and activations are represented as integers, the core matrix multiplication operations can be performed entirely using integer arithmetic, which is much faster and more energy-efficient on compatible hardware. This is the approach detailed by Jacob et al. (2017) and seems to be the standard for achieving maximum inference performance in industry these days.
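
For contrast with the integer-only matmul sketched earlier, here is what weight-only quantization buys you in this toy setting: the stored weights shrink by 4x, but they are dequantized back to float32 before the matmul, so the arithmetic is unchanged. (Again, illustrative helper names, not a library API.)

```python
import numpy as np

# Weight-only quantization sketch: int8 storage, float32 compute.

def sym_quantize(x, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

W = np.random.randn(4, 8).astype(np.float32)
X = np.random.randn(8, 3).astype(np.float32)

Wq, s_w = sym_quantize(W)                # this is what would live on disk / in RAM
W_dequant = Wq.astype(np.float32) * s_w  # back to float before computing
Y = W_dequant @ X                        # the matmul is still float32
print(W.nbytes, Wq.nbytes)               # 128 bytes vs 32 bytes
```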

4 References

Bai, Wang, and Liberty. 2019. “ProxQuant: Quantized Neural Networks via Proximal Operators.”
Courbariaux, Bengio, and David. 2016. “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations.”
Hubara, Courbariaux, Soudry, et al. 2018. “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations.” Journal of Machine Learning Research.
Jacob, Kligys, Chen, et al. 2017. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.”
Linde, Buzo, and Gray. 1980. “An Algorithm for Vector Quantizer Design.” IEEE Transactions on Communications.
Meng, Bachmann, and Khan. 2020. “Training Binary Neural Networks Using the Bayesian Learning Rule.” In Proceedings of the 37th International Conference on Machine Learning.
Zhou, Wu, Ni, et al. 2018. “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.”