Discretizing and quantizing neural nets

2025-07-25 — 2025-07-25

Wherein the affine mapping of float32 ranges to 8‑bit integers via a scale factor and zero‑point is described, integer‑only matrix multiplies for inference are delineated, and post‑training versus quantization‑aware training are noted.

edge computing

machine learning

model selection

neural nets

sparser than thou

1 Origin story

Quantization, in a general sense, is the process of mapping a continuous or large set of values to a smaller, discrete set. The concept has roots in signal processing and information theory; Vector Quantization (VQ) emerged in the late 1970s and early 1980s, with things like the Linde-Buzo-Gray (LBG) algorithm (Linde, Buzo, and Gray 1980). VQ represents vectors from a continuous space using a finite set of prototype vectors from a “codebook,” often designed by clustering a large sample of data. The objective is to minimize the representation error (distortion) for a given codebook size (rate), a problem central to information theory’s rate-distortion trade-off.

The practical way we quantize things in NNs is… well, NNs have their own bloody-minded trade-offs and reinventions as usual. Instead of arbitrary codebooks, we typically use structured ones. The common practice of converting 32-bit floating-point (float32) numbers to 8-bit integers (int8) or even binary values is a form of scalar quantization in which the bins are ordered: we partition a continuous interval of real numbers into a finite set of ordered bins. The idea is that these should enable specialized hardware implementations, for speed, and are also just smaller to store, for compression.

As far as I can tell, there are two major types of quantization in practice:

Uniform Quantization: This is the most common approach, where the interval is divided into bins of equal width. The affine mapping method is a form of uniform quantization. It is simple and computationally efficient, making it a default for many applications.
Logarithmic Quantization: In some cases, the distribution of values (like neural network weights, which are often clustered around zero) is highly non-uniform. Here, non-uniform binning can provide a better trade-off between precision and range. For instance, logarithmic quantization, which uses bins that are narrower near zero and wider further away, can represent small values with high accuracy while still covering a large overall range. We see this a lot in audio codecs, and it has nice properties when we want to do “multiplicative things” with our signal. We don’t often see it used explicitly in NNs, but it effectively happens in low-bit float types like bfloat16 and fp8, whose representable values are spaced roughly logarithmically.
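To make the contrast concrete, here is a minimal sketch of the two binning schemes on the interval \([-1, 1]\). The function names, the level counts (16 uniform levels versus 15 power-of-two levels) and the flush-to-zero rule are my own illustrative assumptions, not anything taken from a particular library or codec.

```python
import math

def uniform_quantize(x, num_levels=16, lo=-1.0, hi=1.0):
    """Snap x to the nearest of num_levels equally spaced values in [lo, hi]."""
    step = (hi - lo) / (num_levels - 1)
    x = min(max(x, lo), hi)            # clamp into the representable range
    index = round((x - lo) / step)     # index of the nearest level
    return lo + index * step

def log_quantize(x, num_exponents=7, max_abs=1.0):
    """Snap x to 0 or to sign(x) * max_abs * 2**(-k), k = 0..num_exponents-1."""
    smallest = max_abs * 2.0 ** (-(num_exponents - 1))
    magnitude = min(abs(x), max_abs)
    if magnitude < smallest / 2:       # too small to represent: flush to zero
        return 0.0
    k = round(-math.log2(magnitude / max_abs))   # nearest exponent, in the log domain
    k = min(max(k, 0), num_exponents - 1)
    return math.copysign(max_abs * 2.0 ** (-k), x)

# Compare the two schemes on one small and one large value.
for x in (0.03, 0.7):
    print(x, uniform_quantize(x), log_quantize(x))
```

With a similar number of levels, the logarithmic scheme should come out much closer for the small value and noticeably worse for the large one, which is exactly the precision-versus-range trade described above.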

The default form of quantization in neural networks is the affine mapping method. This uniform technique maps a floating-point range \([r_{min}, r_{max}]\) to a range of \(B\)-bit integers.

The mapping from a float \(r\) to its quantized integer representation \(q\) is defined by a scale factor \(S\) (a float) and a zero-point \(Z\) (an integer). The scale factor defines the step size of the quantization, and the zero-point ensures that the real value zero can be mapped exactly to an integer.

The quantization process is given by: \[ q = \text {round}(r/S + Z) \]

The dequantization back to a real value is: \[ r' = S(q - Z) \]

Here, \(r'\) approximates the original real value \(r\). The quantization error is the difference \(r - r'\).
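The round trip can be sketched in a few lines of plain Python. The helper names (`choose_qparams`, `quantize`, `dequantize`), the unsigned target range \([0, 2^B - 1]\), and the clamping step are assumptions I am adding for illustration; the formulas above do not say how out-of-range values are handled, and saturating them is one common choice.

```python
def choose_qparams(r_min, r_max, num_bits=8):
    """Pick a scale S and zero-point Z mapping [r_min, r_max] onto [0, 2**B - 1]."""
    q_max = 2 ** num_bits - 1
    # Widen the range to include 0.0 so that zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / q_max or 1.0    # fall back to 1.0 for a degenerate range
    zero_point = round(-r_min / scale)
    zero_point = min(max(zero_point, 0), q_max)
    return scale, zero_point

def quantize(r, scale, zero_point, num_bits=8):
    """q = round(r / S + Z), saturated to the integer range."""
    q = round(r / scale + zero_point)
    return min(max(q, 0), 2 ** num_bits - 1)

def dequantize(q, scale, zero_point):
    """r' = S * (q - Z)."""
    return scale * (q - zero_point)

scale, zero_point = choose_qparams(-1.0, 3.0)
for r in (-1.0, -0.3, 0.0, 0.5, 1.7, 3.0):
    q = quantize(r, scale, zero_point)
    print(r, q, dequantize(q, scale, zero_point))
```

Note that the real value 0.0 survives the round trip exactly, which is the job of the zero-point, while every other value picks up an error of at most about one step \(S\).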

For a typical neural network layer, like a linear layer, the computation is a matrix multiplication followed by an addition: \(Y = WX + b\) where \(W\) is the weight matrix, \(X\) are the input activations, and \(b\) is the bias.

When we quantize the weights and activations to integer-only representations (\(W_q\), \(X_q\)), the computation becomes, schematically: \[ Y_q = W_q X_q + b_q \] This operation can be executed using integer arithmetic, which is efficient on many processors. We can probably see many horrible corner cases that could arise, though, right? The zero-points have to be accounted for in the products, the accumulator needs more bits than the 8-bit inputs, and the result lives on a different scale from either input. We use the inputs’ scale factors to compute the output’s scale factor, then convert it back to the floating-point domain if needed.
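Here is a rough sketch of how those pieces fit together for one small linear layer. Everything named here is illustrative: the function `quantized_linear`, the power-of-two scales, and the example values are my assumptions. Following the scheme in Jacob et al. (2017), the bias is stored as a wide integer with scale \(S_W S_X\) and zero-point 0; unlike that paper, the rescaling multiplier \(M = S_W S_X / S_Y\) is kept as a float here for readability rather than implemented in fixed point.

```python
def quantize(r, scale, zero_point, q_min=0, q_max=255):
    """Affine quantization of a single float to an unsigned 8-bit integer."""
    return min(max(round(r / scale + zero_point), q_min), q_max)

def quantized_linear(W_q, X_q, b_q, w_params, x_params, y_params):
    """Compute Y = W X + b from integer inputs, returning 8-bit integer outputs."""
    (S_w, Z_w), (S_x, Z_x), (S_y, Z_y) = w_params, x_params, y_params
    M = S_w * S_x / S_y                     # rescales the accumulator to the output scale
    Y_q = []
    for i in range(len(W_q)):
        row = []
        for j in range(len(X_q[0])):
            acc = b_q[i]                    # wide accumulator (int32 on real hardware)
            for k in range(len(X_q)):
                acc += (W_q[i][k] - Z_w) * (X_q[k][j] - Z_x)
            row.append(min(max(round(M * acc) + Z_y, 0), 255))
        Y_q.append(row)
    return Y_q

# Float reference layer: Y = W X + b = [[0.0], [0.15]]
W = [[0.5, -0.25], [0.75, 0.25]]
X = [[0.2], [0.8]]
b = [0.1, -0.2]

w_params = (2.0 ** -7, 128)   # weights roughly in [-1, 1)
x_params = (2.0 ** -8, 0)     # activations roughly in [0, 1)
y_params = (2.0 ** -7, 128)   # outputs roughly in [-1, 1)

W_q = [[quantize(w, *w_params) for w in row] for row in W]
X_q = [[quantize(x, *x_params) for x in row] for row in X]
b_q = [round(v / (w_params[0] * x_params[0])) for v in b]   # scale S_w * S_x, zero-point 0

Y_q = quantized_linear(W_q, X_q, b_q, w_params, x_params, y_params)
Y_dequantized = [[y_params[0] * (q - y_params[1]) for q in row] for row in Y_q]
print(Y_q, Y_dequantized)
```

The inner loop touches only integers; the single floating-point multiply by \(M\) per output element is the part that real implementations replace with an integer multiply and shift.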

In NNs we want to think about whether to quantize during training or only afterwards. If we quantize during training, we also need to think about how to get gradients through discrete operations.

2 Binary networks

The most aggressive form of quantization is binarization.

BinaryConnect (Courbariaux, Bengio, and David 2016) focused on quantizing only the network weights to binary values (+1 and -1) during the forward and backward passes. It showed that deep neural networks can still be trained effectively despite this extreme weight quantization, which acts as a form of regularization. The main theoretical tool was the Straight-Through Estimator (STE). Since the sign function used for binarization has zero gradient almost everywhere, backpropagating through it naively would stop learning altogether. The STE bypasses this by passing the gradient from the sign function’s output straight to its input during the backward pass, treating the function as an identity for gradient calculation.
DoReFa-Net (Zhou et al. 2018) extended this by quantizing weights, activations and gradients to low bit-widths. That lets the whole forward and backward pass be accelerated with efficient bitwise operations, further improving computational efficiency.
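The STE is easier to believe once we watch it work. Below is a toy sketch, not the BinaryConnect training recipe itself: a single linear neuron whose forward pass uses binarized weights, while real-valued "latent" weights receive the gradient as if the sign function were the identity. The function names, the data, the learning rate and the clipping of latent weights to \([-1, 1]\) are all my own assumptions for illustration.

```python
def sign(w):
    """Binarize to +1 or -1 (zero maps to +1 by convention here)."""
    return 1.0 if w >= 0 else -1.0

def train_binary_neuron(data, latent, lr=0.05, epochs=20):
    """SGD on squared error for y = sum_i sign(w_i) * x_i, using the STE."""
    latent = list(latent)
    for _ in range(epochs):
        for x, target in data:
            binary = [sign(w) for w in latent]        # forward pass uses binary weights
            y = sum(b * xi for b, xi in zip(binary, x))
            err = y - target
            for i, xi in enumerate(x):
                grad_binary = 2.0 * err * xi          # d loss / d binary weight
                grad_latent = grad_binary             # STE: treat d sign(w) / dw as 1
                latent[i] -= lr * grad_latent
                latent[i] = min(max(latent[i], -1.0), 1.0)   # keep latent weights bounded
    return latent

# Targets are generated by a "true" binary weight vector that training should recover.
true_weights = [1.0, -1.0, 1.0]
inputs = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
data = [(x, sum(w * xi for w, xi in zip(true_weights, x))) for x in inputs]

latent = train_binary_neuron(data, latent=[0.1, 0.1, 0.1])
print([sign(w) for w in latent])
```

The latent weights start out all positive, so the second binary weight is initially wrong; the straight-through gradient pushes its latent value across zero, at which point the binarized weight flips and the error vanishes. With the true derivative of the sign function, that update would have been zero and nothing would ever move.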

3 Low-bit integers

Sometimes we want a couple more steps than “on” and “off”.

Jacob et al. (2017), from Google, seem to have introduced many of the ideas I see used in practice. This work detailed methods for both post-training quantization (PTQ), where a fully trained floating-point model is converted to an integer-based one with minimal effort, and quantization-aware training (QAT), where the model is fine-tuned or trained from scratch while simulating the effects of quantization, which often leads to higher accuracy.
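The two regimes share one building block, which I will sketch here under my own naming: a "fake quantization" function that quantizes and immediately dequantizes, so values stay floats but can only take \(2^B\) distinct levels. In a PTQ-style workflow we would estimate the range from a small calibration set and apply the function once to a trained model; in a QAT-style workflow the same function sits inside the forward pass during training, with gradients typically passed through it using a straight-through estimator. `calibrate_range` and `fake_quantize` are hypothetical helpers, and min/max calibration is only one of several possible range estimators.

```python
def calibrate_range(samples):
    """PTQ-style calibration: record the min and max seen on a calibration set."""
    return min(samples), max(samples)

def fake_quantize(r, r_min, r_max, num_bits=8):
    """Quantize then dequantize, so r stays a float restricted to 2**num_bits levels."""
    q_max = 2 ** num_bits - 1
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)   # make sure 0.0 is representable
    scale = (r_max - r_min) / q_max
    if scale == 0.0:                                  # degenerate range: everything is zero
        return 0.0
    zero_point = round(-r_min / scale)
    q = min(max(round(r / scale + zero_point), 0), q_max)
    return scale * (q - zero_point)

# Pretend these are activations recorded while running a trained model on sample inputs.
calibration_activations = [-0.4, 0.1, 0.9, 1.6, 0.3]
r_min, r_max = calibrate_range(calibration_activations)

for r in (0.3, 1.0, 5.0):
    print(r, fake_quantize(r, r_min, r_max))
```

Values inside the calibrated range come back with a small rounding error, while a value like 5.0 that never showed up during calibration saturates at the top of the range. That saturation is one reason the choice of calibration data matters for PTQ.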

3.1 Quantizing Weights, Activations, or Both?

We didn’t say what was quantized!

Weight-only quantization primarily reduces the model’s storage size. Since the activations remain in floating-point, the computational speedup is limited: the integer weights have to be converted back to floating-point, and the matrix multiplications themselves still run in floating-point arithmetic.
Weight and activation quantization provides the most significant benefits for inference latency. When both weights and activations are represented as integers, the core matrix multiplication operations can be performed entirely using integer arithmetic, which is much faster and more energy-efficient on compatible hardware. This is the approach detailed by Jacob et al. (2017) and seems to be the standard for achieving maximum inference performance in industry today.

4 References

Bai, Wang, and Liberty. 2019. “ProxQuant: Quantized Neural Networks via Proximal Operators.”

Courbariaux, Bengio, and David. 2016. “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations.”

Hubara, Courbariaux, Soudry, et al. 2018. “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations.” Journal of Machine Learning Research.

Jacob, Kligys, Chen, et al. 2017. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.”

Linde, Buzo, and Gray. 1980. “An Algorithm for Vector Quantizer Design.” IEEE Transactions on Communications.

Meng, Bachmann, and Khan. 2020. “Training Binary Neural Networks Using the Bayesian Learning Rule.” In Proceedings of the 37th International Conference on Machine Learning.

Zhou, Wu, Ni, et al. 2018. “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.”