Discretizing and quantizing neural nets

2025-07-25 — 2026-04-27

Wherein affine mappings between floating-point and integer representations are described, and post-training quantization is distinguished from quantization-aware training with the Straight-Through Estimator.

edge computing
machine learning
model selection
neural nets
sparser than thou

Quantization, in the general sense, maps a continuous or large set of values to a smaller, discrete set. The notion has roots in signal processing and information theory; Vector Quantization (VQ) emerged in the late 1970s and early 1980s, e.g. the Linde-Buzo-Gray algorithm (Linde, Buzo, and Gray 1980). VQ represents vectors from a continuous space using a finite set of prototype vectors from a “codebook,” typically designed by clustering a large sample of data. The objective is to minimise representation error (distortion) for a given codebook size (rate) — the rate-distortion trade-off at the centre of information theory.

The way we quantize NNs has its own bloody-minded reinventions, as usual. Instead of arbitrary codebooks, we typically use structured ones. Converting float32 weights and activations to int8 (or to binary values) is scalar quantization: we partition a continuous interval of real numbers into a finite set of ordered bins. The motivations are speed (specialised hardware) and storage size (see also compression).

Making neural networks small and fast, which is the point of this technique, is a high-value problem, and therefore lots of money is pouring into it. I am almost certainly missing some recent developments.

1 Uniform vs logarithmic

Scalar quantization comes in two main families.

  • Uniform quantization divides the interval into bins of equal width. The affine mapping below is the canonical example. Simple, computationally cheap, the default.
  • Logarithmic quantization uses bins narrower near zero and wider further out. Useful when the value distribution is non-uniform — NN weights cluster around zero, audio signals span many orders of magnitude. We rarely write down “logarithmic quantization” explicitly in NN papers, but it effectively happens in low-bit float types like bfloat16 and fp8. A small numeric sketch of both families follows this list.
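To make the contrast concrete, here is a toy numpy sketch of the two families: an equal-width quantizer next to a μ-law-style logarithmic one. The level count and the μ constant are illustrative choices, not taken from any particular paper.

```python
import numpy as np

def uniform_quantize(x, n_levels=16, lo=-1.0, hi=1.0):
    """Equal-width bins over [lo, hi]."""
    step = (hi - lo) / (n_levels - 1)
    q = np.round((np.clip(x, lo, hi) - lo) / step)  # integer bin index
    return lo + q * step                            # dequantized value

def mu_law_quantize(x, n_levels=16, mu=255.0):
    """Log-style quantization: compress with mu-law, quantize uniformly,
    then expand. The bins end up narrow near zero and wide near +/-1."""
    x = np.clip(x, -1.0, 1.0)
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    compressed_q = uniform_quantize(compressed, n_levels)
    return np.sign(compressed_q) * np.expm1(np.abs(compressed_q) * np.log1p(mu)) / mu

x = np.array([-0.9, -0.1, -0.01, 0.0, 0.01, 0.1, 0.9])
print(uniform_quantize(x))  # coarse near zero
print(mu_law_quantize(x))   # fine near zero, coarse at the extremes
```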

2 Affine mapping

The default in NN quantization. Map a floating-point range \([r_{min}, r_{max}]\) to a range of \(B\)-bit integers via a scale factor \(S\) (a float) and a zero-point \(Z\) (an integer). \(S\) is the step size; \(Z\) ensures the real value zero maps exactly to an integer.

Quantize: \[ q = \text{round}(r/S + Z) \]

Dequantize: \[ r' = S(q - Z) \]

\(r'\) approximates the original \(r\); the quantization error is \(r - r'\).
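A minimal numpy sketch of this mapping for signed 8-bit integers; the helper names and the example range are mine, not lifted from any particular library.

```python
import numpy as np

def affine_params(r_min, r_max, q_min=-128, q_max=127):
    """Pick scale S and zero-point Z so [r_min, r_max] covers [q_min, q_max]
    and the real value 0.0 lands exactly on an integer (namely Z)."""
    S = (r_max - r_min) / (q_max - q_min)
    Z = int(round(q_min - r_min / S))
    return S, Z

def quantize(r, S, Z, q_min=-128, q_max=127):
    return np.clip(np.round(r / S + Z), q_min, q_max).astype(np.int8)

def dequantize(q, S, Z):
    return S * (q.astype(np.float32) - Z)

r = np.array([-0.7, 0.0, 0.3, 1.1], dtype=np.float32)
S, Z = affine_params(r.min(), r.max())
q = quantize(r, S, Z)
print(q, dequantize(q, S, Z))  # 0.0 round-trips exactly; the rest approximately
```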

For a typical linear layer we want \(Y = WX + b\), where \(W\) is the weight matrix, \(X\) are the input activations, and \(b\) is the bias. With weights and activations quantized to integers (\(W_q\), \(X_q\)): \[ Y_q = W_q X_q + b_q \] This runs in integer arithmetic, which is fast on most processors. The output’s scale factor is computed from the inputs’, and we convert back to float if a downstream layer needs it. Many corner cases — overflow, requantization, asymmetric ranges — I’m eliding here; see Jacob et al. (2017) for the gory detail.
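An illustrative version of that integer matmul, using symmetric quantization (zero-points fixed at zero) precisely to dodge those corner cases; the int8 inputs and int32 accumulator mimic what integer kernels do in hardware, and the scale bookkeeping is the bare minimum.

```python
import numpy as np

def sym_quantize(r, n_bits=8):
    """Symmetric per-tensor quantization: zero-point is 0, S sized so max|r| fits."""
    q_max = 2 ** (n_bits - 1) - 1
    S = np.abs(r).max() / q_max
    return np.round(r / S).astype(np.int8), S

rng = np.random.default_rng(0)
W, X = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
b = rng.normal(size=(4, 1))

W_q, S_W = sym_quantize(W)
X_q, S_X = sym_quantize(X)

# The matmul itself runs in integer arithmetic; the output scale is just
# the product of the input scales, and we convert back to float afterwards.
acc = W_q.astype(np.int32) @ X_q.astype(np.int32)
Y_approx = S_W * S_X * acc + b
print(np.abs(Y_approx - (W @ X + b)).max())  # small quantization error
```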

3 What gets quantized

Orthogonal to bit-width: which tensors do we quantize?

  • Weight-only quantization shrinks the model on disk and in memory. Activations stay in floating point, and the matrix multiplication still pays a per-call dequantization cost (int weights → float, then float-float matmul).
  • Weight and activation quantization — full-stack. With both as integers, the matmul itself runs entirely in integer arithmetic, which is faster and more energy-efficient on compatible hardware.

For LLMs, whose output is a discrete token distribution anyway, full-stack feels well-matched. AFAICT it’s also where most production systems are headed.

4 When to quantize

Do we quantize a fully-trained float model, or train with quantization in the loop? Terms of art: post-training (PTQ) vs quantization-aware training (QAT).

Jacob et al. (2017) distinguishes:

  • Post-training quantization (PTQ) — convert a trained float model with minimal effort. Cheap, sometimes lossy.
  • Quantization-aware training (QAT) — fine-tune (or train from scratch) with simulated quantization in the forward pass, so the network learns weights that round nicely. More effort, usually higher accuracy at low bit-widths.

QAT is where we start working with discrete gradients: if the forward pass rounds, the backward pass needs something to differentiate through, e.g. the Straight-Through Estimator (STE).
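A minimal PyTorch sketch of simulated (“fake”) quantization with the STE: the forward pass rounds, and the backward pass treats rounding as the identity via the detach trick. The symmetric 8-bit scheme and the wrapper function are my own illustrative choices, not a particular framework’s API.

```python
import torch

def fake_quantize(x, n_bits=8):
    """Forward: quantize then dequantize. Backward: the gradient skips the
    rounding entirely, because (x_q - x).detach() contributes no gradient."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = x.detach().abs().max() / q_max
    x_q = torch.round(x / scale).clamp(-q_max - 1, q_max) * scale
    return x + (x_q - x).detach()  # value of x_q, gradient of x

w = torch.randn(5, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()
print(w.grad)  # all ones: the STE treats d(round)/dw as the identity
```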

5 Bit-budget regimes

5.1 Low-bit integers

The middle ground: 8-bit, 4-bit, sometimes 2-bit or ternary integers. Most production pipelines I’ve seen live in this band.

Ternary quantization — three levels, typically \(\{-1, 0, +1\}\) — sounds attractive: the extra zero level buys sparsity on top of near-binary storage cost. PrismML’s Bonsai is one example I’ve encountered.
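As an illustration of what ternary quantization involves (not Bonsai’s recipe specifically), here is a threshold-based numpy sketch: zero out the small weights, keep only the sign of the rest, and fit a single scale. The 0.7 × mean-absolute-weight threshold is a common heuristic, assumed here for concreteness.

```python
import numpy as np

def ternarize(w, thresh_factor=0.7):
    """Map weights to {-alpha, 0, +alpha}: small weights become zero,
    the rest keep only their sign, scaled by one fitted constant alpha."""
    delta = thresh_factor * np.abs(w).mean()
    t = np.sign(w) * (np.abs(w) > delta)  # codes in {-1, 0, +1}
    mask = t != 0
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * t, t.astype(np.int8)

w = np.random.default_rng(1).normal(size=(4, 4))
w_ternary, codes = ternarize(w)
print(codes)      # three levels, storable in 2 bits per weight
print(w_ternary)  # the dequantized approximation of w
```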

5.2 Binary networks

The limit case: one bit per weight (or per activation).

  • BinaryConnect (Courbariaux, Bengio, and David 2016) quantizes weights to \(\{-1, +1\}\) during forward and backward passes. Deep networks still train under this extreme constraint, which acts as a regularizer; a training-loop sketch follows this list.
  • DoReFa-Net (Zhou et al. 2018) extends binarization to weights, activations and gradients at low bit-widths, so the entire forward and backward pass can run on bitwise operations.
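To show what “training under the constraint” looks like, here is a toy BinaryConnect-style loop in PyTorch: binarize the weights for the forward and backward pass, but apply the gradient updates to real-valued master weights, which are clipped to \([-1, 1]\). The regression task, learning rate, and loop length are made up for illustration.

```python
import torch

w_real = torch.randn(10, 1)                     # real-valued master weights
x, y = torch.randn(64, 10), torch.randn(64, 1)  # toy regression data

for step in range(100):
    w_real.requires_grad_(True)
    # Binarize to {-1, +1} for the forward pass; the detach trick (an STE)
    # lets the gradient flow back to the real-valued weights.
    w_bin = w_real + (torch.sign(w_real) - w_real).detach()
    loss = ((x @ w_bin - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        # Update the master weights and clip them to [-1, 1].
        w_real = (w_real - 0.01 * w_real.grad).clamp(-1.0, 1.0)

print(loss.item())
```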

6 References

Bai, Wang, and Liberty. 2019. “ProxQuant: Quantized Neural Networks via Proximal Operators.”
Courbariaux, Bengio, and David. 2016. “BinaryConnect: Training Deep Neural Networks with Binary Weights During Propagations.”
Hubara, Courbariaux, Soudry, et al. 2018. “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations.” Journal of Machine Learning Research.
Jacob, Kligys, Chen, et al. 2017. “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference.”
Linde, Buzo, and Gray. 1980. “An Algorithm for Vector Quantizer Design.” IEEE Transactions on Communications.
Meng, Bachmann, and Khan. 2020. “Training Binary Neural Networks Using the Bayesian Learning Rule.” In Proceedings of the 37th International Conference on Machine Learning.
Zhou, Wu, Ni, et al. 2018. “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.”