Implementing neural nets

October 14, 2016 — January 5, 2025

computers are awful
machine learning
neural nets
optimization

1 HOWTOs

The internet is full of guides to training neural nets. Here are some selected highlights.

Michael Nielsen has a free online textbook with code examples in Python. Christopher Olah’s visual explanations make many things clear.

Andrej Karpathy’s popular, unromantic, messy guide to training neural nets in practice has a lot of tips that people tend to rediscover the hard way if they do not get them from him (I did).

It is allegedly easy to get started with training neural nets. Numerous libraries and frameworks take pride in displaying 30-line miracle snippets that solve your data problems, giving the (false) impression that this stuff is plug and play. … Unfortunately, neural nets are nothing like that. They are not “off-the-shelf” technology the second you deviate slightly from training an ImageNet classifier.

  • Eugene Vinitsky’s quick software tips

  • Alice’s Adventures in a Differentiable Wonderland (Scardapane 2024)

    Neural networks surround us, in the form of large language models, speech transcription systems, molecular discovery algorithms, robotics, and much more. Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming.

    This primer is an introduction to this fascinating field imagined for someone, like Alice, who has just ventured into this strange differentiable wonderland. I overview the basics of optimising a function via automatic differentiation, and a selection of the most common designs for handling sequences, graphs, texts, and audios. The focus is on an intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and leaving the reader capable of understanding some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures.

  • Understanding Deep Learning (Prince 2023)

  • Dive into Deep Learning (Zhang et al. 2023)

    Interactive deep learning book with code, math, and discussions

    Implemented with PyTorch, NumPy/MXNet, JAX, and TensorFlow

    Adopted at 500 universities from 70 countries

    Source code is at d2l-ai/d2l-en. The authors no longer distribute the book as a PDF, but you can build it yourself.

2 Profiling and performance optimisation

Start with general Python profilers; many of them have NN affordances now.
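As a minimal sketch of that first step, the standard-library `cProfile` and `pstats` modules can rank functions by cumulative time. The toy network here (`dense_layer`, `forward`, `profile_forward`) is hypothetical and written in pure Python so that it runs without any framework; in a real project the profiled call would be a training or inference step.

```python
import cProfile
import io
import pstats
import random


def dense_layer(x, weights):
    """Toy dense layer with ReLU, in pure Python so the sketch has no dependencies."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def forward(x, layers):
    """Apply each layer in turn."""
    for weights in layers:
        x = dense_layer(x, weights)
    return x


def profile_forward(n_features=64, n_layers=3, n_calls=50, seed=0):
    """Profile repeated forward passes and return the stats report as text."""
    rng = random.Random(seed)
    layers = [
        [[rng.gauss(0.0, 0.1) for _ in range(n_features)] for _ in range(n_features)]
        for _ in range(n_layers)
    ]
    x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        forward(x, layers)
    profiler.disable()

    # Sort by cumulative time and keep the ten most expensive entries.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_forward())
```

One caveat: a CPU-side profiler like this only sees Python-level time. When work is dispatched asynchronously to an accelerator, the time may be attributed to whichever call happens to wait for the result, so framework-specific profilers are usually the next step.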

2.1 Compiled

See edge ML for a discussion of compiled NNs.

3 Tracking experiments

See experiment tracking in ML.

4 Configuring experiments

See configuring experiments; in practice I use Hydra for everything, but pyrallis looks good too.

5 Managing axes

A lot of the time, managing deep learning code comes down to remembering which axis is which.

Noam Shazeer argues for shape suffixes:

  • Designate a system of single-letter names for logical dimensions, e.g. B for batch size, L for sequence length, etc., and document it somewhere in your file/project/codebase
  • When known, the name of a tensor should end in a dimension-suffix composed of those letters, e.g. input_token_id_BL for a two-dimensional tensor with batch and length dimensions.
  • That’s all.

In combination with the Einstein summation convention, this seems to solve all the problems I have.
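Here is a small sketch of how shape suffixes and `einsum` fit together, using NumPy. The letters B and L follow the convention quoted above; D (model width) and V (vocabulary size), along with the function name `embed_and_score`, are my own assumptions for illustration.

```python
import numpy as np

# Dimension key (documented once per file or project):
#   B: batch size
#   L: sequence length
#   D: model width        (assumed for this sketch)
#   V: vocabulary size    (assumed for this sketch)


def embed_and_score(input_token_id_BL, embedding_VD, output_DV):
    """Look up token embeddings, then project back to vocabulary logits."""
    # Indexing a (V, D) table with a (B, L) integer array gives (B, L, D).
    x_BLD = embedding_VD[input_token_id_BL]
    # The einsum subscripts repeat the suffixes, so the contraction over D is explicit.
    logits_BLV = np.einsum("BLD,DV->BLV", x_BLD, output_DV)
    return logits_BLV


rng = np.random.default_rng(0)
B, L, D, V = 2, 5, 4, 11
input_token_id_BL = rng.integers(0, V, size=(B, L))
embedding_VD = rng.normal(size=(V, D))
output_DV = rng.normal(size=(D, V))

logits_BLV = embed_and_score(input_token_id_BL, embedding_VD, output_DV)
assert logits_BLV.shape == (B, L, V)
```

Because each name carries its shape, a mismatched contraction tends to look wrong on the page before it fails at runtime: the suffix of each operand should match its `einsum` subscript letter for letter.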

However, there are more heavily-engineered alternatives. Alexander Rush argues for NamedTensor. Implementations:

6 Scaling up

See Gradient Descent at Scale.

7 Incoming

8 Pre-computed/trained models

These are all hopelessly outdated now, in the era of HuggingFace.

9 NN Software

This choice is becoming less relevant in the era of easy translation via LLMs. I have used

I could use any of the other autodiff systems, such as…

10 References