Implementing neural nets

2016-10-14 — 2025-01-05

computers are awful

machine learning

neural nets

optimization

Suspiciously similar content

1 HOWTOs

The internet is full of guides to training neural nets. Here are some selected highlights.

Michael Nielson has a free online textbook with code examples in Python. Christopher Olah’s visual explanations make many things clear.

Andrej’s popular unromantic messy guide to training neural nets in practice has a lot of tips that people tend to rediscover the hard way if they do not get them from him. (I did)

It is allegedly easy to get started with training neural nets. Numerous libraries and frameworks take pride in displaying 30-line miracle snippets that solve your data problems, giving the (false) impression that this stuff is plug and play. … Unfortunately, neural nets are nothing like that. They are not “off-the-shelf” technology the second you deviate slightly from training an ImageNet classifier.

Eugene Vinitsky’s quick software tips
Alice’s Adventures in a Differentiable Wonderland (Scardapane 2024)

Neural networks surround us, in the form of large language models, speech transcription systems, molecular discovery algorithms, robotics, and much more. Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming.

This primer is an introduction to this fascinating field imagined for someone, like Alice, who has just ventured into this strange differentiable wonderland. I overview the basics of optimising a function via automatic differentiation, and a selection of the most common designs for handling sequences, graphs, texts, and audios. The focus is on an intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and leaving the reader capable of understanding some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures.
Understanding Deep Learning (Prince 2023)
Dive into Deep Learning (Zhang et al. 2023)

Interactive deep learning book with code, math, and discussions

Implemented with PyTorch, NumPy/MXNet, JAX, and TensorFlow

Adopted at 500 universities from 70 countries

Source code at d2l-ai/d2l-en. They are no longer distributing the book as a PDF, but you can build it yourself

2 Profiling and performance optimisation

Start with general Python profilers; many of them have NN affordances now.

2.1 Compiled

See edge ML for a discussion of compiled NNs.

3 Tracking experiments

See experiment tracking in ML.

4 Configuring experiments

See configuring experiments; in practice I use Hydra for everything, but pyrallis looks good too.

5 Managing axes

A lot of the time managing deep learning is remembering which axis is which.

Noam Shazeer argues that Shape Suffixes:

Designate a system of single-letter names for logical dimensions, e.g. B for batch size, L for sequence length, etc., and document it somewhere in your file/project/codebase

When known, the name of a tensor should end in a dimension-suffix composed of those letters, e.g. input_token_id_BL for a two-dimensional tensor with batch and length dimensions.

That’s all.

In combination with the found Einstein convention this seems to solve all problems I have.

However, there are more heavily-engineered alternatives. Alexander Rush argues for NamedTensor. Implementations:

6 Scaling up

See Gradient Descent at Scale.

7 Incoming

8 Pre-computed/trained models

These are all hopelessly outdated now, in the era of HuggingFace.

Caffe format:
- The Caffe Zoo has lots of nice models, pre-trained on their wiki
- Here’s a great CV one, Andrej Karpathy’s image captioner, Neuraltalk2
for the NVC dataset: — pre-trained feature model here)
AlexNet
For Lasagne: https://github.com/Lasagne/Recipes/tree/master/modelzoo
For Keras:
- convnets-keras
- vgg 16-keras
- Keras VGG and weights.

9 NN Software

This choice is becoming less relevant in the era of easy translation via llms. I have used

PyTorch
Julia
JAX
Occasionally, reluctantly, TensorFlow

I could use any of the other autodiff systems, such as…

Theano (Python) (now defunct) was a trailblazer
Torch (Lua) — in practice deprecated in favour of PyTorch
Caffe was popular for a while; have not seen it recently (MATLAB/Python)
PaddlePaddle is one of Baidu’s NN properties (Python/C++)
MindSpore is Huawei’s framework based on source transformation autodiff, targets interesting edge hardware.
JavaScript: see JavaScript machine learning

10 References

Prince. 2023. Understanding Deep Learning.

Scardapane. 2024. Alice’s Adventures in a differentiable wonderland: A primer on designing neural networks.

Zhang, Lipton, Li, et al. 2023. Dive into Deep Learning.