Implementing neural nets
October 14, 2016 — January 5, 2025
1 HOWTOs
The internet is full of guides to training neural nets. Here are some selected highlights.
Michael Nielsen has a free online textbook with code examples in Python. Christopher Olah’s visual explanations make many things clear.
Andrej Karpathy’s popular, unromantic, messy guide to training neural nets in practice has a lot of tips that people tend to rediscover the hard way if they do not get them from him. (I did.)
It is allegedly easy to get started with training neural nets. Numerous libraries and frameworks take pride in displaying 30-line miracle snippets that solve your data problems, giving the (false) impression that this stuff is plug and play. … Unfortunately, neural nets are nothing like that. They are not “off-the-shelf” technology the second you deviate slightly from training an ImageNet classifier.
Alice’s Adventures in a Differentiable Wonderland (Scardapane 2024)
Neural networks surround us, in the form of large language models, speech transcription systems, molecular discovery algorithms, robotics, and much more. Stripped of anything else, neural networks are compositions of differentiable primitives, and studying them means learning how to program and how to interact with these models, a particular example of what is called differentiable programming.
This primer is an introduction to this fascinating field imagined for someone, like Alice, who has just ventured into this strange differentiable wonderland. I overview the basics of optimising a function via automatic differentiation, and a selection of the most common designs for handling sequences, graphs, texts, and audios. The focus is on an intuitive, self-contained introduction to the most important design techniques, including convolutional, attentional, and recurrent blocks, hoping to bridge the gap between theory and code (PyTorch and JAX) and leaving the reader capable of understanding some of the most advanced models out there, such as large language models (LLMs) and multimodal architectures.
Dive into Deep Learning (Zhang et al. 2023)
- Interactive deep learning book with code, math, and discussions
- Implemented with PyTorch, NumPy/MXNet, JAX, and TensorFlow
- Adopted at 500 universities from 70 countries
- Source code at d2l-ai/d2l-en. They are no longer distributing the book as a PDF, but you can build it yourself.
2 Profiling and performance optimisation
Start with general Python profilers; many of them have NN affordances now.
Monitor & Improve GPU Usage for Model Training on Weights & Biases
PyTorch profilers
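As a minimal sketch of the "start with a general Python profiler" step, here is the standard library's cProfile wrapped around a toy forward pass. The network (`dense_layer`, `forward`) is a made-up pure-Python stand-in so that the example runs anywhere; in real use the profiled call would be a training or inference step, and a framework-specific profiler would be the next stop for GPU-side detail.

```python
import cProfile
import io
import pstats
import random


def dense_layer(x, weights):
    """Naive dense layer: one dot product per output unit, then ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def forward(x, layers):
    """Run the input through each layer in turn."""
    for weights in layers:
        x = dense_layer(x, weights)
    return x


def profile_forward(n_calls=20, width=64, depth=3, seed=0):
    """Profile repeated forward passes and return the stats report as text."""
    rng = random.Random(seed)
    layers = [
        [[rng.gauss(0.0, 0.1) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        forward(x, layers)
    profiler.disable()

    # Sort by cumulative time and keep only the five most expensive entries.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


report = profile_forward()
print(report)
```

Sorting by cumulative time shows which high-level call dominates; sorting by "tottime" instead shows where the time is spent inside individual functions.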
2.1 Compiled
See edge ML for a discussion of compiled NNs.
3 Tracking experiments
4 Configuring experiments
See configuring experiments; in practice I use Hydra for everything, but pyrallis looks good too.
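To make the idea concrete without depending on any particular tool, here is a rough standard-library sketch of what structured configs with dotted command-line overrides look like. It does not use Hydra or pyrallis; `TrainConfig`, `OptimConfig` and `apply_overrides` are hypothetical names invented for the illustration, and the real libraries add much more (config files, composition, sweeps).

```python
from dataclasses import dataclass, field


@dataclass
class OptimConfig:
    lr: float = 1e-3
    weight_decay: float = 0.0


@dataclass
class TrainConfig:
    epochs: int = 10
    batch_size: int = 32
    optim: OptimConfig = field(default_factory=OptimConfig)


def apply_overrides(cfg, overrides):
    """Apply 'dotted.key=value' strings to a nested dataclass in place."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        target = cfg
        for name in parents:
            target = getattr(target, name)
        # getattr raises AttributeError on unknown keys, which catches typos.
        current = getattr(target, leaf)
        # Cast the string to the type of the existing default.
        # (Good enough for int/float/str; bool would need special handling.)
        setattr(target, leaf, type(current)(raw_value))
    return cfg


# Stand-in for something like sys.argv[1:].
cfg = apply_overrides(TrainConfig(), ["optim.lr=0.01", "epochs=3"])
print(cfg)
```

The point of the pattern is that every experiment is fully described by the defaults plus a short list of overrides, which is easy to log alongside the results.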
5 Managing axes
A lot of the time managing deep learning is remembering which axis is which.
Noam Shazeer argues for Shape Suffixes:
- Designate a system of single-letter names for logical dimensions, e.g. B for batch size, L for sequence length, etc., and document it somewhere in your file/project/codebase.
- When known, the name of a tensor should end in a dimension-suffix composed of those letters, e.g. input_token_id_BL for a two-dimensional tensor with batch and length dimensions.
- That’s all.
In combination with the Einstein summation convention, this seems to solve all the problems I have.
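A small sketch of how shape suffixes and einsum read together, using NumPy so it runs on a CPU. The dimension letters beyond B and L, the tensor names and the toy attention computation are my own choices for illustration, not taken from Shazeer's note.

```python
import numpy as np

# Dimension key (documented once per file):
#   B: batch size, L: query length, M: key length,
#   D: model width, H: number of heads, K: per-head width
B, L, D, H, K = 2, 5, 8, 4, 2

rng = np.random.default_rng(0)
input_BLD = rng.normal(size=(B, L, D))
w_q_DHK = rng.normal(size=(D, H, K))
w_k_DHK = rng.normal(size=(D, H, K))

# The einsum subscripts can be read straight off the name suffixes.
query_BLHK = np.einsum("bld,dhk->blhk", input_BLD, w_q_DHK)
key_BLHK = np.einsum("bld,dhk->blhk", input_BLD, w_k_DHK)

# Two length axes meet here, so the key-side one gets its own letter, M.
logits_BHLM = np.einsum("blhk,bmhk->bhlm", query_BLHK, key_BLHK) / np.sqrt(K)

# Softmax over the key axis M (the last axis).
logits_BHLM = logits_BHLM - logits_BHLM.max(axis=-1, keepdims=True)
weights_BHLM = np.exp(logits_BHLM)
weights_BHLM /= weights_BHLM.sum(axis=-1, keepdims=True)

print(weights_BHLM.shape)
```

A mismatch between a suffix and an einsum subscript is visible at a glance, which is most of the benefit: the check happens while reading, before anything runs.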
However, there are more heavily-engineered alternatives. Alexander Rush argues for NamedTensor. Implementations:
- Native PyTorch Named Tensor
- namedtensor (PyTorch)
- labeledtensor (TensorFlow)
6 Scaling up
7 Incoming
- lab-ml/nn: 🧠 Implementations/tutorials of deep learning papers with side-by-side notes; including transformers (original, xl, switch, feedback), optimizers (adam, radam, adabelief), gans (dcgan, cyclegan), reinforcement learning (ppo, dqn), capsnet, sketch-rnn, etc.
- labml.ai Neural Networks
- ApplyingML - Papers, Guides, and Interviews with ML practitioners
- Tianyi Zhang: Interactive Debugging and Testing Support for Deep Learning
8 Pre-computed/trained models
These are all hopelessly outdated now, in the era of HuggingFace.
Caffe format:
The Caffe Zoo has lots of nice pre-trained models, listed on their wiki.
Here’s a great CV one, Andrej Karpathy’s image captioner, Neuraltalk2
For the NVC dataset: (pre-trained feature model here)
For Lasagne: https://github.com/Lasagne/Recipes/tree/master/modelzoo
For Keras:
9 NN Software
This choice is becoming less relevant in the era of easy translation via LLMs. I have used:
- PyTorch
- Julia
- JAX
- Occasionally, reluctantly, TensorFlow
I could use any of the other autodiff systems, such as…
- Theano (Python), now defunct, was a trailblazer
- Torch (Lua) — in practice deprecated in favour of PyTorch
- Caffe (MATLAB/Python) was popular for a while; I have not seen it recently
- PaddlePaddle is one of Baidu’s NN properties (Python/C++)
- MindSpore is Huawei’s framework, based on source-transformation autodiff; it targets interesting edge hardware.
- JavaScript: see JavaScript machine learning