Successor to Lua’s torch. Evil twin to Google’s Tensorflow. Intermittently ascendant over Tensorflow for researchers, if not for industrial uses.

They claim to aim for fancy applications, which are easier in pytorch’s dynamic graph construction style. That style resembles (in outcome if not implementation details) the dynamic styles of jax, most julia autodiffs, and tensorflow in “eager” mode.

PyTorch has a unique [sic] way of building neural networks: using and replaying a tape recorder.

Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world. One has to build a neural network, and reuse the same structure again and again. Changing the way the network behaves means that one has to start from scratch. [… Pytorch] allows you to change the way your network behaves arbitrarily with zero lag or overhead.

Of course the overhead is not truly zero; rather they have shifted the overhead baseline down a little. Discounting their hyperbole, it still provides relatively convenient autodiff.
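To make the dynamic style concrete, here is a minimal sketch (my own toy example, not taken from the pytorch docs) in which the number of recorded operations depends on the data. The function name f and the threshold 10 are arbitrary choices for illustration.

```python
import torch


def f(x):
    # Ordinary Python control flow: the graph is recorded as the code runs,
    # so the number of doublings depends on the value of x.
    y = x
    while y.norm() < 10:
        y = y * 2
    return y.sum()


x = torch.tensor([1.0, 2.0], requires_grad=True)
out = f(x)
out.backward()  # replay the tape backwards
print(out.item(), x.grad)  # 24.0 tensor([8., 8.])
```

Here the loop doubles the input three times before the norm exceeds the threshold, so the gradient of the sum with respect to each input is 8.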

The price we pay is that they have chosen different names for all the mathematical functions I use than either tensorflow or numpy, so there is pointless friction in swapping between these frameworks. Presumably that is a low-key tactic to engineer a captive audience.
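A small illustration of the naming friction, as a sketch using only torch calls, with the numpy names I would have expected given in the comments:

```python
import torch

x = torch.tensor([1, 2])

# numpy spells this np.tile(x, 2); torch calls it repeat.
tiled = x.repeat(2)

# numpy spells this np.repeat(x, 2); torch calls it repeat_interleave.
repeated = torch.repeat_interleave(x, 2)

# numpy spells this np.concatenate([x, x], axis=0); torch calls it cat
# and names the keyword dim.
joined = torch.cat([x, x], dim=0)

print(tiled.tolist(), repeated.tolist(), joined.tolist())
```

The trap is that repeat exists in both libraries but means different things.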

Getting started

An incredible feature of pytorch is its documentation, which is clear and consistent and somewhat comprehensive. That is hopefully no longer a massive advantage over Tensorflow, whose documentation was garbled nonsense when I was using it.

Custom functions

There is (was?) some bad advice in the manual:

nn exports two kinds of interfaces — modules and their functional versions. You can extend it in both ways, but we recommend using modules for all kinds of layers, that hold any parameters or buffers, and recommend using a functional form parameter-less operations like activation functions, pooling, etc.

Important missing information:

If my desired loss is already just a composition of existing functions, I don’t need to define a Function subclass.

And: The given options are not a binarism but two things you need to do in concert. A better summary would be:

  • If you need to have a function which is differentiable in a non-trivial way, implement a Function

  • If you need to bundle a Function with some state or differentiable parameters, additionally wrap it in a nn.Module

  • Some people claim you can also create custom layers using plain python functions. However, these don’t work as layers in an nn.Sequential model, so I’m not sure how to take this advice.
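The summary above can be sketched as follows. This is a toy example under my own assumptions: RoundSTE (a straight-through rounding op) and Quantize (a layer with a learnable step size) are hypothetical names invented for illustration, not anything shipped with pytorch.

```python
import torch
from torch import nn


class RoundSTE(torch.autograd.Function):
    """Round in the forward pass; pass gradients straight through."""

    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Non-trivial derivative: pretend rounding was the identity map.
        return grad_output


class Quantize(nn.Module):
    """Bundles RoundSTE with a learnable step size (hypothetical layer)."""

    def __init__(self, step=0.5):
        super().__init__()
        self.step = nn.Parameter(torch.tensor(step))

    def forward(self, x):
        return RoundSTE.apply(x / self.step) * self.step


def my_loss(pred):
    # A plain composition of existing ops needs no Function subclass.
    return pred.pow(2).sum()


# Because Quantize is a Module, it slots into nn.Sequential.
model = nn.Sequential(nn.Linear(3, 3), Quantize())
loss = my_loss(model(torch.randn(4, 3)))
loss.backward()
```

After the backward call, both the linear weights and the step parameter have gradients, even though rounding has zero derivative almost everywhere.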

It’s just as well it’s easy to roll your own recurrent nets because the default implementations are bad

The default RNN layer is heavily optimised using cuDNN, which is sweet, but for some complicated technical reason I do not give an arse about, it offers a choice of only 2 activation functions, and neither of them is “linear”.

Fairly sure this is no longer true. However, the default RNNs are still a little weird, and assume a 1-dimensional state vector. A DIY approach might fix this. Recent pytorch includes JITed RNNs, which might even make this performant. I have not used them.
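For the record, a DIY recurrent layer is only a few lines. The following is a minimal sketch of an Elman-style recurrence with a pluggable activation; DIYRNN and its attribute names are hypothetical, and it makes no attempt at cuDNN or JIT performance.

```python
import torch
from torch import nn


class DIYRNN(nn.Module):
    """Elman-style recurrence with a user-chosen activation (a sketch)."""

    def __init__(self, input_size, hidden_size, activation=None):
        super().__init__()
        self.in2hid = nn.Linear(input_size, hidden_size)
        self.hid2hid = nn.Linear(hidden_size, hidden_size, bias=False)
        # nn.Identity gives the "linear" activation.
        self.activation = activation if activation is not None else nn.Identity()

    def forward(self, x, h=None):
        # x has shape (time, batch, input_size).
        if h is None:
            h = x.new_zeros(x.shape[1], self.hid2hid.in_features)
        outputs = []
        for x_t in x:  # iterate over time steps
            h = self.activation(self.in2hid(x_t) + self.hid2hid(h))
            outputs.append(h)
        return torch.stack(outputs), h


rnn = DIYRNN(input_size=3, hidden_size=5)
seq = torch.randn(7, 2, 3)
out, h_last = rnn(seq)
print(out.shape, h_last.shape)
```

Swapping in activation=nn.Tanh() recovers the usual behaviour; a state with more than one dimension would need the two linear maps replaced by something shaped to match.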

Logging training

Leveraging tensorflow’s handy diagnostic GUI, tensorboard: tensorboardX or perhaps tensorboard-logger.

Just use lightning.

Visualising network graphs

Fiddly. The official way is via ONNX.

conda install -c ezyang onnx pydot # or
pip install onnx pydot

Then one can use various graphical model diagram tools.

brew install --cask netron # or
pip install netron
brew install graphviz

Also available, pytorchviz and tensorboardX support visualizing pytorch graphs.

pip install git+
from torchviz import make_dot  # the pytorchviz project installs a module named torchviz
y = model(x)
make_dot(y, params = dict(model.named_parameters()))



amirgholami/PyHessian: PyHessian is a Pytorch library for second-order based analysis and training of Neural Networks (Yao et al. 2020)

PyHessian is a pytorch library for Hessian based analysis of neural network models. The library enables computing the following metrics:

  • Top Hessian eigenvalues
  • The trace of the Hessian matrix
  • The full Hessian Eigenvalues Spectral Density (ESD)
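To give a feel for the first of these metrics, here is a sketch of a top-Hessian-eigenvalue estimate using power iteration on Hessian-vector products. This is plain torch.autograd, not PyHessian’s API; the helper names hvp and top_eigenvalue and the toy quadratic loss are my own assumptions.

```python
import torch


def hvp(loss_fn, w, v):
    """Hessian-vector product via double backprop."""
    loss = loss_fn(w)
    (grad,) = torch.autograd.grad(loss, w, create_graph=True)
    (hv,) = torch.autograd.grad(grad, w, grad_outputs=v)
    return hv


def top_eigenvalue(loss_fn, w, iters=100):
    """Power iteration on the Hessian of loss_fn at w."""
    v = torch.ones_like(w)
    v = v / v.norm()
    eig = torch.zeros((), dtype=w.dtype)
    for _ in range(iters):
        hv = hvp(loss_fn, w, v)
        eig = torch.dot(v, hv)  # Rayleigh quotient for the current v
        v = hv / hv.norm()
    return eig.item()


# Toy loss 0.5 * w^T A w, whose Hessian is exactly A.
A = torch.diag(torch.tensor([1.0, 2.0, 5.0], dtype=torch.float64))


def quadratic(w):
    return 0.5 * w @ A @ w


w = torch.zeros(3, dtype=torch.float64, requires_grad=True)
lam = top_eigenvalue(quadratic, w)
print(lam)
```

The Hessian is never materialised, which is the point: each iteration costs about two backward passes, so the same trick scales to real networks.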

backpack (Dangel, Kunstner, and Hennig 2019)

Provided quantities include:

  • Individual gradients from a mini-batch
  • Estimates of the gradient variance or second moment
  • Approximate second-order information (diagonal and Kronecker approximations)

Motivation: Computation of most quantities is not necessarily expensive (often just a small modification of the existing backward pass where backpropagated information can be reused). But it is difficult to do in the current software environment.

f-dangel/backpack: BackPACK - a backpropagation package built on top of PyTorch which efficiently computes quantities other than the gradient.


Lightning is the default training/utility framework for Pytorch.

Lightning is a very lightweight wrapper on PyTorch that decouples the science code from the engineering code. It’s more of a style-guide than a framework. By refactoring your code, we can automate most of the non-research code.

To use Lightning, simply refactor your research code into the LightningModule format (the science) and Lightning will automate the rest (the engineering). Lightning guarantees tested, correct, modern best practices for the automated parts.

  • If you are a researcher, Lightning is infinitely flexible, you can modify everything down to the way .backward is called or distributed is set up.
  • If you are a scientist or production team, lightning is very simple to use with best practice defaults.

Why do I want to use lightning?

Every research project starts the same, a model, a training loop, validation loop, etc. As your research advances, you’re likely to need distributed training, 16-bit precision, checkpointing, gradient accumulation, etc.

Lightning sets up all the boilerplate state-of-the-art training for you so you can focus on the research.

This is a good introduction to the strengths and weaknesses of lightning: “Every research project starts the same, a model, a training loop, validation loop” stands in opposition to “Lightning is infinitely flexible”. An alternative description with different emphasis: “Lightning can handle many ML projects that naturally factor into a single training loop but does not help so much for other projects.”

If my project does have this flavour it is extremely useful and will do all kinds of easy parallelisation, natural code organisation and so forth. But if I am doing something like posterior sampling, or nested iterations, or optimisation at inference time, I find myself spending more time fighting the framework than working with it.

If I want the generic scaling up, I might find myself trying one of the generic solutions like Horovod.

Compare and contrast with ignite?

Lightning tips

Like python itself, much messy confusion is involved in making everything seem tidy and obvious.

The Trainer class is hard to understand because it is an object defined across many files and mixins with confusing names.

One useful thing to know is that a Trainer has a model member which contains the actual LightningModule that I am training.

If I subclass ModelCheckpoint then I feel like the on_save_checkpoint method should be called as often as _save_model, but it is not. TODO: investigate this.

on_train_batch_end does not get access to anything output by the batch AFAICT, only the epoch-end callback gets the output argument filled in. See the code comments.


pytorch + Bayesian inference = pyro, a probabilistic programming framework.

Pyro launch announcement:

We believe the critical ideas to solve AI will come from a joint effort among a worldwide community of people pursuing diverse approaches. By open sourcing Pyro, we hope to encourage the scientific world to collaborate on making AI tools more flexible, open, and easy-to-use. We expect the current (alpha!) version of Pyro will be of most interest to probabilistic modelers who want to leverage large data sets and deep networks, PyTorch users who want easy-to-use Bayesian computation, and data scientists ready to explore the ragged edge of new technology.

I would argue it is the probabilistic programming framework, because it is the one that seems to have acquired critical mass while the others remain Balkanised and grumpy.


I do not do much NLP but if I did I might use the helpful utility functions in AllenNLP.

Outside of NLP there is a system of params and registrable which is very handy for defining various experiments via easy JSON config. Pro-tip from Aladair Tran: use YAML, it is even nicer because it can handle comments.

That is handy, but beware, AllenNLP is a heavy dependency because it imports megabytes of code that is mostly about NLP, and some of that code has fragile dependencies. Perhaps the native pytorch lightning YAML config is enough.


One can hack the backward gradient to impose them, but why not just use one of the pre-rolled ones by Szymon Maszke?

DSP in pytorch

I am thinking especially of audio. Not too bad. Keunwoo Choi has some beautiful examples, e.g. Inverse STFT, Harmonic Percussive separation.

Today we have torchaudio or, alternatively, from Dorien Herremans’ lab, nnAudio (Source), which is similar but has fewer dependencies.
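Even without those libraries, core pytorch has an STFT and its inverse built in. A minimal round-trip sketch follows; the sample rate, FFT size and hop length are arbitrary choices of mine.

```python
import math

import torch

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)

# One second of a 440 Hz sine at a 16 kHz sample rate.
t = torch.arange(16000) / 16000.0
signal = torch.sin(2 * math.pi * 440.0 * t)

# Complex spectrogram of shape (n_fft // 2 + 1, n_frames).
spec = torch.stft(signal, n_fft=n_fft, hop_length=hop, window=window,
                  return_complex=True)

# Inverse transform back to the original length.
recon = torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window,
                    length=signal.shape[0])

print(spec.shape, (recon - signal).abs().max().item())
```

Both operations are differentiable, so a loss defined on the spectrogram can be backpropagated to the waveform.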

Other libraries

Pytorch ships with a lot of included functionality, so you don’t necessarily need to wrap it in anything else. Nonetheless, you can. Here are some frameworks that I have encountered.

There are more indexed on the Pytorch ecosystem page.


Einops makes life better; it reshapes and operates on tensors in an intuitive and explicit way. It is not specific to pytorch, but the best tutorials are for pytorch.



fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches. It aims to do both things without substantial compromises in ease of use, flexibility, or performance. This is possible thanks to a carefully layered architecture, which expresses common underlying patterns of many deep learning and data processing techniques in terms of decoupled abstractions. These abstractions can be expressed concisely and clearly by leveraging the dynamism of the underlying Python language and the flexibility of the PyTorch library. fastai includes:

  • A new type dispatch system for Python along with a semantic type hierarchy for tensors
  • A GPU-optimized computer vision library which can be extended in pure Python
  • An optimizer which refactors out the common functionality of modern optimizers into two basic pieces, allowing optimization algorithms to be implemented in 4–5 lines of code
  • A novel 2-way callback system that can access any part of the data, model, or optimizer and change it at any point during training
  • A new data block API


Like other deep learning frameworks, there is some basic NLP support in pytorch; see pytorch.text.

flair is a commercially-backed NLP framework.


Pump graphs to a visualisation server. Not pytorch-specific, but seems well-integrated. See visdom.


Pytorch integrates into some PDE solvers. See Machine learning PDEs.


pyprob: (Le, Baydin, and Wood 2017)

pyprob is a PyTorch-based library for probabilistic programming and inference compilation. The main focus of this library is on coupling existing simulation codebases with probabilistic inference with minimal intervention.

The main advantage of pyprob, compared against other probabilistic programming languages like Pyro, is a fully automatic amortized inference procedure based on importance sampling. pyprob only requires a generative model to be specified. Particularly, pyprob allows for efficient inference using inference compilation which trains a recurrent neural network as a proposal network.

In Pyro such an inference network requires the user to explicitly define the control flow of the network, which is due to Pyro running the inference network and generative model sequentially. However, in pyprob the generative model and inference network runs concurrently. Thus, the control flow of the model is directly used to train the inference network. This alleviates the need for manually defining its control flow.

The flagship application seems to be etalumis (Baydin et al. 2019) a probabilistic programming framework with emphasis AFAICT on Bayesian inverse problems.


Kornia is a differentiable computer vision library for pytorch. It includes such niceties as differentiable image warping via the grid_sample thing.
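For reference, the underlying grid_sample mechanism looks like this in plain torch.nn.functional. This is a sketch of the primitive, not Kornia’s own API; the 4×4 test image and the identity warp are my choices for illustration.

```python
import torch
import torch.nn.functional as F

# A tiny image with shape (batch, channels, height, width).
img = torch.arange(16.0).reshape(1, 1, 4, 4)

# A 2x3 affine matrix per batch element; here the identity warp.
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]], requires_grad=True)

# Build the sampling grid, then resample the image on it.
grid = F.affine_grid(theta, size=img.shape, align_corners=False)
warped = F.grid_sample(img, grid, mode="bilinear", align_corners=False)

# Gradients flow back to the warp parameters.
warped.sum().backward()
print(warped.squeeze(), theta.grad.shape)
```

With the identity matrix the warped image equals the input; replacing theta with a rotation or translation, and optimising it against an image loss, is the basic recipe for differentiable image registration.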


Memory leaks

Apparently you use the normal python garbage collector to analyse them.

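import gc

import torch

# List every tensor (or tensor-wrapping object) the garbage collector knows about.
for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        # Some objects raise on attribute access; skip them.
        pass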
See also usual python debugging. NB: VS Code has integrated pytorch debugging support.


Baydin, Atılım Güneş, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, et al. 2019. “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale.” In arXiv:1907.03382 [cs, Stat].
Cheuk, Kin Wai, Kat Agres, and Dorien Herremans. 2019. “nnAUDIO: A Pytorch Audio Processing Tool Using 1d Convolution Neural Networks,” 2.
Dangel, Felix, Frederik Kunstner, and Philipp Hennig. 2019. “BackPACK: Packing More into Backprop.” In International Conference on Learning Representations.
Le, Tuan Anh, Atılım Güneş Baydin, and Frank Wood. 2017. “Inference Compilation and Universal Probabilistic Programming.” In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 54:1338–48. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR.
Yao, Zhewei, Amir Gholami, Kurt Keutzer, and Michael Mahoney. 2020. “PyHessian: Neural Networks Through the Lens of the Hessian.” In arXiv:1912.07145 [cs, Math].
