Automatic differentiation

2016-07-27 — 2025-10-13

Wherein automatic differentiation is described via dual numbers, Taylor‑series formulations and reverse‑mode backpropagation, and implementations such as JAX (Python) and Enzyme (LLVM‑level) are noted.

algebra
calculus
computers are awful
functional analysis
linear algebra
number crunching
optimization
Figure 1: Gradient field in Python

Getting a computer to tell us the gradient of a function without resorting to finite-difference approximation or hand-coding an analytic derivative. We usually mean automatic forward- or reverse-mode differentiation, which isn’t, strictly speaking, a symbolic technique — though symbolic differentiation gets an incidental look-in, and these ideas are related.

Infinitesimal/Taylor-series formulations, the related dual-number formulations, and even fancier hyperdual formulations. Reverse-mode, a.k.a. backpropagation, versus forward-mode, etc. Computational complexity of all of the above.

There are many ways to do automatic differentiation, and I won’t attempt to comprehensively introduce the various approaches. This is a well-ploughed field. There’s lots of good material out there already with fancy diagrams and the like. Symbolic, numeric, dual/forward, backward mode… Notably, we don’t have to choose between them — e.g. we can use forward differentiation to calculate an expedient step in the middle of backward differentiation.

You might want to do this for ODE quadrature or sensitivity analysis, or for optimization (batch or SGD), especially in neural networks, matrix factorizations, variational approximation, etc. This isn’t news these days, but it took a stunningly long time to become common after its inception in the 1970s. See, e.g. Justin Domke, who called automatic differentiation the most criminally underused tool in the machine learning toolbox. (That escalated quickly.) See also a timely update by Tim Vieira.

There’s a beautiful explanation of reverse-mode basics by Sanjeev Arora and Tengyu Ma. See also Mike Innes’ hands-on introduction, or his terse, opinionated introductory paper, Innes (2018), or Jingnan Shi’s excellent Automatic Differentiation: Forward and Reverse. There is a well-established terminology for sensitivity analysis discussing adjoints, e.g. Steven Johnson’s class notes, and his references (Johnson 2012; Errico 1997; Cao et al. 2003).

1 Terminology zoo

Too many words meaning the same thing, and some quirky broad usages. We need some disambiguation.

2 Who invented backpropagation?

There’s an adorable cottage industry arguing about who first applied reverse-mode autodiff to networks. See, e.g. Schmidhuber’s blog post, Griewank (2012) and Schmidhuber (2015), a reddit thread and so on.

3 Computational complexity

🏗

4 Forward- versus reverse-mode

🏗
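In the meantime, a minimal sketch of the practical difference, using JAX’s `jvp` (forward) and `vjp` (reverse) transforms on a made-up toy function: forward mode pushes a tangent vector through the program and costs roughly one extra evaluation per input direction; reverse mode pulls a cotangent back and recovers the whole gradient of a scalar output in one pass.

```python
import jax
import jax.numpy as jnp

def f(x):
    # toy vector-valued function R^2 -> R^2
    return jnp.array([jnp.sin(x[0]) * x[1], x[0] ** 2 + x[1] ** 3])

x = jnp.array([1.0, 2.0])

# Forward mode: push a tangent v through f, giving the Jacobian-vector product J @ v.
v = jnp.array([1.0, 0.0])
y, jvp_out = jax.jvp(f, (x,), (v,))

# Reverse mode: pull a cotangent u back through f, giving u @ J.
y, vjp_fn = jax.vjp(f, x)
(vjp_out,) = vjp_fn(jnp.array([1.0, 1.0]))

print(jvp_out)  # first column of the Jacobian
print(vjp_out)  # sum of the Jacobian's rows
```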

TaylorSeries.jl is an implementation of high-order automatic differentiation, as presented in the book by W. Tucker (Tucker 2011). The general idea is the following.

The Taylor series expansion of an analytical function \(f(t)\) with one independent variable \(t\) around \(t_0\) can be written as

\[ f(t) = f_0 + f_1 (t-t_0) + f_2 (t-t_0)^2 + \cdots + f_k (t-t_0)^k + \cdots, \] where \(f_0=f(t_0)\), and the Taylor coefficients \(f_k = f_k(t_0)\) are the \(k\)th normalized derivatives at \(t_0\):

\[ f_k = \frac{1}{k!} \frac{{\rm d}^k f} {{\rm d} t^k}(t_0). \]

Thus, computing the high-order derivatives of \(f(t)\) is equivalent to computing its Taylor expansion. … Arithmetic operations involving Taylor series can be expressed as operations on the coefficients.
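To make “operations on the coefficients” concrete, here is a toy first-order dual-number class in plain Python. This is my own illustration, not TaylorSeries.jl, which tracks coefficients to arbitrary order; truncating at order one recovers ordinary forward-mode AD.

```python
# Toy forward-mode AD via first-order dual numbers: carry (value, derivative)
# through each arithmetic operation using the chain rule.
import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

# d/dt [t * sin(t) + 3] at t = 2, seeded with dot = 1
t = Dual(2.0, 1.0)
y = t * sin(t) + 3
print(y.val, y.dot)  # derivative is sin(2) + 2*cos(2)
```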

5 Symbolic differentiation

If we have already calculated the symbolic derivative, we can of course use it as a kind of automatic derivative. It might even be faster.

We can automate calculation of symbolic derivatives. Symbolic math packages such as Sympy, MAPLE and Mathematica can all do actual symbolic differentiation, which is a different approach, but can sometimes produce the same result. I haven’t tried Sympy or MAPLE; Mathematica’s support for matrix calculus is weak, and since I usually need matrix derivatives, I haven’t automated this task.
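For scalar expressions the workflow is straightforward; an (untested) sketch of the standard SymPy calls, differentiating symbolically and then compiling the result to a fast numerical function:

```python
# Symbolic differentiation with SymPy, then compile the derivative with lambdify.
import sympy as sp

x, y = sp.symbols("x y")
expr = sp.exp(x * y) * sp.sin(x)

dexpr_dx = sp.diff(expr, x)                    # symbolic derivative
grad = [sp.diff(expr, v) for v in (x, y)]      # symbolic gradient

f_grad = sp.lambdify((x, y), grad, modules="numpy")
print(dexpr_dx)
print(f_grad(1.0, 2.0))
```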

6 In implicit targets

Long story. For use in, e.g., Implicit NN.

There’s a beautiful explanation in Blondel et al. (2021).

To do: investigate Benoît Pasquier’s (Pasquier and Primeau 2019) F-1 Method.

This package implements the F-1 algorithm […] It allows for efficient quasi-auto-differentiation of an objective function defined implicitly by the solution of a steady-state problem.
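Underneath both of these sits the implicit function theorem: if \(x^\star(\theta)\) solves \(F(x^\star(\theta), \theta) = 0\), then \(\partial_\theta x^\star = -(\partial_x F)^{-1}\,\partial_\theta F\), so we can differentiate the solution without backpropagating through the solver iterations. A hand-rolled scalar sketch (not the Blondel et al. or Pasquier APIs; the residual and fixed-point solver here are made up for illustration):

```python
import jax
import jax.numpy as jnp

def F(x, theta):
    # residual whose root x*(theta) defines the quantity of interest
    return x - jnp.tanh(theta * x) - 0.5

def solve(theta, x0=0.0, iters=100):
    # crude fixed-point solver standing in for any black-box solver
    x = x0
    for _ in range(iters):
        x = jnp.tanh(theta * x) + 0.5
    return x

def dxstar_dtheta(theta):
    x_star = solve(theta)
    dF_dx = jax.grad(F, argnums=0)(x_star, theta)
    dF_dth = jax.grad(F, argnums=1)(x_star, theta)
    return -dF_dth / dF_dx  # implicit function theorem, scalar case

print(dxstar_dtheta(0.8))
```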

7 In ODEs

See learning ODEs and differentiable PDE solvers.

8 Method of adjoints

See recursive identification.

9 Hessians in neural nets

We’re getting better at estimating second-order derivatives in increasingly adverse circumstances. For example, see the pytorch Hessian tools.
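The workhorse trick is the Hessian-vector product via double backprop, which costs roughly two gradient passes and never materializes the Hessian. A minimal PyTorch sketch (the loss here is made up for illustration):

```python
import torch

def loss_fn(w):
    # made-up smooth loss
    return torch.sin(w).sum() + (w ** 4).sum()

w = torch.randn(5, requires_grad=True)
v = torch.randn(5)

# First backward pass keeps the graph so we can differentiate the gradient itself.
(grad,) = torch.autograd.grad(loss_fn(w), w, create_graph=True)
# Second backward pass through grad . v gives the Hessian-vector product H @ v.
(hvp,) = torch.autograd.grad(grad @ v, w)
print(hvp)

# Roughly equivalent built-in helper:
# torch.autograd.functional.hvp(loss_fn, w, v)
```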

10 As message-passing

11 Software

In decreasing order of relevance to me.

11.1 jax

jax (Python) is a successor to classic Python autograd.

JAX is Autograd and XLA, brought together for high-performance machine learning research.

I use it a lot — see JAX.
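A minimal taste of why: gradients, compilation and vectorization are composable function transforms (a small sketch; the loss and data are made up):

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))                   # reverse-mode gradient, XLA-compiled
per_example = jax.vmap(jax.grad(loss), (None, 0, 0))  # per-sample gradients

w = jnp.zeros(3)
x = jnp.ones((10, 3))
y = jnp.ones(10)
print(grad_loss(w, x, y))          # shape (3,)
print(per_example(w, x, y).shape)  # (10, 3)
```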

11.2 PyTorch

See pytorch.

It’s another neural-net-style library like TensorFlow, but with dynamic graph construction as in autograd.
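A small sketch of what dynamic graph construction buys: data-dependent Python control flow just works, because the tape is rebuilt on every run (example made up for illustration):

```python
import torch

x = torch.tensor(1.5, requires_grad=True)

# Ordinary Python control flow; the recorded graph can depend on the data.
y = x
while y.norm() < 10.0:
    y = y * 2 + torch.sin(y)

y.backward()
print(x.grad)
```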

11.3 Julia

Julia has an embarrassment of different autodiff methods (homoiconicity and introspection make this comparatively easy), and it’s not always clear what the selling points of each are.

Anyway, there’s enough going on that it needs its own page. See Julia Autodiff.

11.4 Tinygrad

tinygrad/tinygrad: You like pytorch? You like micrograd? You love tinygrad! ❤️

Despite tinygrad’s size, it is a fully featured deep learning framework.

Due to its extreme simplicity, it is the easiest framework to add new accelerators to, with support for both inference and training. If XLA is CISC, tinygrad is RISC.

11.5 Tensorflow

We’re not big fans, but it certainly works. See Tensorflow. FYI, there’s an interesting discussion of its workings in the TensorFlow Jacobians feature-request ticket.

11.6 Aesara

Aesara at a Glance

Aesara is a Python library that lets us define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays. It can use GPUs and perform efficient symbolic differentiation.

This is a fork of the original Theano library that’s maintained by the PyMC team.

  • A hackable, pure-Python codebase
  • Extensible graph framework suitable for rapid development of custom symbolic optimizations
  • Implements an extensible graph transpilation framework that currently provides compilation to C and JAX JITed Python functions
  • Built on top of one of the most widely-used Python tensor libraries: Theano

Aesara combines aspects of a computer algebra system (CAS) with aspects of an optimizing compiler. It can also generate customized C code for many mathematical operations. This combination of CAS with optimizing compilation is particularly useful for tasks in which complicated mathematical expressions are evaluated repeatedly and evaluation speed is critical. For situations where many different expressions are each evaluated once Aesara can minimize the amount of compilation/analysis overhead, but still provide symbolic features such as automatic differentiation.
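A minimal sketch of the Theano-style workflow this describes (build a symbolic graph, take a symbolic gradient, compile to a callable), assuming the Theano-compatible `aesara.grad` / `aesara.function` API:

```python
import aesara
import aesara.tensor as at

x = at.dvector("x")
y = at.sum(x ** 2 + at.sin(x))
gy = aesara.grad(y, x)               # symbolic gradient on the graph

f = aesara.function([x], [y, gy])    # compiled (to C or JAX, per the blurb above)
print(f([1.0, 2.0, 3.0]))
```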

11.7 taichi

Taichi is a physics-simulation-and-graphics-oriented library that cleverly compiles to various backends and is embedded in Python:

As a data-oriented programming language, Taichi decouples computation from data organization. For example, you can freely switch between arrays of structures (AOS) and structures of arrays (SOA), or between multi-level pointer arrays and simple dense arrays. Taichi has native support for sparse data structures, and the Taichi compiler effectively simplifies data structure accesses. This allows users to compose data organization components into complex hierarchical and sparse structures. The Taichi compiler optimizes data access.

We have developed 10 different differentiable physical simulators using Taichi, for deep learning and robotics tasks. Thanks to the built-in reverse-mode automatic differentiation system, most of these differentiable simulators are developed within only 2 hours. Accurate gradients from these differentiable simulators make controller optimization orders of magnitude faster than reinforcement learning.

11.8 Classic Python autograd

I wouldn’t use Classic Python autograd any longer. A better-supported, near drop-in replacement is jax, which is much faster and better documented.

autograd

can automatically differentiate native Python and Numpy code. It can handle a large subset of Python’s features, including loops, ifs, recursion and closures, and it can even take derivatives of derivatives of derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it can efficiently take gradients of scalar-valued functions with respect to array-valued arguments. The main intended application is gradient-based optimization.
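For reference, the classic usage pattern, which jax imitated almost verbatim (a small sketch with a made-up loss):

```python
import autograd.numpy as np   # thinly wrapped numpy
from autograd import grad

def loss(w):
    return np.sum(np.tanh(w) ** 2)

d_loss = grad(loss)           # reverse-mode derivative function
print(d_loss(np.array([0.1, 0.5, -1.0])))
```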

As far as I can tell, it’s deprecated in favour of jax.

autograd-forward uses forward-mode differentiation to compute Jacobian-vector products and Hessian-vector products for scalar-valued loss functions, which is useful for classic optimization.

11.9 Micrograd

Andrej Karpathy’s teaching library micrograd is a 50-line scalar autograd library that’s great for learning.
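The entire interface is a scalar `Value` that records the compute graph as a side effect of arithmetic and then backpropagates through it; roughly:

```python
from micrograd.engine import Value

a = Value(2.0)
b = Value(-3.0)
c = (a * b + a ** 3).relu() + 1.0   # builds the graph as a side effect
c.backward()                        # reverse pass over the recorded graph

print(c.data)   # 3.0
print(a.grad)   # dc/da = b + 3*a**2 = 9.0
print(b.grad)   # dc/db = a = 2.0
```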

11.10 Enzyme

A generic, compiler-level AD that targets many languages.

Applying differentiable programming techniques and machine learning algorithms to foreign programs requires developers to either rewrite their code in a machine learning framework, or otherwise provide derivatives of the foreign code. This paper presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme synthesizes gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. …Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the art performance, enabling foreign code to be directly incorporated into existing machine learning workflows. (Moses and Churavy 2020)

An author says:

Basically the long story short is that Enzyme has a couple of interesting contributions:

  1. Low-level Automatic Differentiation (AD) IS possible and can be high performance
  2. By working at LLVM we get cross-language and cross-platform AD
  3. Working at the LLVM level actually can give more speedups (since it’s able to be performed after optimization)
  4. We made a plugin for PyTorch/TF that uses Enzyme to import foreign code into those frameworks with ease!

Sounds great, but I suspect that in practice there’s still a lot of work required to make this happen.

NB: I tried to find the PyTorch and TensorFlow bindings but couldn’t. Perhaps they’re discontinued? Julia bindings, Rust bindings and JAX bindings seem real though.

11.11 Theano

Mentioned for historical accuracy.

Theano (Python) supports autodiff as a basic feature and had a massive user base, but it’s now discontinued in favour of other options. See Aesara for a direct successor, and JAX, PyTorch and TensorFlow for some more widely used alternatives.

11.12 Casadi

A classic is CasADi (Python, C++, MATLAB) (Andersson et al. 2019):

a symbolic framework for numeric optimization implementing automatic differentiation in forward and reverse modes on sparse matrix-valued computational graphs. It supports self-contained C-code generation and interfaces state-of-the-art codes such as SUNDIALS, IPOPT etc. It can be used from C++, Python or Matlab

[…] CasADi is an open-source tool, written in self-contained C++ code, depending only on the C++ Standard Library.

Documentation is sparse; we should probably read the source or the published papers to understand how well this will fit our needs and, e.g. which arithmetic operations it supports.

It might be worth it for features such as graceful support for 100-fold nonlinear composition, for example. It also includes ODE sensitivity analysis (differentiating through ODE solvers), which predates lots of fancypants ‘neural ODEs’. The price we pay is a weird DSL that we must learn to use, and unlike many of its trendy peers, it has no GPU support.
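A tiny taste of that DSL, as I understand the basic API (`MX.sym`, `gradient`, `Function`); a sketch, not checked against current CasADi versions:

```python
import casadi as ca

x = ca.MX.sym("x", 3)                     # symbolic vector variable
f = ca.sumsqr(ca.sin(x)) + x[0] * x[2]    # scalar expression

grad_f = ca.gradient(f, x)                # symbolic (reverse-mode) gradient
F = ca.Function("F", [x], [f, grad_f])    # compiled callable

val, g = F([0.1, 0.2, 0.3])
print(val, g)
```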

11.13 KeOps

Filed under least squares, autodiff, GPs, PyTorch.

The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab, and R.

It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbors queries, N-body interactions, point cloud convolutions, and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into the RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a x10-x100 speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.
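The core idiom is the symbolic `LazyTensor`, which lets us write a kernel-vector product that never materializes the kernel matrix; a sketch following the documented Gaussian-kernel pattern (sizes made up):

```python
import torch
from pykeops.torch import LazyTensor

N, M, D = 10_000, 20_000, 3
x = torch.randn(N, D, requires_grad=True)
y = torch.randn(M, D)
b = torch.randn(M, 1)

x_i = LazyTensor(x[:, None, :])     # (N, 1, D) symbolic
y_j = LazyTensor(y[None, :, :])     # (1, M, D) symbolic
K_ij = (-((x_i - y_j) ** 2).sum(-1)).exp()  # Gaussian kernel, never stored

out = K_ij @ b                      # reduction over j, dense (N, 1) result
out.sum().backward()                # and it is differentiable end to end
print(x.grad.shape)
```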

11.14 ADOL

Another classic. ADOL-C is a popular C++ automatic-differentiation library with a Python binding. It looks clunky from Python but is quite usable from C++.

11.15 ceres solver

ceres-solver (C++), Google’s nonlinear least-squares solver, has some neat automatic-differentiation tricks, though unsurprisingly they are oriented towards least-squares losses.

11.16 audi

autodiff, usually referred to as audi for clarity, offers lightweight automatic differentiation for MATLAB. I think MATLAB now has a whole deep learning toolkit built in, which surely supports something natively in this domain.

11.17 algopy

algopy:

allows you to differentiate functions implemented as computer programs by using Algorithmic Differentiation (AD) techniques in the forward and reverse mode. The forward mode propagates univariate Taylor polynomials of arbitrary order. Hence it is also possible to use AlgoPy to evaluate higher-order derivative tensors.

Speciality of AlgoPy is the possibility to differentiate functions that contain matrix functions as +, -, *, /, dot, solve, qr, eigh, cholesky.

We think it looks sophisticated and indeed supports differentiation elegantly; but it isn’t very actively maintained, and the project’s source code is hard to find.

12 References

Andersson, Gillis, Horn, et al. 2019. CasADi: A Software Framework for Nonlinear Optimization and Optimal Control.” Mathematical Programming Computation.
Arya, Schauer, Schäfer, et al. 2022. Automatic Differentiation of Programs with Discrete Randomness.” In.
Baydin, Atilim Gunes, and Pearlmutter. 2014. Automatic Differentiation of Algorithms for Machine Learning.” arXiv:1404.7456 [Cs, Stat].
Baydin, Atilim Gunes, Pearlmutter, Radul, et al. 2018. Automatic Differentiation in Machine Learning: A Survey.” Journal of Machine Learning Research.
Baydin, Atılım Güneş, Pearlmutter, and Siskind. 2016. Tricks from Deep Learning.” arXiv:1611.03777 [Cs, Stat].
Blondel, Berthet, Cuturi, et al. 2021. Efficient and Modular Implicit Differentiation.” arXiv:2105.15183 [Cs, Math, Stat].
Bolte, and Pauwels. 2020. A Mathematical Model for Automatic Differentiation in Machine Learning.” In Advances in Neural Information Processing Systems.
Cao, Li, Petzold, et al. 2003. Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution.” SIAM Journal on Scientific Computing.
Carpenter, Hoffman, Brubaker, et al. 2015. The Stan Math Library: Reverse-Mode Automatic Differentiation in C++.” arXiv Preprint arXiv:1509.07164.
Charlier, Feydy, Glaunès, et al. 2021. Kernel Operations on the GPU, with Autodiff, Without Memory Overflows.” Journal of Machine Learning Research.
Dangel, Kunstner, and Hennig. 2019. BackPACK: Packing More into Backprop.” In International Conference on Learning Representations.
Eaton. 2022. Belief Propagation Generalizes Backpropagation.”
Errico. 1997. What Is an Adjoint Model? Bulletin of the American Meteorological Society.
Fike, and Alonso. 2011. The Development of Hyper-Dual Numbers for Exact Second-Derivative Calculations.” In 49th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition.
Fischer, and Saba. 2018. Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs.” arXiv:1810.09868 [Cs, Stat].
Gallier, and Quaintance. 2022. Algebra, Topology, Differential Calculus, and Optimization Theory For Computer Science and Machine Learning.
Giles. 2008. Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation.” In Advances in Automatic Differentiation.
Gower, and Gower. 2016. Higher-Order Reverse Automatic Differentiation with Emphasis on the Third-Order.” Mathematical Programming.
Griewank. 2012. Who Invented the Reverse Mode of Differentiation? Documenta Mathematica.
Griewank, and Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation.
Haro. 2008. Automatic Differentiation Methods in Computational Dynamical Systems: Invariant Manifolds and Normal Forms of Vector Fields at Fixed Points.” IMA Note.
Hu, Anderson, Li, et al. 2020. DiffTaichi: Differentiable Programming for Physical Simulation.” In ICLR.
Hu, Li, Anderson, et al. 2019. Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures.” ACM Transactions on Graphics.
Innes. 2018. Don’t Unroll Adjoint: Differentiating SSA-Form Programs.” arXiv:1810.07951 [Cs].
Ionescu, Vantzos, and Sminchisescu. 2016. Training Deep Networks with Structured Layers by Matrix Backpropagation.”
Jatavallabhula, Iyer, and Paull. 2020. ∇SLAM: Dense SLAM Meets Automatic Differentiation.” In 2020 IEEE International Conference on Robotics and Automation (ICRA).
Johnson. 2012. Notes on Adjoint Methods for 18.335.”
Kavvadias, Papoutsis-Kiachagias, and Giannakoglou. 2015. On the Proper Treatment of Grid Sensitivities in Continuous Adjoint Methods for Shape Optimization.” Journal of Computational Physics.
Kidger, Chen, and Lyons. 2021. ‘Hey, That’s Not an ODE’: Faster ODE Adjoints via Seminorms.” In Proceedings of the 38th International Conference on Machine Learning.
Kidger, Morrill, Foster, et al. 2020. Neural Controlled Differential Equations for Irregular Time Series.” arXiv:2005.08926 [Cs, Stat].
Laue, Mitterreiter, and Giesen. 2018. Computing Higher Order Derivatives of Matrix and Tensor Expressions.” In Advances in Neural Information Processing Systems 31.
Launay, Poli, Boniface, et al. 2020. Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures.” In Advances in Neural Information Processing Systems.
Liao, Liu, Wang, et al. 2019. Differentiable Programming Tensor Networks.” Physical Review X.
Li, Wong, Chen, et al. 2020. Scalable Gradients for Stochastic Differential Equations.” In International Conference on Artificial Intelligence and Statistics.
Maclaurin, Duvenaud, and Adams. 2015. Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In Proceedings of the 32nd International Conference on Machine Learning.
Margossian. 2019. A Review of Automatic Differentiation and Its Efficient Implementation.” WIREs Data Mining and Knowledge Discovery.
Mogensen, and Riseth. 2018. Optim: A Mathematical Optimization Package for Julia.” Journal of Open Source Software.
Moses, and Churavy. 2020. Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS’20.
Moses, Churavy, Paehler, et al. 2021. Reverse-Mode Automatic Differentiation and Optimization of GPU Kernels via Enzyme.” In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC ’21.
Moses, Narayanan, Paehler, et al. 2022. Scalable Automatic Differentiation of Multiple Parallel Paradigms Through Compiler Augmentation.” In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC ’22.
Neidinger. 2010. Introduction to Automatic Differentiation and MATLAB Object-Oriented Programming.” SIAM Review.
Neuenhofen. 2018. Review of Theory and Implementation of Hyper-Dual Numbers for First and Second Order Automatic Differentiation.” arXiv:1801.03614 [Cs].
Papoutsis-Kiachagias, Evangelos. 2013. “Adjoint Methods for Turbulent Flows, Applied to Shape or Topology Optimization and Robust Design.”
Papoutsis-Kiachagias, E. M., and Giannakoglou. 2016. Continuous Adjoint Methods for Turbulent Flows, Applied to Shape and Topology Optimization: Industrial Applications.” Archives of Computational Methods in Engineering.
Papoutsis-Kiachagias, E. M., Magoulas, Mueller, et al. 2015. Noise Reduction in Car Aerodynamics Using a Surrogate Objective Function and the Continuous Adjoint Method with Wall Functions.” Computers & Fluids.
Pasquier, and Primeau. 2019. The F-1 Algorithm for Efficient Computation of the Hessian Matrix of an Objective Function Defined Implicitly by the Solution of a Steady-State Problem.” SIAM Journal on Scientific Computing.
Prince. 2023. Understanding Deep Learning.
Rackauckas, Ma, Dixit, et al. 2018. A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions.” arXiv:1812.01892 [Cs].
Rall. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science 120.
Revels, Lubin, and Papamarkou. 2016. Forward-Mode Automatic Differentiation in Julia.” arXiv:1607.07892 [Cs].
Rumelhart, Hinton, and Williams. 1986. Learning Representations by Back-Propagating Errors.” Nature.
Scardapane. 2024. Alice’s Adventures in a differentiable wonderland: A primer on designing neural networks.
Schmidhuber. 2015. Deep Learning.” Scholarpedia.
Schüle, Simonis, Heyenbrock, et al. 2019. In-Database Machine Learning: Gradient Descent andTensor Algebra for Main Memory Database Systems.”
Shi, Hu, Lin, et al. 2024. Stochastic Taylor Derivative Estimator: Efficient Amortization for Arbitrary Differential Operators.” In.
Stapor, Fröhlich, and Hasenauer. 2018. Optimization and Uncertainty Analysis of ODE Models Using 2nd Order Adjoint Sensitivity Analysis.” bioRxiv.
Tucker. 2011. Validated numerics: a short introduction to rigorous computations.
Yao, Gholami, Keutzer, et al. 2020. PyHessian: Neural Networks Through the Lens of the Hessian.” In arXiv:1912.07145 [Cs, Math].
Zhang, Lipton, Li, et al. 2023. Dive into Deep Learning.