Automatic differentiation


Gradient field in Python

Getting your computer to tell you the gradient of a function, without resorting to finite difference approximation, or coding an analytic derivative by hand. We usually mean this in the sense of automatic forward or reverse mode differentiation, which is not, as such, a symbolic technique, but symbolic differentiation gets an incidental look-in, and these ideas do of course relate.

Infinitesimal/Taylor series formulations, the related dual number formulations, and even fancier hyperdual formulations. Reverse-mode, a.k.a. Backpropagation, versus forward-mode etc. Computational complexity of all the above.
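To make the dual-number idea concrete, here is a minimal, purely illustrative sketch of forward-mode differentiation in Python; it is not any particular library’s implementation.

from dataclasses import dataclass
import math

@dataclass
class Dual:
    val: float        # primal value
    eps: float = 0.0  # coefficient of the infinitesimal part, i.e. the derivative

    def _lift(self, other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.val + other.val, self.eps + other.eps)

    def __mul__(self, other):
        other = self._lift(other)
        # the product rule rides along in the eps slot
        return Dual(self.val * other.val,
                    self.val * other.eps + self.eps * other.val)

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.eps)

# derivative of f(x) = x*sin(x) + x at x = 2, seeded with eps = 1
x = Dual(2.0, 1.0)
y = x * sin(x) + x
print(y.val, y.eps)  # f(2) and f'(2) = sin(2) + 2*cos(2) + 1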

There are many ways you can do automatic differentiation, and I won’t attempt to comprehensively introduce the various approaches here. This is a well-ploughed field, and there is plenty of good material out there already, with fancy diagrams and the like. Symbolic, numeric, dual/forward, backwards mode… Notably, you don’t have to choose between them: you can, for example, use forward differentiation to calculate an expedient step in the middle of a backward-mode computation.

You might want to do this for ODE quadrature, for sensitivity analysis, or for optimisation, whether batch or SGD, especially in neural networks, matrix factorisations, variational approximation etc. This is not news these days, but it took a stunningly long time to become common after its inception in the… 1970s? See, e.g., Justin Domke, who claimed automatic differentiation to be the most criminally underused tool in the machine learning toolbox. (That escalated quickly.) See also a timely update by Tim Vieira.

There is a beautiful explanation of the basics of reverse mode by Sanjeev Arora and Tengyu Ma. See also Mike Innes’ hands-on introduction, or his terse, opinionated introductory paper, Innes (2018). There is a well-established terminology for sensitivity analysis discussing adjoints; see, e.g., Steven Johnson’s class notes and his references (Johnson 2012; Errico 1997; Cao et al. 2003).

Computational complexity

🏗

Forward- versus reverse-mode

🏗

Symbolic differentiation

If you have already calculated the symbolic derivative, you can of course use this as a kind of automatic derivative. It might even be faster.

Calculating symbolic derivatives can itself be automated. Symbolic math packages such as Sympy, MAPLE and Mathematica can all do actual symbolic differentiation, which is different again, but sometimes leads to the same thing. I haven’t tried Sympy or MAPLE, but Mathematica’s support for matrix calculus is weak, and since I usually need matrix derivatives, this particular task has not been automated for me.
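For what it’s worth, a minimal SymPy sketch (I haven’t stress-tested this beyond toy scalar examples) looks like:

import sympy as sp

x = sp.symbols('x')
expr = sp.sin(x) * sp.exp(x)
dexpr = sp.diff(expr, x)              # exp(x)*sin(x) + exp(x)*cos(x)
dfun = sp.lambdify(x, dexpr, 'math')  # compile the symbolic derivative to a plain numeric function
print(dexpr, dfun(1.0))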

Misc

To do: investigate unorthodox methods such as Benoît Pasquier’s F-1 Method. (source)

This package implements the F-1 algorithm described […] It allows for efficient quasi-auto-differentiation of an objective function defined implicitly by the solution of a steady-state problem.

Software

jax

jax (python) is a successor to classic python autograd.

JAX is Autograd and XLA, brought together for high-performance machine learning research.

With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

What’s new is that JAX uses XLA to compile and run your NumPy programs on GPUs and TPUs. Compilation happens under the hood by default, with library calls getting just-in-time compiled and executed. But JAX also lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API, jit. Compilation and automatic differentiation can be composed arbitrarily, so you can express sophisticated algorithms and get maximal performance without leaving Python.

Dig a little deeper, and you’ll see that JAX is really an extensible system for composable function transformations. Both grad and jit are instances of such transformations. Another is vmap for automatic vectorization, with more to come.

This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!

AFAICT the conda install command is

conda install -c conda-forge jaxlib

You don’t know JAX is a popular intro.
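A minimal sketch of those composable transformations, using a toy quadratic loss of my own rather than anything from the JAX docs:

import jax
import jax.numpy as jnp

def loss(w, x, y):
    # mean squared error of a linear model
    return jnp.mean((x @ w - y) ** 2)

grad_loss = jax.jit(jax.grad(loss))                 # reverse-mode gradient, JIT-compiled via XLA
per_example = jax.vmap(loss, in_axes=(None, 0, 0))  # vectorise the loss over a batch axis

w = jnp.ones(3)
x = jnp.ones((5, 3))
y = jnp.zeros(5)
print(grad_loss(w, x, y))      # gradient with respect to w
print(per_example(w, x, y))    # one loss value per example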

Tensorflow

See Tensorflow. FYI there is an interesting discussion of its workings in the TensorFlow Jacobians feature-request ticket.
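For reference, TF2’s eager-mode tape interface looks roughly like this; a toy example of mine, not taken from that ticket:

import tensorflow as tf

x = tf.Variable([1.0, 2.0])
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2              # vector-valued intermediate
    z = tf.reduce_sum(y)    # scalar loss

print(tape.gradient(z, x))  # gradient: [2., 4.]
print(tape.jacobian(y, x))  # full 2x2 (diagonal) Jacobian
del tape                    # persistent tapes should be released explicitly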

Pytorch

See pytorch.

Another neural-net-style framework like TensorFlow, but with dynamic graph construction, as in autograd.
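A minimal sketch of that define-by-run style (my own toy example):

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
# the graph is built on the fly as ordinary Python control flow executes
y = (x ** 2).sum() if x.sum() > 0 else (x ** 3).sum()
y.backward()
print(x.grad)  # tensor([2., 4.])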

Julia

Julia has an embarrassment of different methods of autodiff (homoiconicity and introspection make this comparatively easy), and it’s not always clear what the comparative selling points of each are.

Anyway, there is enough going on there that it needs its own page. See Julia Autodiff.

taichi

Taichi is a physics-simulation-and-graphics oriented library with clever compilation to various backends, embedded in python:

As a data-oriented programming language, Taichi decouples computation from data organization. For example, you can freely switch between arrays of structures (AOS) and structures of arrays (SOA), or between multi-level pointer arrays and simple dense arrays. Taichi has native support for sparse data structures, and the Taichi compiler effectively simplifies data structure accesses. This allows users to compose data organization components into complex hierarchical and sparse structures. The Taichi compiler optimizes data access.

We have developed 10 different differentiable physical simulators using Taichi, for deep learning and robotics tasks. Thanks to the built-in reverse-mode automatic differentiation system, most of these differentiable simulators are developed within only 2 hours. Accurate gradients from these differentiable simulators make controller optimization orders of magnitude faster than reinforcement learning.

Classic python autograd

autograd

can automatically differentiate native Python and Numpy code. It can handle a large subset of Python’s features, including loops, ifs, recursion and closures, and it can even take derivatives of derivatives of derivatives. It uses reverse-mode differentiation (a.k.a. backpropagation), which means it can efficiently take gradients of scalar-valued functions with respect to array-valued arguments. The main intended application is gradient-based optimization.

This is the most pythonic of the choices here; not as fast as TensorFlow, but simple to use, and it can differentiate more general things than TensorFlow can.
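A minimal usage sketch (close to the upstream examples, if memory serves):

import autograd.numpy as np   # thinly-wrapped numpy
from autograd import grad

def tanh(x):
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))

dtanh = grad(tanh)            # reverse-mode derivative of a scalar function
ddtanh = grad(dtanh)          # derivatives of derivatives also work
print(dtanh(1.0), ddtanh(1.0))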

autograd-forward mixes in forward-mode differentiation to calculate Jacobian-vector products and Hessian-vector products for scalar-valued loss functions, which is useful for classic optimization. AFAICT there are no guarantees about the computational efficiency of these.

Micrograd

Andrej Karpathy’s micrograd is a 50-line scalar autograd library, built for teaching, from which you can learn cool things.
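A usage sketch, assuming the Value class exposed in micrograd.engine:

from micrograd.engine import Value

a = Value(2.0)
b = Value(-3.0)
c = a * b + a ** 2             # builds a tiny computation graph of scalars
c.backward()                   # reverse-mode sweep
print(c.data, a.grad, b.grad)  # dc/da = b + 2a = 1, dc/db = a = 2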

Theano

Mentioned for historical accuracy.

Theano (Python) supported autodiff as a basic feature and had a massive user base, although it is now discontinued in favour of the next two…

algopy

algopy:

allows you to differentiate functions implemented as computer programs by using Algorithmic Differentiation (AD) techniques in the forward and reverse mode. The forward mode propagates univariate Taylor polynomials of arbitrary order. Hence it is also possible to use AlgoPy to evaluate higher-order derivative tensors.

A speciality of AlgoPy is the ability to differentiate functions that contain matrix operations such as +, -, *, /, dot, solve, qr, eigh, cholesky.

Looks sophisticated, and indeed supports differentiation elegantly; but not so actively maintained, and the source code is hard to find.

Casadi

A classic is CasADi (Python, C++, MATLAB) (Andersson et al. 2019):

a symbolic framework for numeric optimization implementing automatic differentiation in forward and reverse modes on sparse matrix-valued computational graphs. It supports self-contained C-code generation and interfaces state-of-the-art codes such as SUNDIALS, IPOPT etc. It can be used from C++, Python or Matlab

[…] CasADi is an open-source tool, written in self-contained C++ code, depending only on the C++ Standard Library. It is developed by Joel Andersson and Joris Gillis at the Optimization in Engineering Center, OPTEC of the K.U. Leuven under supervision of Moritz Diehl. CasADi is distributed under the LGPL license, meaning the code can be used royalty-free even in commercial applications.

The toolkit generates numerical code in C which can be compiled efficiently and invoked from Python/MATLAB/etc.

Documentation is sparse; you should probably read the source or the published papers to understand how well this will fit your needs and, e.g., which arithmetic operations it supports.

It might be worth it for such features as graceful support for 100-fold nonlinear composition. It also includes ODE sensitivity analysis (differentiating through ODE solvers), which predates a lot of the fancy “neural ODE” stuff. The price you pay is a weird DSL that you must learn in order to use it, and, unlike many of its trendy peers, it has no GPU support.
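A rough sketch of that symbolic-graph workflow, under the assumption that the SX/Function API works the way I remember:

import casadi as ca

x = ca.SX.sym('x', 2)               # symbolic vector
f = x[0] ** 2 + ca.sin(x[1])        # build an expression graph
J = ca.jacobian(f, x)               # symbolic Jacobian (here, the gradient as a row vector)
J_fun = ca.Function('J', [x], [J])  # compile to an evaluable function
print(J_fun([1.0, 2.0]))            # numeric evaluation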

ADOL

Another classic. ADOL-C is a popular C++ differentiation library with Python bindings. It looks clunky from Python but tenable from C++.

ceres solver

ceres-solver (C++), Google’s least-squares solver, seems to have some good tricks, mostly focussed on least-squares losses.

audi

autodiff, which is usually referred to as audi for the sake of clarity, offers lightweight automatic differentiation for MATLAB. I think MATLAB now has a whole deep-learning toolkit built in, which surely supports something natively in this domain.

Andersson, Joel A. E., Joris Gillis, Greg Horn, James B. Rawlings, and Moritz Diehl. 2019. “CasADi: A Software Framework for Nonlinear Optimization and Optimal Control.” Mathematical Programming Computation 11 (1): 1–36. https://doi.org/10.1007/s12532-018-0139-4.

Baydin, Atilim Gunes, and Barak A. Pearlmutter. 2014. “Automatic Differentiation of Algorithms for Machine Learning,” April. http://arxiv.org/abs/1404.7456.

Baydin, Atilim Gunes, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. 2015. “Automatic Differentiation in Machine Learning: A Survey,” February. http://arxiv.org/abs/1502.05767.

Baydin, Atılım Güneş, Barak A. Pearlmutter, and Jeffrey Mark Siskind. 2016. “Tricks from Deep Learning,” November. http://arxiv.org/abs/1611.03777.

Cao, Y., S. Li, L. Petzold, and R. Serban. 2003. “Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution.” SIAM Journal on Scientific Computing 24 (3): 1076–89. https://doi.org/10.1137/S1064827501380630.

Carpenter, Bob, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. 2015. “The Stan Math Library: Reverse-Mode Automatic Differentiation in C++.” arXiv Preprint arXiv:1509.07164. http://arxiv.org/abs/1509.07164.

Errico, Ronald M. 1997. “What Is an Adjoint Model?” Bulletin of the American Meteorological Society 78 (11): 2577–92. https://doi.org/10.1175/1520-0477(1997)078<2577:WIAAM>2.0.CO;2.

Fike, Jeffrey, and Juan Alonso. 2011. “The Development of Hyper-Dual Numbers for Exact Second-Derivative Calculations.” In 49th AIAA Aerospace Sciences Meeting Including the New Horizons Forum and Aerospace Exposition. Orlando, Florida: American Institute of Aeronautics and Astronautics. https://doi.org/10.2514/6.2011-886.

Fischer, Keno, and Elliot Saba. 2018. “Automatic Full Compilation of Julia Programs and ML Models to Cloud TPUs,” October. http://arxiv.org/abs/1810.09868.

Giles, Mike B. 2008. “Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation.” In Advances in Automatic Differentiation, edited by Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke, 64:35–44. Berlin, Heidelberg: Springer Berlin Heidelberg. http://eprints.maths.ox.ac.uk/1079/.

Gower, R. M., and A. L. Gower. 2016. “Higher-Order Reverse Automatic Differentiation with Emphasis on the Third-Order.” Mathematical Programming 155 (1-2): 81–103. https://doi.org/10.1007/s10107-014-0827-4.

Griewank, Andreas, and Andrea Walther. 2008. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Haro, A. 2008. “Automatic Differentiation Methods in Computational Dynamical Systems: Invariant Manifolds and Normal Forms of Vector Fields at Fixed Points.” IMA Note. http://www.maia.ub.es/~alex/admcds/admcds.pdf.

Hu, Yuanming, Luke Anderson, Tzu-Mao Li, Qi Sun, Nathan Carr, Jonathan Ragan-Kelley, and Frédo Durand. 2020. “DiffTaichi: Differentiable Programming for Physical Simulation,” February. http://arxiv.org/abs/1910.00935.

Hu, Yuanming, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019. “Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures.” ACM Transactions on Graphics 38 (6): 1–16. https://doi.org/10.1145/3355089.3356506.

Innes, Michael. 2018. “Don’t Unroll Adjoint: Differentiating SSA-Form Programs,” October. http://arxiv.org/abs/1810.07951.

Johnson, Steven G. 2012. “Notes on Adjoint Methods for 18.335,” 6.

Laue, Soeren, Matthias Mitterreiter, and Joachim Giesen. 2018. “Computing Higher Order Derivatives of Matrix and Tensor Expressions.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 2750–9. Curran Associates, Inc. http://papers.nips.cc/paper/7540-computing-higher-order-derivatives-of-matrix-and-tensor-expressions.pdf.

Maclaurin, Dougal, David K. Duvenaud, and Ryan P. Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In ICML, 2113–22. http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.

Neidinger, R. 2010. “Introduction to Automatic Differentiation and MATLAB Object-Oriented Programming.” SIAM Review 52 (3): 545–63. https://doi.org/10.1137/080743627.

Neuenhofen, Martin. 2018. “Review of Theory and Implementation of Hyper-Dual Numbers for First and Second Order Automatic Differentiation,” January. http://arxiv.org/abs/1801.03614.

Pasquier, B, and F Primeau. n.d. “The F-1 Algorithm for Efficient Computation of the Hessian Matrix of an Objective Function Defined Implicitly by the Solution of a Steady-State Problem.” SIAM Journal on Scientific Computing, 10. https://www.bpasquier.com/publication/pasquier_primeau_sisc_2019/.

Rall, Louis B. 1981. Automatic Differentiation: Techniques and Applications. Lecture Notes in Computer Science 120. Berlin ; New York: Springer-Verlag.

Revels, Jarrett, Miles Lubin, and Theodore Papamarkou. 2016. “Forward-Mode Automatic Differentiation in Julia,” July. http://arxiv.org/abs/1607.07892.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6088): 533–36. https://doi.org/10.1038/323533a0.

Tucker, Warwick. 2011. Validated Numerics: A Short Introduction to Rigorous Computations. Princeton: Princeton University Press. http://public.eblib.com/choice/publicfullrecord.aspx?p=683309.