Successor to Lua’s torch. Evil twin to Google’s TensorFlow. Intermittently ascendant over TensorFlow amongst researchers, if not in industrial use.

They claim certain fancy applications are easier in pytorch’s dynamic graph construction style, which resembles (in user experience if not implementation details) the dynamic styles of most julia autodiffs, and tensorflow in “eager” mode.

PyTorch has a unique [sic] way of building neural networks: using and replaying a tape recorder. Most frameworks such as TensorFlow, Theano, Caffe and CNTK have a static view of the world. One has to build a neural network, and reuse the same structure again and again. Changing the way the network behaves means that one has to start from scratch. [… Pytorch] allows you to change the way your network behaves arbitrarily with zero lag or overhead.

Of course the overhead is not truly *zero*; rather they have shifted the user overhead around a little so that it is less annoying to change stuff compared to the version of tensorflow that was current at the time they wrote that.
Discounting the hyperbole, pytorch still provides relatively convenient, reasonably efficient autodiff and miscellaneous numerical computing, and, in particular, a massive community.

One surcharge we pay is that they have chosen different names and calling conventions for all the mathematical functions I use than either tensorflow or numpy, who already chose different names than one another (for no good reason as far as I know), so there is pointless friction in swapping between these frameworks. Presumably that is a tactic to engineer a captive audience? Or maybe just bad coordination. idk.

An incredible feature of pytorch is its documentation, which is clear and consistent and somewhat comprehensive. That is hopefully no longer a massive advantage over Tensorflow whose documentation was garbled nonsense when I was using it.

- main website
- source
- sundry hot tips at the incredible pytorch
- Using pytorch while off-grid? Offline docs available at unknownue/PyTorch.docs.

jax for pytorch. Includes many useful things.

Andrej Karpathy summarises Szymon Migacz’s PyTorch Performance Tuning Guide: “good quick tutorial on optimizing your PyTorch code ⏲️”

- `DataLoader` has bad default settings; tune `num_workers > 0` and default to `pin_memory = True`
- use `torch.backends.cudnn.benchmark = True` to autotune cudnn kernel choice
- max out the batch size for each GPU to amortize compute
- do not forget `bias=False` in weight layers before `BatchNorm`s; it’s a no-op that bloats the model
- use `for p in model.parameters(): p.grad = None` instead of `model.zero_grad()`
- careful to disable debug APIs in prod (`detect_anomaly`/`profiler`/`emit_nvtx`/`gradcheck`)
- use `DistributedDataParallel`, not `DataParallel`, even if not running distributed
- careful to load-balance compute on all GPUs if inputs are variably sized, or GPUs will idle
- use an apex fused optimizer (the default PyTorch optim for loop iterates individual params, yikes)
- use checkpointing to recompute memory-intensive compute-efficient ops in the bwd pass (e.g. activations, upsampling, …)
- use `@torch.jit.script`, e.g. esp. to fuse long sequences of pointwise ops like in GELU
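A minimal sketch of a few of these tips, on a toy dataset and model (the settings and sizes are illustrative, not recommendations for any real workload):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset/model purely to illustrate the settings above.
ds = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
loader = DataLoader(
    ds,
    batch_size=64,
    num_workers=0,    # set > 0 in real training; 0 here so the snippet runs anywhere
    pin_memory=True,  # page-locked host memory speeds host-to-GPU copies
)
torch.backends.cudnn.benchmark = True  # autotune cuDNN kernel choice

model = torch.nn.Linear(8, 1)
loss = sum(model(x).sub(y).pow(2).mean() for x, y in loader)
loss.backward()

# Cheaper than model.zero_grad(): drop the grad tensors entirely.
for p in model.parameters():
    p.grad = None
```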

These days, try torch.compile.

MPS backend is supported in torch 2.0 onwards. See the worked example by Thai Tran.

Lovely Tensors pretty-prints pytorch tensors in a manner more informative than the default display.

Was it really useful for you, as a human, to see all these numbers?

What is the shape? The size? What are the statistics? Are any of the values nan or inf? Is it an image of a man holding a tench?

Apparently we use the normal python garbage collector analysis.

A snippet that shows all the currently allocated Tensors:

```
import torch
import gc

for obj in gc.get_objects():
    try:
        if torch.is_tensor(obj) or (hasattr(obj, 'data') and torch.is_tensor(obj.data)):
            print(type(obj), obj.size())
    except Exception:
        pass
```

See also usual python debugging. NB VS Code has integrated pytorch debugging support.

Nb also: if any ipython line or jupyter cell returns a giant tensor, it hangs around in memory.

```
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.current_device()
0
>>> torch.cuda.get_device_name(0)
'GeForce GTX 950M'
```

Leveraging tensorflow’s handy diagnostic GUI, `tensorboard`: now native, via `torch.utils.tensorboard`.

See also the PyTorch Profiler documentation.
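A minimal CPU-only sketch of the built-in `torch.profiler` (the model and sizes are arbitrary; add `ProfilerActivity.CUDA` for GPU runs):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(128, 128)
x = torch.randn(32, 128)

# Profile one forward pass on CPU.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    model(x)

# Summary table of the hottest ops.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```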

Easier: just use lightning if that fits the workflow.

Also I have seen visdom promoted? This pumps graphs to a visualisation server. Not pytorch-specific, but seems well-integrated.

Further generic profiling and logging at the NN-in-practice notebook.

Fiddly. The official way is via ONNX.

```
conda install -c ezyang onnx pydot # or
pip install onnx pydot
```

Then one can use various graphical model diagram things.

```
brew install --cask netron # or
pip install netron
brew install graphviz
```

Also available, pytorchviz and tensorboardX support visualizing pytorch graphs.

`pip install git+https://github.com/szagoruyko/pytorchviz`

```
from torchviz import make_dot

y = model(x)
make_dot(y, params=dict(model.named_parameters()))
```

Handy for matrix factorisation tricks

Einstein convention is supported by pytorch as torch.einsum.
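For instance, a matrix product and a batched trace written in einsum notation:

```python
import torch

A = torch.randn(3, 4)
B = torch.randn(4, 5)

# 'ik,kj->ij' sums over the shared index k: a plain matrix product.
C = torch.einsum('ik,kj->ij', A, B)
assert torch.allclose(C, A @ B)

# A repeated index with no matching output index is summed:
# 'bii->b' sums each matrix's diagonal, i.e. a batched trace.
M = torch.randn(7, 4, 4)
t = torch.einsum('bii->b', M)
assert torch.allclose(t, M.diagonal(dim1=1, dim2=2).sum(-1))
```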

Einops (Rogozhnikov 2022) is more general. It is not specific to pytorch, but the best tutorials are for pytorch.

Note that there was a hyped project, Tensor Comprehensions in PyTorch (see the launch announcement), which apparently compiled the operations to CUDA kernels. It seems to be discontinued.

A linear operator is a generalization of a matrix. It is a linear function that is defined by its application to a vector. The most common linear operators are (potentially structured) matrices, where the function applying them to a vector is a (potentially efficient) matrix-vector multiplication routine.

LinearOperator objects share (mostly) the same API as `torch.Tensor` objects. Under the hood, these objects use `__torch_function__` to dispatch all efficient linear algebra operations to the `torch` and `torch.linalg` namespaces. […] Each of these functions will either return a `torch.Tensor`, or a new `LinearOperator` object, depending on the function.
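The idea can be illustrated without the library: a toy operator defined only by its matrix-vector product. (The class and method names here are invented for illustration; they are not the LinearOperator API.)

```python
import torch

class DiagonalOperator:
    """Toy linear operator: a diagonal matrix stored as a vector.
    Applying it costs O(n) rather than the O(n^2) of a dense matmul."""

    def __init__(self, diag):
        self.diag = diag

    def matvec(self, v):
        # Elementwise multiply == multiplying by diag(self.diag).
        return self.diag * v

    def to_dense(self):
        return torch.diag(self.diag)

d = torch.tensor([1.0, 2.0, 3.0])
op = DiagonalOperator(d)
v = torch.tensor([4.0, 5.0, 6.0])
assert torch.allclose(op.matvec(v), op.to_dense() @ v)
```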

The KeOps library lets you compute reductions of large arrays whose entries are given by a mathematical formula or a neural network. It combines efficient C++ routines with an automatic differentiation engine and can be used with Python (NumPy, PyTorch), Matlab and R.

It is perfectly suited to the computation of kernel matrix-vector products, K-nearest neighbors queries, N-body interactions, point cloud convolutions and the associated gradients. Crucially, it performs well even when the corresponding kernel or distance matrices do not fit into the RAM or GPU memory. Compared with a PyTorch GPU baseline, KeOps provides a x10-x100 speed-up on a wide range of geometric applications, from kernel methods to geometric deep learning.
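To make that concrete, here is the dense-PyTorch baseline for a Gaussian kernel matrix-vector product; KeOps computes the same reduction without ever materialising the N×M kernel matrix (sizes here are arbitrary toy values):

```python
import torch

x = torch.randn(100, 3)  # N source points
y = torch.randn(80, 3)   # M target points
b = torch.randn(80)

# Dense baseline: materialise the full (N, M) kernel matrix, then reduce.
# This is exactly what blows up memory for large N, M.
K = torch.exp(-torch.cdist(x, y) ** 2)  # Gaussian kernel matrix
out = K @ b                             # kernel matrix-vector product

assert out.shape == (100,)
```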

There is (was?) some bad advice in the manual:

`nn` exports two kinds of interfaces — modules and their functional versions. You can extend it in both ways, but we recommend using modules for all kinds of layers that hold any parameters or buffers, and recommend using a functional form for parameter-less operations like activation functions, pooling, etc.

Important missing information: if my desired loss is already just a composition of existing functions, I don’t need to define a `Function` subclass.

And: the given options are not a binary choice, but two things we need to do in concert. A better summary would be:

- If you need a function which is differentiable in a non-trivial way, implement a `Function`
- If you need to bundle a `Function` with some state or updatable parameters, additionally wrap it in an `nn.Module`
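A sketch of that two-step recipe (the function, its clamp threshold, and all names are invented for illustration):

```python
import torch

class ClampedExp(torch.autograd.Function):
    """Step 1: a Function with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        y = torch.exp(x.clamp(max=10.0))
        ctx.save_for_backward(y, x)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        y, x = ctx.saved_tensors
        # Gradient is y where unclamped, zero where we clamped.
        return grad_out * y * (x <= 10.0).to(grad_out.dtype)

class ScaledClampedExp(torch.nn.Module):
    """Step 2: a Module bundling the Function with a learnable parameter."""

    def __init__(self):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.ones(()))

    def forward(self, x):
        return self.scale * ClampedExp.apply(x)

m = ScaledClampedExp()
out = m(torch.tensor([0.0, 1.0], requires_grad=True)).sum()
out.backward()  # flows through both the Parameter and the custom backward
```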

Some people claim you can also create custom layers using plain python functions. However, these don’t work as layers in an `nn.Sequential` model at time of writing, so I’m not sure how to take such advice.

Hessians, stochastic gradients etc. Some of this is handled in modern times by torch.func.

There is some stochastic gradient infrastructure in pyro, in the sense of differentiation through integrals, both classic score methods, reparameterisations and probably others. See, e.g., Storchastic (van Krieken, Tomczak, and Teije 2021).
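The reparameterisation trick in a few lines of plain torch (a toy objective, chosen because the analytic gradient is easy to check):

```python
import torch

torch.manual_seed(0)

# z ~ Normal(mu, sigma), rewritten as z = mu + sigma * eps with eps ~ N(0, 1),
# so that sampling becomes differentiable in (mu, sigma).
mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

eps = torch.randn(100_000)
z = mu + log_sigma.exp() * eps
loss = (z ** 2).mean()   # Monte Carlo estimate of E[z^2] = mu^2 + sigma^2
loss.backward()

# Analytically d E[z^2] / d mu = 2 mu = 1.0; mu.grad should be close
# up to Monte Carlo error.
```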

kazukiosawa/asdfghjkl: ASDL: Automatic Second-order Differentiation (for Fisher, Gradient covariance, Hessian, Jacobian, and Kernel) Library (Osawa et al. 2023)

The library is called ASDL, which stands for Automatic Second-order Differentiation (for Fisher, Gradient covariance, Hessian, Jacobian, and Kernel) Library. ASDL is a PyTorch extension for computing 1st/2nd-order metrics and performing 2nd-order optimization of deep neural networks.

Used in Daxberger et al. (2021).

backpack.pt (Dangel, Kunstner, and Hennig 2019)

Provided quantities include:

- Individual gradients from a mini-batch
- Estimates of the gradient variance or second moment
- Approximate second-order information (diagonal and Kronecker approximations)

Motivation: Computation of most quantities is not necessarily expensive (often just a small modification of the existing backward pass where backpropagated information can be reused). But it is difficult to do in the current software environment.

Documentation mentions the following capabilities: estimates of the variance, the Gauss-Newton diagonal, and the Gauss-Newton KFAC.

Source: f-dangel/backpack.

amirgholami/PyHessian: PyHessian is a Pytorch library for second-order based analysis and training of Neural Networks (Yao et al. 2020):

PyHessian is a pytorch library for Hessian based analysis of neural network models. The library enables computing the following metrics:

- Top Hessian eigenvalues
- The trace of the Hessian matrix
- The full Hessian Eigenvalues Spectral Density (ESD)
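The primitive such libraries build on is a cheap Hessian-vector product via double backprop; a minimal sketch (function and names invented for illustration):

```python
import torch

def hvp(f, x, v):
    """Hessian-vector product H(x) @ v via two backward passes,
    without ever forming the Hessian."""
    x = x.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(f(x), x, create_graph=True)  # gradient, kept in graph
    (Hv,) = torch.autograd.grad(g, x, grad_outputs=v)       # differentiate g . v
    return Hv

f = lambda x: (x ** 2).sum()   # Hessian is 2 * I, so H @ v = 2 v
x = torch.randn(3)
v = torch.randn(3)
assert torch.allclose(hvp(f, x, v), 2 * v)
```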

One can hack the backward gradient to impose regularising penalties, but why not just use one of the pre-rolled ones by Szymon Maszke?

- Welcome to pytorch-optimizers documentation! — pytorch-optimizers 2.8.0 documentation
- jettify/pytorch-optimizer: torch-optimizer -- collection of optimizers for Pytorch

Lezcano/geotorch: Constrained optimization toolkit for PyTorch (Lezcano Casado 2019).

Cooper is a toolkit for Lagrangian-based constrained optimization in Pytorch. This library aims to encourage and facilitate the study of constrained optimization problems in machine learning.

There is a lot to say here; for me at least, probabilistic programming is the killer app of pytorch. Various frameworks do clever probabilistic things, notably pyro.

torchdiffeq has much ODE stuff. google-research/torchsde: differentiable SDE solvers with GPU support and efficient sensitivity analysis.

Generic interpolation in xitorch:

xitorch (pronounced “sigh-torch”) is a library based on PyTorch that provides differentiable operations and functionals for scientific computing and deep learning. xitorch provides analytic first and higher order derivatives automatically using PyTorch’s autograd engine. It is inspired by SciPy, a popular Python library for scientific computing.

NB, works in only one index dimension.

It’s just as well it’s easy to roll your own recurrent nets, because the default implementations are bad.

The default RNN layers are optimised using cuDNN, which is sweet. Probably for that reason we only have a choice of 2 activation functions, and neither of them is “linear”; there is `tanh` and `ReLU`.

A DIY approach might fix this, e.g. if we subclassed RNNCell. Recent pytorch includes JITed RNN which might even make this DIY style performant. I have not used it. Everyone uses transformers these days instead, anyway.
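For instance, a DIY cell with a linear (“no-op”) activation, which the built-in RNN does not offer (the class name and sizes are invented for illustration):

```python
import torch

class LinearRNNCell(torch.nn.Module):
    """Minimal recurrent cell with an identity activation:
    h' = W_ih x + W_hh h (plus biases), no nonlinearity."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = torch.nn.Linear(input_size, hidden_size)
        self.hh = torch.nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h):
        return self.ih(x) + self.hh(h)

cell = LinearRNNCell(4, 8)
h = torch.zeros(1, 8)
for t in range(5):             # unroll over a toy length-5 sequence
    h = cell(torch.randn(1, 4), h)
```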

The default cluster modes of python behave weirdly for pytorch tensors and especially gradients. Pytorch has its own clone of `python.multiprocessing`; see Multiprocessing best practices.

There are libraries built on pytorch which make common tasks easy. I am not a fan of these because they do not seem to help my own tasks.

Lightning is a common training/utility framework for Pytorch.

Lightning is a very lightweight wrapper on PyTorch that decouples the science code from the engineering code. It’s more of a style-guide than a framework. By refactoring your code, we can automate most of the non-research code.

To use Lightning, simply refactor your research code into the LightningModule format (the science) and Lightning will automate the rest (the engineering). Lightning guarantees tested, correct, modern best practices for the automated parts.

- If you are a researcher, Lightning is infinitely flexible; you can modify everything down to the way `.backward` is called or distributed is set up.
- If you are a scientist or production team, lightning is very simple to use with best-practice defaults.

Why do I want to use lightning? Every research project starts the same, a model, a training loop, validation loop, etc. As your research advances, you’re likely to need distributed training, 16-bit precision, checkpointing, gradient accumulation, etc.

Lightning sets up all the boilerplate state-of-the-art training for you so you can focus on the research.

These last two paragraphs constitute a good introduction to the strengths *and* weaknesses of lightning: “Every research project starts the same, a model, a training loop, validation loop” stands in opposition to “Lightning is infinitely flexible”. An alternative description with different emphasis: “Lightning can handle many ML projects that naturally factor into a single training loop but does not help so much for other projects.”

If my project *does* have such a factorisation, Lightning is extremely useful and will do all kinds of easy parallelisation, natural code organisation and so forth.
But if I am doing something like posterior sampling, or nested iterations, or optimisation at inference time, I find myself spending more time fighting the framework than working with it.

If I want the generic scaling up, I might find myself trying one of the generic solutions like Horovod.

Like python itself, much messy confusion is involved in making everything seem tidy and obvious.

The `Trainer` class is hard to understand because it is an object defined across many files and mixins with confusing names.

One useful thing to know is that a `Trainer` has a `model` member which contains the actual `LightningModule` that I am training.

If I subclass `ModelCheckpoint` then I feel like the `on_save_checkpoint` method should be called as often as `_save_model`; but it is not.
TODO: investigate this.

`on_train_batch_end` does not get access to anything output by the batch AFAICT; only the epoch-end callback gets the `output` argument filled in. See the code comments.

I think Catalyst fills a similar niche to lightning? Not sure, have not used it. The Catalyst homepage blurb seems to hit the same notes as lightning with a couple of sweeteners - e.g. it claims to support jax and tensorflow.

fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches. It aims to do both things without substantial compromises in ease of use, flexibility, or performance. This is possible thanks to a carefully layered architecture, which expresses common underlying patterns of many deep learning and data processing techniques in terms of decoupled abstractions. These abstractions can be expressed concisely and clearly by leveraging the dynamism of the underlying Python language and the flexibility of the PyTorch library. fastai includes:

- A new type dispatch system for Python along with a semantic type hierarchy for tensors
- A GPU-optimized computer vision library which can be extended in pure Python
- An optimizer which refactors out the common functionality of modern optimizers into two basic pieces, allowing optimization algorithms to be implemented in 4–5 lines of code
- A novel 2-way callback system that can access any part of the data, model, or optimizer and change it at any point during training
- A new data block API

- DSP
- I am thinking especially of audio. Keunwoo Choi produced some beautiful examples, e.g. Inverse STFT, Harmonic Percussive separation.

Today we have torchaudio or, from Dorien Herremans’ lab, nnAudio (Source), which is similar but has fewer dependencies.

- NLP
- Like other deep learning frameworks, there is some basic NLP support in pytorch; see pytorch.text.

flair is a commercially-backed NLP framework.

I do not do much NLP but if I did I might use the helpful utility functions in AllenNLP.

allenai/allennlp: An open-source NLP research library, built on PyTorch.

- Computer vision
- In addition to the natively supported torchvision, there is Kornia, a differentiable computer vision library for pytorch. It includes such niceties as differentiable image warping via the grid_sample thing.
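For instance, `torch.nn.functional.grid_sample` (the primitive underlying such differentiable warps), here with an identity sampling grid on a toy image so the output provably reproduces the input while gradients flow through the warp:

```python
import torch
import torch.nn.functional as F

# Toy 4x4 single-channel "image", shape (N, C, H, W).
img = torch.arange(16.0).reshape(1, 1, 4, 4).requires_grad_(True)

# Build the identity sampling grid in grid_sample's [-1, 1] coordinates:
# grid[..., 0] is x (width), grid[..., 1] is y (height).
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, 4), torch.linspace(-1, 1, 4), indexing="ij"
)
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)  # (N, H, W, 2)

warped = F.grid_sample(img, grid, align_corners=True)
warped.sum().backward()                 # gradients flow through the warp

assert torch.allclose(warped, img)      # identity grid reproduces the image
```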

Baydin, Atılım Güneş, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, et al. 2019. “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale.” In *arXiv:1907.03382 [Cs, Stat]*.

Charlier, Benjamin, Jean Feydy, Joan Alexis Glaunès, François-David Collin, and Ghislain Durif. 2021. “Kernel Operations on the GPU, with Autodiff, Without Memory Overflows.” *Journal of Machine Learning Research* 22 (74): 1–6.

Cheuk, Kin Wai, Kat Agres, and Dorien Herremans. 2019. “nnAUDIO: A Pytorch Audio Processing Tool Using 1d Convolution Neural Networks,” 2.

Dangel, Felix, Frederik Kunstner, and Philipp Hennig. 2019. “BackPACK: Packing More into Backprop.” In *International Conference on Learning Representations*.

Daxberger, Erik, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. 2021. “Laplace Redux — Effortless Bayesian Deep Learning.” In *arXiv:2106.14806 [Cs, Stat]*.

Immer, Alexander, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Khan Mohammad Emtiyaz. 2021. “Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning.” In *Proceedings of the 38th International Conference on Machine Learning*, 4563–73. PMLR.

Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. “Improving Predictions of Bayesian Neural Nets via Local Linearization.” In *International Conference on Artificial Intelligence and Statistics*, 703–11. PMLR.

Krieken, Emile van, Jakub M. Tomczak, and Annette ten Teije. 2021. “Storchastic: A Framework for General Stochastic Automatic Differentiation.” In *arXiv:2104.00428 [Cs, Stat]*.

Le, Tuan Anh, Atılım Güneş Baydin, and Frank Wood. 2017. “Inference Compilation and Universal Probabilistic Programming.” In *Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 54:1338–48. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR.

Lezcano Casado, Mario. 2019. “Trivializations for Gradient-Based Optimization on Manifolds.” In *Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Osawa, Kazuki, Satoki Ishikawa, Rio Yokota, Shigang Li, and Torsten Hoefler. 2023. “ASDL: A Unified Interface for Gradient Preconditioning in PyTorch.” arXiv.

Rogozhnikov, Alex. 2022. “Einops: Clear and Reliable Tensor Manipulations with Einstein-Like Notation,” 21.

Smith, Daniel G. A., and Johnnie Gray. 2018. “Opt_einsum - A Python Package for Optimizing Contraction Order for Einsum-Like Expressions.” *Journal of Open Source Software* 3 (26): 753.

Yao, Zhewei, Amir Gholami, Kurt Keutzer, and Michael Mahoney. 2020. “PyHessian: Neural Networks Through the Lens of the Hessian.” In *arXiv:1912.07145 [Cs, Math]*.

Boaz Barak has a miniature dictionary for statisticians:

I’ve always been curious about the statistical physics approach to problems from computer science. The physics-inspired algorithm survey propagation is the current champion for random 3SAT instances, statistical-physics phase transitions have been suggested as explaining computational difficulty, and statistical physics has even been invoked to explain why deep learning algorithms seem to often converge to useful local minima.

Unfortunately, I have always found the terminology of statistical physics, “spin glasses”, “quenched averages”, “annealing”, “replica symmetry breaking”, “metastable states” etc… to be rather daunting

Jaan Altosaar’s guided translation is great.

There is a deep analogy between statistical inference and statistical physics; I will give a friendly introduction to both of these fields. I will then discuss phase transitions in two problems of interest to a broad range of data sciences: community detection in social and biological networks, and clustering of sparse high-dimensional data. In both cases, if our data becomes too sparse or too noisy, it suddenly becomes impossible to find the underlying pattern, or even tell if there is one. Physics both helps us locate these phase transitions, and design optimal algorithms that succeed all the way up to this point. Along the way, I will visit ideas from computational complexity, random graphs, random matrices, and spin glass theory.

There is an overview lecture by Thomas Orton, which cites lots of the good stuff:

Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.

See Igor Carron’s “phase diagram” list, and stuff like (Oymak and Tropp 2015). Likely there are connections to Erdős-Rényi giant components and other complex network things in probabilistic graph learning. Read (Barbier 2015; Poole et al. 2016).

See also evolution, game theory.

Gentle intro lecture by John Baez, Biology as Information Dynamics.

See(Baez 2011;Harper 2009;Shalizi 2009;Sinervo and Lively 1996).

Neel Nanda, Tom Lieberum,A Mechanistic Interpretability Analysis of Grokking

Grokking (Power et al. 2022) is a recent phenomenon discovered by OpenAI researchers, that in my opinion is one of the most fascinating mysteries in deep learning: models trained on small algorithmic tasks like modular addition will initially memorise the training data, but after a long time will suddenly learn to generalise to unseen data.

This is a write-up of an independent research project I did into understanding grokking through the lens of mechanistic interpretability.

My most important claim is that grokking has a deep relationship to phase changes. Phase changes, i.e. a sudden change in the model’s performance for some capability during training, are a general phenomenon that occurs when training models, and have also been observed in large models trained on non-toy tasks. For example, the sudden change in a transformer’s capacity to do in-context learning when it forms induction heads. In this work I examine several toy settings where a model trained to solve them exhibits a phase change in test loss, regardless of how much data it is trained on. I show that if a model is trained on this limited data with high regularisation, then the model shows grokking.

Achlioptas, Dimitris, and Amin Coja-Oghlan. 2008. “Algorithmic Barriers from Phase Transitions.” *arXiv:0803.2122 [Math]*, October, 793–802.

Baez, John C. 2011. “Renyi Entropy and Free Energy,” February.

Bahri, Yasaman, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and Surya Ganguli. 2020. “Statistical Mechanics of Deep Learning.” *Annual Review of Condensed Matter Physics* 11 (1): 501–28.

Baldassi, Carlo, Christian Borgs, Jennifer T. Chayes, Alessandro Ingrosso, Carlo Lucibello, Luca Saglietti, and Riccardo Zecchina. 2016. “Unreasonable Effectiveness of Learning Neural Networks: From Accessible States and Robust Ensembles to Basic Algorithmic Schemes.” *Proceedings of the National Academy of Sciences* 113 (48): E7655–62.

Barbier, Jean. 2015. “Statistical Physics and Approximate Message-Passing Algorithms for Sparse Linear Estimation Problems in Signal Processing and Coding Theory.” *arXiv:1511.01650 [Cs, Math]*, November.

Barbier, Jean, Florent Krzakala, Nicolas Macris, Léo Miolane, and Lenka Zdeborová. 2017. “Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models.” *arXiv:1708.03395 [Cond-Mat, Physics:math-Ph]*, August.

Braunstein, A., M. Mezard, and R. Zecchina. 2002. “Survey Propagation: An Algorithm for Satisfiability.” *arXiv:cs/0212002*, December.

Castellani, Tommaso, and Andrea Cavagna. 2005. “Spin-Glass Theory for Pedestrians.” *Journal of Statistical Mechanics: Theory and Experiment* 2005 (05): P05012.

Catoni, Olivier. 2007. “PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning.” *IMS Lecture Notes Monograph Series* 56: 1–163.

Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018. “Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In *arXiv:1709.03698 [Cs, Stat]*.

Choromanska, Anna, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. 2015. “The Loss Surfaces of Multilayer Networks.” In *Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics*, 192–204.

“Complexity, Entropy and the Physics of Information: The Proceedings of the 1988 Workshop on Complexity, Entropy, and the Physics of Information.” 1990. Addison-Wesley Pub. Co.

Haber, Eldad, and Lars Ruthotto. 2018. “Stable Architectures for Deep Neural Networks.” *Inverse Problems* 34 (1): 014004.

Harper, Marc. 2009. “The Replicator Equation as an Inference Dynamic,” November.

Hasegawa, Yoshihiko, and Tan Van Vu. 2019. “Uncertainty Relations in Stochastic Processes: An Information Inequality Approach.” *Physical Review E* 99 (6): 062126.

Hayou, Soufiane, Arnaud Doucet, and Judith Rousseau. 2019. “On the Impact of the Activation Function on Deep Neural Networks Training.” In *Proceedings of the 36th International Conference on Machine Learning*, 2672–80. PMLR.

Krzakala, Florent, Lenka Zdeborova, Maria Chiara Angelini, and Francesco Caltagirone. n.d. “Statistical Physics of Inference and Bayesian Estimation,” 44.

Lang, Alex H., Charles K. Fisher, Thierry Mora, and Pankaj Mehta. 2014. “Thermodynamics of Statistical Inference by Cells.” *Physical Review Letters* 113 (14).

Lin, Henry W., and Max Tegmark. 2016a. “Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language.” *arXiv:1606.06737 [Cond-Mat]*, June.

———. 2016b. “Why Does Deep and Cheap Learning Work so Well?” *arXiv:1608.08225 [Cond-Mat, Stat]*, August.

Mandelbrot, Benoit. 1962. “The Role of Sufficiency and of Estimation in Thermodynamics.” *The Annals of Mathematical Statistics* 33 (3): 1021–38.

Marsland, Robert, and Jeremy England. 2018. “Limits of Predictions in Thermodynamic Systems: A Review.” *Reports on Progress in Physics* 81 (1): 016601.

Mehta, Pankaj, and David J. Schwab. 2014. “An Exact Mapping Between the Variational Renormalization Group and Deep Learning.” *arXiv:1410.3831 [Cond-Mat, Stat]*, October.

Moore, Cristopher. 2017. “The Computer Science and Physics of Community Detection: Landscapes, Phase Transitions, and Hardness.” *Bulletin of the EATCS*, February.

Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” arXiv.

Oymak, Samet, and Joel A. Tropp. 2015. “Universality Laws for Randomized Dimension Reduction, with Applications.” *arXiv:1511.09433 [Cs, Math, Stat]*, November.

Pavon, Michele. 1989. “Stochastic Control and Nonequilibrium Thermodynamical Systems.” *Applied Mathematics and Optimization* 19 (1): 187–202.

Poole, Ben, Subhaneil Lahiri, Maithreyi Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. 2016. “Exponential Expressivity in Deep Neural Networks Through Transient Chaos.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3360–68. Curran Associates, Inc.

Power, Alethea, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.” arXiv.

Roberts, Daniel A., Sho Yaida, and Boris Hanin. 2021. “The Principles of Deep Learning Theory.” *arXiv:2106.10165 [Hep-Th, Stat]*, August.

Ruthotto, Lars, and Eldad Haber. 2020. “Deep Neural Networks Motivated by Partial Differential Equations.” *Journal of Mathematical Imaging and Vision* 62 (3): 352–64.

Schoenholz, Samuel S., Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. 2017. “Deep Information Propagation.” In.

Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” *Electronic Journal of Statistics* 3: 1039–74.

Shwartz-Ziv, Ravid, and Naftali Tishby. 2017. “Opening the Black Box of Deep Neural Networks via Information.” *arXiv:1703.00810 [Cs]*, March.

Sinervo, B., and C. M. Lively. 1996. “The Rock–Paper–Scissors Game and the Evolution of Alternative Male Strategies.” *Nature* 380 (6571): 240.

Székely, Gábor J., and Maria L. Rizzo. 2017. “The Energy of Data.” *Annual Review of Statistics and Its Application* 4 (1): 447–79.

Wolpert, David. 2017. “Constraints on Physical Reality Arising from a Formalization of Knowledge,” November.

Wolpert, David H. 2006. “Information Theory — The Bridge Connecting Bounded Rational Game Theory and Statistical Physics.” In *Complex Engineered Systems*, 262–90. Understanding Complex Systems. Springer Berlin Heidelberg.

Wolpert, David H. 2008. “Physical Limits of Inference.” *Physica D: Nonlinear Phenomena*, Novel Computing Paradigms: Quo Vadis?, 237 (9): 1257–81.

———. 2018. “Theories of Knowledge and Theories of Everything.” In *The Map and the Territory: Exploring the Foundations of Science, Thought and Reality*, 165. Cham: Springer.

———. 2019. “Stochastic Thermodynamics of Computation,” May.

Wolpert, David H., Artemy Kolchinsky, and Jeremy A. Owen. 2017. “A Space-Time Tradeoff for Implementing a Function with Master Equation Dynamics,” August.

Zdeborová, Lenka, and Florent Krzakala. 2016. “Statistical Physics of Inference: Thresholds and Algorithms.” *Advances in Physics* 65 (5): 453–552.

Not the aquatic creatures, but rather the command-line doohickey,
which is not as shit as the other ones.
I’m gradually transitioning to `fish`, after accidentally losing a lot of precious data due to a quirk in `bash` syntax. Long boring story.
It’s time for *new, exciting, different* stupid errors.

`fish` has a strong fanbase and an opinionated design.
If you dislike those design opinions, at least you might appreciate that it has a
healthy degree of sarcasm with which said opinions are expressed, which sarcasm is sorely absent from the drearily earnest nerdview of your typical gnu.org project.
You might also hold that having *any* kind of principled opinion is better than the
design-by-accumulation-of-tradition-cruft which structures command-line
shells.

- Ubuntu users can get an updated fish PPA.
- or, macOS/Linux: install using homebrew

gararine reports how to make `fish` the default shell with homebrew:

```
# Check the fish path with `which fish`. In this example it was /opt/homebrew/bin/fish.
# Add fish to the known shells
echo (which fish) | sudo tee -a /etc/shells
# Set fish as the default shell
chsh -s (which fish)
# Add brew binaries to the fish path
fish_add_path /opt/homebrew/bin
# To collect command completions for all commands run:
fish_update_completions
```

This is the most confusing thing in fish for me. It is worth reading tutorials.

**tl;dr**:

To add a path, use the utility command `fish_add_path`:

`fish_add_path /usr/local/bin`

To remove a path:

`set PATH (string match -v /usr/local/bin $PATH)`

Adding a path? Say it’s `/usr/local/bin`. Put

`set -gx PATH /usr/local/bin $PATH`

in `~/.config/fish/config.fish`, OR

`set -U fish_user_paths /usr/local/bin $fish_user_paths`

Removing a path?

`set -gx PATH (string match -v /usr/local/bin $PATH)`

🏗 explain the difference between `$PATH` and `$fish_user_paths`, which
will depend upon me understanding how the content of `$PATH` magically replenishes
itself and the difference between “universal” and “global” variables.
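My working understanding of that difference, for what it is worth: a global (`set -g`) variable lives only in the current fish process and must be re-created by `config.fish` each session, while a universal (`set -U`) variable is written once to `~/.config/fish/fish_variables` and shared with every fish session, present and future. A sketch:

```
# global: gone when this fish exits; persist it via config.fish
set -gx EDITOR vim
# universal: stored on disk, synced to all running fish instances
set -U fish_user_paths /usr/local/bin $fish_user_paths
# inspect which scope(s) a variable actually lives in
set --show fish_user_paths
```

`set --show` is handy when a universal variable is being shadowed by a global one.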

`fish_config`

Put commands in `~/.config/fish/config.fish`.

Aelius notes a hack to unify config:

one of the things I like about fish is how there are sane defaults and I don’t need to have any config. Which works for me, because I have no interest in learning fish syntax. I just want a helpful shell, I don’t want to have to know yet another language, and I deeply resent fish every time it doesn’t process the line of posix sh I paste into it from a wiki…

After jumping between several different shells and rewriting my `.profile` a number of different times for a number of different shells, I came up with a way to decouple my environment config from the shell I use. My environment always works, and I don’t have to learn fish or any other syntax. I set `/bin/dash` as my login shell. The first line of my `~/.profile` is `ENV=$HOME/.shinit; export ENV`. In any interactive shell, dash executes `~/.shinit`, which contains one line: `exec /usr/bin/fish`. Every config item I need from my shell goes into `~/.profile`, written in easy, conventional posix sh, and I still get to use fish as my interactive shell, without having to go through the trouble of adapting its config to my system.
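Pieced together, the layout Aelius describes would look something like this (my reconstruction; the paths are assumptions):

```
# ~/.profile (read by dash at login; plain POSIX sh)
ENV=$HOME/.shinit; export ENV
export PATH="$HOME/bin:$PATH"    # all environment config lives here

# ~/.shinit (run by dash for interactive shells)
exec /usr/bin/fish
```

Since fish inherits its environment from the dash process that `exec`s it, everything exported in `.profile` is visible inside fish.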

`ssh-agent`

Optiligence notes that this minor alteration should work.

`eval (ssh-agent -c)`

Alternatively, see danhper/fish-ssh-agent or ivakyb/fish_ssh_agent.

Installation:

`wget https://gitlab.com/kyb/fish_ssh_agent/raw/master/functions/fish_ssh_agent.fish -P ~/.config/fish/functions/`

Append the next line to `~/.config/fish/config.fish`:

`fish_ssh_agent`

You *really need* to verify that `https://gitlab.com/kyb/fish_ssh_agent/raw/master/functions/fish_ssh_agent.fish` is not anything malicious; this is high-security code.

To be more secure, we can get a known-good (IMO) version thusly:

`wget https://github.com/ivakyb/fish_ssh_agent/raw/e09c21501c20730634ab80d6bc9329335eabe065/functions/fish_ssh_agent.fish -P ~/.config/fish/functions/`

You can hack fish. Popular plugin management systems exist also. AFAICT a passable default is oh my fish:

`curl -L https://get.oh-my.fish | fish`

fisher seems to be around too?

`curl -sL https://git.io/fisher | source && fisher install jorgebucaran/fisher`

It also handles `omf` plugins, apparently.

The `omf` manual is brusque.
See a helpful blogpost.
I am currently running omf, but since it intrusively changed my prompt I am grumpy at it.
However, I got a better prompt, spacefish.

`omf install spacefish`

It does need the wacky powerline fonts.

`sudo apt-get install fonts-powerline`

There are various useful plugins that are not purely cosmetic; for example, fzf adds fuzzy history search, and z does recency/frequency-based directory navigation.

Since I use `fish` shell as my default but ubuntu automatically executes the `bash` startup script `.profile` on login, I ran into the following errors when it tried to run the `fish` init in a `bash` process when I used homebrew:

```
bash: set: -g: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
bash: set: -g: invalid option
...
```

This is maybe related to an intermittently reported bug in homebrew.
The fix that worked for me was to change the automatically-added line in `.profile` to be

`eval $(SHELL=bash /bin/brew shellenv)`

and to add

`eval (env SHELL=fish /bin/brew shellenv)`

to `~/.config/fish/config.fish`.

If I used `virtualenv` on python I would need virtualfish to replace python’s `virtualenvwrapper.sh`.
Or switch to native python3 `venv`, which is more or less the same thing but works better and doesn’t support python 2.
But if you need to support python 2 at this stage it’s because you are
in some weird enterprise environment with horrid legacy software, so hopefully you can farm this problem out to the tech support team?
Either that or you are barred from using `fish` by policy and this is not a problem.

You need to do some extra setup to use conda with fish:

`~/miniconda3/bin/conda init fish`

Or, for older versions, put

`source ~/miniconda3/etc/fish/conf.d/conda.fish`

into `~/.config/fish/config.fish`.

(Replace `~/miniconda3/` with the output of `conda info --root` if you used a non-standard install location.)

Wildcards are minimal: just `*`, `**`, and brace expansion, `mv a.{txt,html}`.

For more sophisticated string processing, one defines custom functions (which I never actually do) or uses classic subcommands, which I do all the time. Usually I want to rename files:

```
for file in *.html
    mv $file (basename $file .html).txt
end
```

or expansion via `string`, which is harder to remember:

```
for file in *.html
    mv $file (string replace -r '\.html$' .txt $file)
end
```
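Newer fish versions (3.5+, I believe) also ship a `path` builtin, which makes the extension swap less regex-y; a sketch:

```
for file in *.html
    mv $file (path change-extension txt $file)
end
```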

Or use another fancy utility like rename.

```
while true
    echo "Loop forever"
end
```

```
if test $status -eq 0
    echo yeah
end
```

To execute the second command only if the first succeeded, the command you want is `and`, which is hard to google for:

`../bin/something.sh foo; and cp foo ~/Dropbox/`
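For completeness: the failure branch is `or`, and recent fish (3.0+) also accepts `&&`/`||` directly. A sketch, with a hypothetical `build.sh`:

```
./build.sh; and echo "built ok"; or echo "build failed"
```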

`env`

`env FOO=BAR baz.sh`

How to use nvm, rbenv, pyenv, goenv... with the fish shell

Deleting history interactively:

`history delete`

```
while true
    if test (count (jobs)) -lt 5
        dostuff &
    else
        wait # to wait for _any_ job
    end
end
```

NB that runs at most 5 jobs at a time. Nifty.

- Laptop overheating/ OOM
- ⌘-tab app switcher appears on the wrong screen
- Menu bar is full
- OCR from screen
- triald
- Help I need to use a Windows keyboard
- Gatekeeper
- Desktop wallpaper
- Meta tips
- iMessage doesn’t know how country prefixes work any longer
- Desktop weblinks
- Dock disconnects hard drives when mac sleeps
- Error beeps are laptop farts
- Typing emoji
- Why do files open when I click on them too long?
- `ssh-agent` died
- Preview modern image formats
- Respond to modal dialog boxes without mouse
- Trusting homebrew casks
- Convert selected text
- Usable scripting
- iTunes never finishes syncing my phone
- Creating GUIs for shell scripts
- Automatically anonymized system status
- Change default shell
- XCode
- Run out of file handles
- `plist` files are opaque binary messes
- Which file is crashing/hanging `$PID`?
- Bootloadey whodangle
- Networking from the command line
- Focus stealing
- Installing skype
- Stop gamed and other processes leaking your data and wasting precious network sockets for no reason
- macOS claims it forgot my email/contacts/calendar password again
- Using Chromium (“open-source chrome”)
- Reset semaphores
- Time machine
- Re-time stupid alarms
- Concatenate PDFs
- To file

See also command lines it is tedious to remember for general POSIX commands.

⚠️ Many of the commands mentioned here are supposed to be run as sudo root, and each may irremediably ruin your computer and your life, and soil everything you have ever loved. Then it might challenge you to a break dance battle, I dunno. Certainly, some of these commands have done some of that to me.

If any such adverse circumstance should eventuate, it will not be my responsibility.
The only guarantee I provide here is that*some stuff helped me at least one time*.

Many of these were filesystem-specific hacks, so I broke those out into a specific filesystem-specific hacks section. Others were about easing the pains of using macOS in the low-bandwidth world or macOS server and have their own respective notebooks.

The remaining tips are arranged so that the further you get down the list the longer it has been since I have needed to know it; the later ones probably don’t work on modern macOS.

There are some real-time monitors that let us know if we are OOM, too hot etc.

- iStatistica (Famous, good, surprisingly expensive)
- exelban/stats (Free, pretty good)
- iGlance (Free, possibly abandoned)

These do not prevent problems, but at least they*disambiguate* problems.

We can make the task switcher appear on all monitors, which is IMO what everyone apart from VJs should be doing.

```
defaults write com.apple.Dock appswitcher-all-displays -bool true
killall Dock
```

macOS supports an OCR system called Live Text, which notionally means any time I can see text on the screen I should be able to copy and paste it (e.g. I should be able to use links that I see in a video chat). However, only some apps seem to support this feature, so I end up doing circuitous round trips from screenshots to supported apps. There are better ways.

Markus Schappi’s macOCR: Get any text on your screen into your clipboard.

```
# install
brew install schappim/ocr/ocr
# invoke
ocr
```

This one is an adequate solution for me because I have a terminal open 100% of the time. It can be invoked by a shortcut, which is super easy.

I have also noticed Extract Text from a Screenshot with Shortcuts, which might help terminal-shy people, but I cannot work out where to download it. ltxlouis devised a Siri-based solution but that just leads me to a download page for an app that claims that Siri does not work on my laptop, which is weird because Siri is RIGHT THERE. Tapping out of that.

Before Live Text, there were various, more onerous, options. See macos ocr script hacks via FastScripts 3.

There are a lot of keys on a Windows keyboard that are useless to me. macOS has a fairly limited repertoire of ways for me to mutate the key assignment natively. How do I remap that Outlook button to be a Command key?

We can use `hidutil` as per Technical Note TN2450: Remapping Keys in macOS 10.12 Sierra, but that involves some complicated key usage codes that we must find by circuitous, error-prone means.
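Those circuitous means end up looking something like this (a sketch based on TN2450; the hex values are USB HID usage codes, here mapping Caps Lock (0x39) to Left Command (0xE3), and you would have to hunt down the code your Outlook key actually emits):

```
hidutil property --set '{"UserKeyMapping":
    [{"HIDKeyboardModifierMappingSrc": 0x700000039,
      "HIDKeyboardModifierMappingDst": 0x7000000E3}]}'
```

As far as I can tell the mapping does not survive a reboot unless you re-run it from a LaunchAgent.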

An alternate solution is to use Karabiner Elements. This seems to solve all imaginable keyboard problems.

- Karabiner-Elements
- pqrs-org/Karabiner-Elements: Karabiner-Elements is a powerful utility for keyboard customization on macOS Sierra (10.12) or later.

⚠️ I do not know how trustworthy this software is, but it seems to require high level of system access, including continuous keystroke monitoring, so be careful.

An ongoing mess of opaque trade-offs between security, privacy and usability.

Various scrappy fan communities make animated wallpaper.

Here is a command-line app for stitching together “dynamic” (i.e. time-of-day-sensitive) wallpaper: mczachurski/wallpapper: Console application for creating dynamic wallpapers for macOS Mojave and newer. Dynamic Wallpaper Club hosts a user-generated gallery of pretty things and also instructions.

Too much effort? For pre-rolled wallpapers for the busy, see

- Unsplash Wallpapers (free)
- 24 Hour Wallpaper (AUD 15; very pretty but a limited selection, with too much coastal panorama for my taste, as a landlubber)

Mr Bishop’s Awesome macOS Command Line lists how to fix a great many things from the keyboard. Previously on github, but that is now mostly a page about his grumpiness at lazy github drive-by contributors.

A sampling:

```
# screenshot
screencapture -T 3 -t jpg -P delayedpic.jpg
# quicklook preview
qlmanage -p /path/to/file
# Convert audio file to iPhone ringtone
afconvert input.mp3 ringtone.m4r -f m4af
```

Apparently we now need to manually add country prefixes to any phone numbers in our inbox that arrive without country prefixes? Here is a script to add country codes to OS X Address Book entries:

```
tell application "Contacts"
    repeat with eachPerson in people
        repeat with eachNumber in phones of eachPerson
            set theNum to (get value of eachNumber)
            if (theNum does not start with "+" and theNum does not start with "61" and theNum starts with "0") then
                -- log "+61" & (get text 2 thru (length of theNum) of theNum)
                set value of eachNumber to "+61" & (get text 2 thru (length of theNum) of theNum)
            end if
        end repeat
    end repeat
    save
end tell
```

Note that it only works in versions of macOS before 12.1, because that is when Apple broke scripting for the Contacts app.

Safari will do this if I drag a URL onto the desktop.
Is there a supported way of doing it from other browsers?
I have no idea, but I can save a file with a `.webloc` extension in my favourite code editor and it works fine.
Template:

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>URL</key>
<string>https://example.com/etc</string>
</dict>
</plist>
```

Thunderbolt 3 dock disconnects when MacBook sleeps:

Given that the computer sleep seems to be the problem, I turned off the "Power Nap", and it seems to be working for me now.

- System Preferences > Energy Saver
- Power Adapter tab
- Uncheck “Enable Power Nap while plugged into a power adapter”

- Open System Preferences.
- Click the Energy Saver icon.
- By default the 'Put hard disks to sleep when possible' option will be selected. Uncheck this option.

I hate user interface beeps in general. I REALLY hate them being played promiscuously through arbitrary bluetooth devices. I have done my best to turn them off.

I do not want them.
If they really must play themselves, they can play on my laptop’s internal speakers thank you very much.
Even better would be the laptop speakers of someone else.
Maybe a convicted criminal who needs this kind of low-level irritation as part of their state-mandated punishment?
Or Apple bluetooth devs needing operant conditioning.
I*particularly* do not want beeps on audio outputs to which they are not invited and have never been invited.

macOS audio alerts, though, are rude intruders. For me, every time a new bluetooth audio device or HDMI device is connected (or reconnects because a bird flies past or an angel sighs), macOS will use it for error beeps and miscellaneous notifications. The OS is a sex pest when it comes to bluetooth devices, constantly subjecting them to non-consensual error-beep interference. This is incredibly uncomfortable to be around.

Or: A macos beep is a guy who gets in the elevator with me, farts, then leaves again before the door closes.
Why did that guy come here just to fart?
He did not get anything out of it.
Could he not have farted somewhere else?
I could have tolerated him farting if he had not made it weird, but now it’s weird because he came in here and farted pointedly*at* me so I’m kind of*obligated* to be offended I guess?

Here are two apps that claim to prevent macOS from imposing its error beeps on random devices:

- ~~Audio Profile Manager~~ (USD5). Audio Profile Manager does not do this thing. I tried. It managed other system sounds OK but apparently cannot stop system alert beeps playing from inappropriate devices. macOS will still frantically switch to whichever bluetooth device is most irritating.
- SoundSource (USD43). I am not yet psychologically prepared to spend USD43 because something in my heart feels there must be another way.

I should probably secure my mac. macOS Security and Privacy Guide.

As seen in typography. **tl;dr**: `Cmd Ctrl Space`.

I didn’t double click, honest guv.

That would be spring loading, which one can turn off.

`ssh-agent` died

```
sudo launchctl unload /System/Library/LaunchDaemons/ssh.plist
sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist
```

Which worked for me. It is not ideal to require root permissions to share user keys.

`influx6` asserts we can Restart SSH on Mac Terminal like this (also requiring root permission, obviously):

```
sudo launchctl stop com.openssh.sshd
sudo launchctl start com.openssh.sshd
```

WebP support is available via WebPQuickLook:

`brew install --cask WebPQuickLook`

Like many modern casks you might need to assert trust for the developer:

`brew install --cask WebPQuickLook --no-quarantine`

qlImagesize claims more features, but I have not tried it.

iPreview (its App Store blurb translates as “a powerful, easy-to-use Quick Look extension”) enhances quicklook with a variety of image formats including AVIF, webp and eps.

dreampiggy/AVIFQuickLook: AVIF QuickLook plugin on macOS is free and handles AVIF.

```
brew install avifquicklook
xattr -d -r com.apple.quarantine ~/Library/QuickLook/AVIFQuickLook.qlgenerator
```

Sometimes I do not wish to click on buttons.
Tab-navigation of dialogs is disabled by default, but there is a keyboard shortcut setting
which enables it: **System Preferences → Keyboard → Shortcuts → Full Keyboard Access... → All Controls**.

`brew install --cask WebPQuickLook --no-quarantine`

Setting `HOMEBREW_CASK_OPTS` could automatically approve some apps, although then you are casting aside a useful security feature.

`export HOMEBREW_CASK_OPTS="--no-quarantine"`

e.g. to upper case/lower case/HTML/markdown. There was this “services menu” technology which looked like it was going to do this for a while, but that is no longer fashionable? 🤷♂ These days it seems one uses a third party app.
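Terminal-dwellers can fake a subset of this via the clipboard: copy the text, filter it, paste it back. A sketch for the upper-casing case, using macOS’s `pbpaste`/`pbcopy`:

```
pbpaste | tr '[:lower:]' '[:upper:]' | pbcopy
```

The same pipeline shape works for any filter, e.g. pandoc for the HTML-to-markdown conversion.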

Apple has shipped a slow, incomprehensible scripting language for as long as I have been aware of them.
It’s called *Applescript*.
After much time trying to persuade it to do things for me I feel comfortable
asserting that using it will never save me time, although occasionally I execute
a one-liner via osascript.
There are some attempts
to in-principle open up macOS’s scripting APIs to, e.g., javascript by jstalk
or python via PyObjC,
but that’s a brutally low level to do things from unless you are an app
developer, and the documentation is even worse (and then why not just use Swift?)

Some people seem to have converged on the UIKit accessibility API as a reasonable interface to script.

On this front, try Hammerspoon/Phoenix. pyatomac does the same for python but seems designed for UI testers, not for UI users. You can do it with Applescript of course, if your heart runneth over with bituminous hate.

It stays stuck on “importing photos”?
Because you don’t use iCloud, right?
For one thing, why would you voluntarily put your private photos in the hands of some notoriously secretive third party?
For another, even if you wanted to, if you live in a bandwidth-poor country,
iCloud sync is not just bad, it’s*comedically* bad.
Atrociously, OS-floggingly slow and glitchy. “iClod”, let us call it.
So you sync using a cable and iTunes, of course.

Except that, every couple of days, that breaks. Here’s how to fix it. (Note that this question refers to “iPhoto”, but the same bug has been faithfully carried over and reproduced in Aperture and Photos by diligent Apple devs, and the same fix works.)

Quit iTunes

`rm -rf ~/Pictures/Photos\ Library.photoslibrary/iPod\ Photo\ Cache/`

Note that this will reset iTunes to*not* sync your images, so you might need to
reconstruct your settings.

Do that, try again.

e.g. for debugging, or for sharing with a support person. Etrecheck.

If you are reading this you are enough of a geek to need xcode:

Run this to install all of XCode or just the command line tools:

`xcode-select --install`

Xcode CLI breaks after every update. The current incantations to fix it are given in the following links:

```
xcodebuild -runFirstLaunch
# or remove old CommandLineTools
# to force upgrade
sudo rm -rf /Library/Developer/CommandLineTools
sudo xcode-select --install
# or maybe
sudo xcode-select -s /Library/Developer/CommandLineTools
```

Hmm, who knows how this works on the latest versions?

But the traditional advice is:

`ulimit -S -n 2048`

`wget https://github.com/wilsonmar/mac-setup/raw/master/configs/limit.maxproc.plist https://github.com/wilsonmar/mac-setup/raw/master/configs/limit.maxfiles.plist`

`plist` files are opaque binary messes

This is how to convert them to text, if you regard XML as text.
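The standard tool for that conversion is Apple’s `plutil` (flags as I remember them; check `man plutil`):

```
# convert a binary plist to XML in place
plutil -convert xml1 com.example.whatever.plist
# or print the XML to stdout, leaving the file alone
plutil -convert xml1 -o - com.example.whatever.plist
```

The filename is illustrative. `plutil -convert binary1` goes the other way.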

Which file is crashing/hanging `$PID`?

Perpetual monitoring natively:

`sudo fs_usage $PID | grep /path/to/file`

or the classic unix way:

`lsof -r -p $PID | grep /path/to/file`

See here for some tips on debugging runaway/hung/exploded processes. “See what syscalls the process actually tries to do and if there are any failed ones (status is not 0)”:

`sudo dtruss -p PID`

Case study: `distnoted` and `lsd`.

`lsd` runaway CPU: some indexing jobs can cause it to choke, e.g. application bundles. Also a corrupted database, which may be fixed thusly:

```
/System/Library/Frameworks/CoreServices.framework/Frameworks/LaunchServices.framework/Support/lsregister \
-kill -r -domain local -domain system -domain user ; killall Dock
```

This is some kind of notifications daemon. I have no idea why it is out of control. It seems to be related to other processes, such as certain versions of flux or bartender, but the problem only seems to occur when my backup drive is plugged in. Hmmph. File a ticket?

Michael Rourke suggests doing this every minute:

```
#!/bin/sh
# check for runaway distnoted, kill if necessary
PATH=/bin:/usr/bin
export PATH
ps -reo '%cpu,uid,pid,command' |
awk -v UID=$UID '
/distnoted agent$/ && $1 > 100.0 && $2 == UID {
system("kill -9 " $3)
}
'
```

But note this will break backup, so maybe just don’t.

For certain problems you need to reset the SMC and PRAM, which are formally referred to collectively as the `bootloadey whodangle`.
This seemed to break often for me.
I have an inkling it is no longer a thing for modern Macs.

Symptoms include:

- machine doesn’t boot
- CPU fan going all the time
- machine is pausing lots
- having trouble getting laid
- a global geopolitical malaise is leading to the ineluctable slide of civilisation into ecosocial catastrophe

Do these things:

- Reset the SMC: Switch the computer off, then, while off, on the built-in keyboard, press the (left side) ⇧-⌃-Option keys and the power button at the same time.
- Reset the PRAM: Switch the computer off, then, while booting, press and hold the ⌥-⌘-P-R keys until the startup sound chimes again.
- Build a small pyramid over your laptop from bronze and crystals. Burn some incense and your Applecare guarantee in a brazier atop it. Surround it with small pictures of your departed ancestors. Make an offering of fruit and prayer.

- Alt: offer a boot menu
- C: boot up off CD/USB
- ⌘-R: Recovery OS
- Shift: safe mode
- ⌘-V: verbose mode
- ⌘-S: single user prompt

See also network hacks for some non-mac-specific ones.

My previous awful router would get cross if I left the house and, on returning, tried to use the same DHCP lease. But it’s a one-liner to fix:

```
sudo ipconfig set en0 DHCP
ipconfig getpacket en0
```

odgard notes that some things are slow in macOS 10.15+. Suggested fixes include:

```
sudo spctl developer-mode enable-terminal
sudo spctl --master-disable
```

See also Jeff Johnson’s analysis.

Turn off wifi on your macbook from the macOS terminal command line:

`networksetup -setairportpower en0 off`

Turn on wifi on your macbook from the macOS terminal command line:

`networksetup -setairportpower en0 on`

List available wifi networks from the macOS terminal command line:

`/System/Library/PrivateFrameworks/Apple80211.framework/Versions/A/Resources/airport scan`

Join a wifi network from the macOS terminal command line:

`networksetup -setairportnetwork en0 WIFI_SSID WIFI_PASSWORD`

Find your network interface name:

`networksetup -listallhardwareports`

Oh no! Did you use your computer on some wacky workplace network whose DNS blocks “frivolous” websites? You need to flush it.

The DNS flush command keeps changing, eh?

```
sudo dscacheutil -flushcache
sudo killall -HUP mDNSResponder
```

More generally, avoid the problem with upgraded DNS.

*Stop stealing focus from me, slow app, I clicked on you like 30 seconds ago.*

CNET says: How to keep applications from stealing focus. But their first idea (edit the application, breaking its code signature) is not viable in the modern world.

The command-line background-open still works, although if I wanted to launch apps this way I would be using Linux.

`open -ga iCal`

The Apple supported solution is for you to buy a faster computer.

Never install skype! Skype is spookware. If you must use it, use the web version. UPDATE: is MS Teams built on Skype?

`sudo defaults write /System/Library/LaunchAgents/com.apple.gamed disabled -bool true`

or:

`launchctl unload -w /System/Library/LaunchAgents/com.apple.gamed.plist`

in reference to this.

Then an hour later it forgets again again?

Woe! I fixed this once then I forgot how I did it.

Linkdump while I sort it out again again again:

Continuous [sic] requests for the CalDAV password? Here’s one solution:

- Go to the Apple menu and choose System Preferences
- Choose the ‘iCloud’ preference pane
- Sign in to iCloud at the OS X preference panel — note if you’re already signed in but still seeing the pop-up message, you can sign out then sign back in to stop that password prompt from happening again
- Close System Preferences

Harder than it should be;
Google*really* wants you to use their furtively modified alternate branch,
Google Chrome.

I can’t even remember why I needed to do this, or how I worked it out, but geez it saved my bacon from something or other.

`ipcs -s | grep " 0x" | awk '{ print $2; }' | xargs -n 1 ipcrm -s`

`tmutil` is the command-line app which allows proper monitoring and control of the time machine service. It includes, for example, backup statistics.

There is also Time Machine Editor, which controls various time machine settings including the local snapshots discussed below. If you are using an oldish machine without a fast drive then this is useful.

I think this is largely a cosmetic issue, but macOS keeps a local copy of various versions of your files around, called local snapshots or mobile backups depending where you look.

If, like me, you version everything in git, this is mostly annoying, but also I think harmless, as they are self-cleaning. Nonetheless stuff did go wrong for me: I noticed that my spotlight indexing was stuck on the mobilebackups folder for some reason. Why was it even indexing that?

What Crap Is This: OSX’s Mobilebackups:

`sudo tmutil disablelocal`

By default, if you use notifications from Apple Calendar, the notification for events is at 9am, right in the middle of the first meeting of the day. So you are half way through the report-back presentation about your recent conference visit, and your laptop pops up **RECTAL EXAMINATION TODAY**.

You aren’t supposed to be able to change this, because the thought of this cruelty is all that gets jaded Apple executives out of bed in the morning, but there is a hack they forgot to stop.

**⚠️ Historical interest only ⚠️**. If you don’t have a pre-2020 macOS, use a mainstream PDF editor.

```
"/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py" \
-o PATH/TO/YOUR/MERGED/FILE.pdf \
/PATH/TO/ORIGINAL/1.pdf \
/PATH/TO/ANOTHER/2.pdf \
/PATH/TO/A/WHOLE/DIR/*.pdf
```

**UPDATE**: now that python no longer ships with macOS, this python2 script is still available but tedious to execute.

If you are concatenating, e.g., chapters of a PDF book you downloaded from e.g. Springer, then the creation dates might be in the correct order even if the filenames sort incorrectly. In that case you want (fish shell style):

```
pushd PATH/TO/PDFS
"/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py" \
-o PATH/TO/YOUR/MERGED/FILE.pdf \
(ls -rUt)
```

There is some kind of problem with spaces in pathnames if I do it from a different folder.

`homebrew` is a good dependency manager for apps which are not packaged by the system.

Install like this:

`/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`

On linux, I want all the libraries which are too patent-encumbered to be bundled with whatever holier-than-me distribution I run. This means codecs and other content-related apps, e.g.

`brew install libsamplerate libsndfile ffmpeg node pandoc webp libavif youtube-dl`

I would typically install the following bonus utilities

```
brew install hugo pandoc fish syncthing pyenv pipx rename \
fdupes rclone tectonic poetry bfg gdal
```

If I want to know what I currently have installed, I can do

`brew leaves`
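Relatedly, `brew bundle` (a separate brew subcommand; check `brew bundle --help` on your install) can snapshot that list into a `Brewfile` and replay it on another machine:

```
# write currently installed formulae, casks and taps to ./Brewfile
brew bundle dump
# reinstall everything listed in ./Brewfile
brew bundle install
```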

Closely related to AutoML, in that surrogate optimisation is a popular tool for such, and to adaptive design of experiments.

Classic GP surrogate optimisation is a popular tool for model calibration; see Kennedy and O’Hagan (2001) for a classic example. More recent: Plumlee (2017).

See Dellaporta et al. (2022) for the application of maximum mean discrepancy to the problem of model calibration.

Bayarri, Maria J, James O Berger, Rui Paulo, Jerry Sacks, John A Cafeo, James Cavendish, Chin-Hsu Lin, and Jian Tu. 2007. “A Framework for Validation of Computer Models.” *Technometrics* 49 (2): 138–54.

Cockayne, Jon, and Andrew B. Duncan. 2020. “Probabilistic Gradients for Fast Calibration of Differential Equation Models,” September.

Dellaporta, Charita, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. 2022. “Robust Bayesian Inference for Simulator-Based Models via the MMD Posterior Bootstrap.” *arXiv:2202.04744 [Cs, Stat]*, February.

Doherty, John. 2015. *Calibration and uncertainty analysis for complex environmental models*.

Dunbar, Oliver R. A., Andrew B. Duncan, Andrew M. Stuart, and Marie-Therese Wolfram. 2022. “Ensemble Inference Methods for Models With Noisy and Expensive Likelihoods.” *SIAM Journal on Applied Dynamical Systems* 21 (2): 1539–72.

Higdon, Dave, James Gattiker, Brian Williams, and Maria Rightley. 2008. “Computer Model Calibration Using High-Dimensional Output.” *Journal of the American Statistical Association* 103 (482): 570–83.

Huang, Yingxiang, Wentao Li, Fima Macheret, Rodney A Gabriel, and Lucila Ohno-Machado. 2020. “A Tutorial on Calibration Measurements and Calibration Models for Clinical Prediction Models.” *Journal of the American Medical Informatics Association: JAMIA* 27 (4): 621–33.

Izmailov, Pavel, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2020. “Subspace Inference for Bayesian Deep Learning.” In *Proceedings of The 35th Uncertainty in Artificial Intelligence Conference*, 1169–79. PMLR.

Kennedy, Marc C., and Anthony O’Hagan. 2001. “Bayesian Calibration of Computer Models.” *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 63 (3): 425–64.

Laloy, Eric, and Diederik Jacques. 2019. “Emulation of CPU-Demanding Reactive Transport Models: A Comparison of Gaussian Processes, Polynomial Chaos Expansion, and Deep Neural Networks.” *Computational Geosciences* 23 (5): 1193–1215.

Madan, Dilip B. 2014. “Recovering Statistical Theory in the Context of Model Calibrations.” *Journal of Financial Econometrics* 13 (2): nbu020.

McInerney, David, Mark Thyer, Dmitri Kavetski, Bree Bennett, Julien Lerat, Matthew Gibbs, and George Kuczera. 2018. “A Simplified Approach to Produce Probabilistic Hydrological Model Predictions.” *Environmental Modelling & Software* 109 (November): 306–14.

O’Hagan, A. 1978. “Curve Fitting and Optimal Design for Prediction.” *Journal of the Royal Statistical Society: Series B (Methodological)* 40 (1): 1–24.

Oakley, Jeremy E., and Benjamin D. Youngman. 2017. “Calibration of Stochastic Computer Simulators Using Likelihood Emulation.” *Technometrics* 59 (1): 80–92.

Perdikaris, Paris, and George Em Karniadakis. 2016. “Model inversion via multi-fidelity Bayesian optimization: a new paradigm for parameter estimation in haemodynamics, and beyond.” *Journal of the Royal Society, Interface* 13 (118): 20151107.

Pleiss, Geoff, Manish Raghavan, Felix Wu, Jon Kleinberg, and Kilian Q. Weinberger. 2017. “On Fairness and Calibration.” In *Advances In Neural Information Processing Systems*.

Plumlee, Matthew. 2017. “Bayesian Calibration of Inexact Computer Models.” *Journal of the American Statistical Association* 112 (519): 1274–85.

Regis, Rommel G., and Christine A. Shoemaker. 2013. “Combining Radial Basis Function Surrogates and Dynamic Coordinate Search in High-Dimensional Expensive Black-Box Optimization.” *Engineering Optimization* 45 (5): 529–55.

Sacks, Jerome, Susannah B. Schiller, and William J. Welch. 1989. “Designs for Computer Experiments.” *Technometrics* 31 (1): 41–47.

Sacks, Jerome, William J. Welch, Toby J. Mitchell, and Henry P. Wynn. 1989. “Design and Analysis of Computer Experiments.” *Statistical Science* 4 (4): 409–23.

Thiagarajan, Jayaraman J., Bindya Venkatesh, Rushil Anirudh, Peer-Timo Bremer, Jim Gaffney, Gemma Anderson, and Brian Spears. 2020. “Designing Accurate Emulators for Scientific Processes Using Calibration-Driven Deep Models.” *Nature Communications* 11 (1): 5622.

Tonkin, Matthew, and John Doherty. 2009. “Calibration-Constrained Monte Carlo Analysis of Highly Parameterized Models Using Subspace Techniques.” *Water Resources Research* 45 (12).

Dealing with it, predicting it etc.

Tim Harford summarizes Weitzman (M. Weitzman 2007; Martin L. Weitzman 2011; Martin L. Weitzman 2007):

It is only when we ponder the tail risk that we realise how dangerous climate change might be. Local air pollution isn’t going to wipe out the human race. Climate change probably won’t, either. But it might. When we buy insurance, it isn’t because we expect the worst, but because we recognise that the worst might happen.

Is my local neighbourhood likely to be flooded by sea level rises? Flooded by swollen rivers? Scoured by flame? If I live somewhere without a good department of meteorology, there are generic tools that will estimate this for me.

These generic tools, I am told by colleagues, are only a fallback. Locally-specific modelling that takes the local climate into account is likely superior when available, since it can be more detailed.

Australia: How the climate crisis will affect you.

A more detailed but older one here.

More usable data can be downloaded.

I have mixed feelings about these tools. The Australian data visualisations incorporate much high-grade locally-specific modelling, so they are probably more accurate than the generic ones. However, none of the well-visualised, easily-accessible data is the stuff I actually want; e.g. a map of average summer temperature does not tell me much. A map of the number of extreme fire danger days tells me more of what I need to know; that is quite hard to find.

Projections of that sort may be found in raw data files in the *threshold* data sets. However, this data is still not ideal, being badly indexed, not particularly well-explained, and confusingly named.

Paul Hawken et al., Drawdown:

Project Drawdown is the most comprehensive plan ever proposed to reverse global warming. Our organization did not make or devise the plan—we found the plan because it already exists. We gathered a qualified and diverse group of researchers from around the world to identify, research, and model the 100 most substantive, existing solutions to address climate change. What was uncovered is a path forward that can roll back global warming within thirty years. It shows that humanity has the means at hand. Nothing new needs to be invented.

The most interesting index is Solution summary by rank, which is one of the better uses of cost rankings of this stuff. If their figures are accurate, this is a good pinup for effective altruism.

Chicago’s Future of water

“We’re going to be like the Saudi Arabia of freshwater. This is one of the best places in the world to live out global warming.”

Robert Pollin, De-growth vs a green new deal

it is in fact absolutely imperative that some categories of economic activity should now grow massively—those associated with the production and distribution of clean energy. Concurrently, the global fossil-fuel industry needs to contract massively—that is, to ‘de-grow’ relentlessly over the next forty or fifty years until it has virtually shut down. In my view, addressing these matters in terms of their specifics is more constructive in addressing climate change than presenting broad generalities about the nature of economic growth, positive or negative.

Much interesting actuarial risk management in here; e.g. van den Bremer and van der Ploeg (2021).

Jeff Colgan, Jessica F. Green, Thomas Hale, Asset Revaluation and the Existential Politics of Climate Change:

While scholars have typically modeled climate change as a global collective action challenge, we offer a dynamic theory of climate politics based on the present and future revaluation of assets. Climate politics can be understood as a contest between owners of assets that accelerate climate change, such as fossil fuel plants, and owners of assets vulnerable to climate change, like coastal property. To date, obstruction by “climate-forcing” asset holders has been a large barrier to effective climate policy. But as climate change and decarbonization policies proceed, holders of both climate-forcing and “climate-vulnerable” assets stand to lose some or even all of the value of their assets over time, and with them, the basis of their political power. This dynamic contest between opposing interests is likely to intensify in many sites of political contestation, from the subnational to transnational levels. As it does so, climate politics will become increasingly existential, potentially reshaping political alignments within and across countries. Such shifts may further undermine the LIO: as countries develop pro-climate policies at different speeds and magnitudes, they will have incentives to diverge from existing arrangements over trade and economic integration.

The incentive design of getting individuals to voluntarily pay more for collective goods is… not ideal.

- 15 Trees does Australian community tree planting.
- Gold standard is often regarded as a best practice carbon offsetter, although I do not know what that means.
- The Green Electricity Guide

A Soil-Science Revolution Upends Plans to Fight Climate Change:

The hope was that the soil might save us. With civilization continuing to pump ever-increasing amounts of carbon dioxide into the atmosphere, perhaps plants—nature’s carbon scrubbers—might be able to package up some of that excess carbon and bury it underground for centuries or longer.

That hope has fueled increasingly ambitious climate change mitigation plans. Researchers at the Salk Institute, for example, hope to bioengineer plants whose roots will churn out huge amounts of a carbon-rich, cork-like substance called suberin. Even after the plant dies, the thinking goes, the carbon in the suberin should stay buried for centuries. This Harnessing Plants Initiative is perhaps the brightest star in a crowded firmament of climate change solutions based on the brown stuff beneath our feet.

Such plans depend critically on the existence of large, stable, carbon-rich molecules that can last hundreds or thousands of years underground. Such molecules, collectively called humus, have long been a keystone of soil science; major agricultural practices and sophisticated climate models are built on them.

But over the past 10 years or so, soil science has undergone a quiet revolution, akin to what would happen if, in physics, relativity or quantum mechanics were overthrown. Except in this case, almost nobody has heard about it—including many who hope soils can rescue the climate. “There are a lot of people who are interested in sequestration who haven’t caught up yet,” said Margaret Torn, a soil scientist at Lawrence Berkeley National Laboratory.

A new generation of soil studies powered by modern microscopes and imaging technologies has revealed that whatever humus is, it is not the long-lasting substance scientists believed it to be. Soil researchers have concluded that even the largest, most complex molecules can be quickly devoured by soil’s abundant and voracious microbes. The magic molecule you can just stick in the soil and expect to stay there may not exist.

- PredictIt runs climate markets.
- Wolfram Schlenker, Charles Taylor, The market is betting on climate change
- On the relationship between earth system models and the labs that build them
- How to talk to a sceptic. Rebranding idea: “How to talk *with* a sceptic”.
- Peter Watts, Because As We All Know, The Green Party Runs the World.
- UMichigan’s course on climate action
- summary of state of denial
- My2050 calculator - create your pathway for the UK to be net zero by 2050

Mental models of climate change and their difficulties for our ape brains. 🏗

John Sterman’s course on “bathtub dynamics”

See bushfires.

- Remember when we complained about hockey stick graphs? That complaint was premature.

From the Australian Climate Council:

- A report called ‘Hitting Home’ from earlier this year, which showed that the cost of extreme weather disasters in Australia has already doubled since the 1970s.
- A report from 2019, working with a team at University of Melbourne, which included the finding that the property market is expected to lose $571 billion in value by 2030 due to climate change and extreme weather, and will continue to lose value in the coming decades if emissions remain high.

- Rest of World’s Climate reportage
- Heatmap News — a climate change news service

Ackerman, Frank. 2017. *Worst-Case Economics: Extreme Events in Climate and Finance*. Illustrated edition. London New York: Anthem Press.

Ackerman, Frank, Stephen J DeCanio, Richard Howarth, and Kristen Sheeran. 2009. “Limitations of Integrated Assessment Models of Climate Change.” *Climatic Change* 95: 297–315.

Aono, Yasuyuki, and Keiko Kazui. 2008. “Phenological Data Series of Cherry Tree Flowering in Kyoto, Japan, and Its Application to Reconstruction of Springtime Temperatures Since the 9th Century.” *International Journal of Climatology* 28 (7): 905–14.

Bremer, Ton S. van den, and Frederick van der Ploeg. 2021. “The Risk-Adjusted Carbon Price.” *American Economic Review* 111 (9): 2782–2810.

Charpentier, Arthur, Laurence Barry, and Molly R. James. 2021. “Insurance Against Natural Catastrophes: Balancing Actuarial Fairness and Social Solidarity.” *The Geneva Papers on Risk and Insurance - Issues and Practice*, May.

Colgan, Jeff, Jessica F. Green, and Thomas Hale. 2020. “Asset Revaluation and the Existential Politics of Climate Change.” SSRN Scholarly Paper ID 3634572. Rochester, NY: Social Science Research Network.

DeCanio, Stephen J. 2003. *Economic Models of Climate Change: A Critique*. Palgrave Macmillan.

Fricke, Evan C., Alejandro Ordonez, Haldre S. Rogers, and Jens-Christian Svenning. 2022. “The Effects of Defaunation on Plants’ Capacity to Track Climate Change.” *Science*, January.

Geyer, Charles J. n.d. “Markov Chain Monte Carlo Lecture Notes,” 125.

Ghil, Michael, and Valerio Lucarini. 2020. “The Physics of Climate Variability and Climate Change.” *Reviews of Modern Physics* 92 (3): 035002.

Hsiang, Solomon M., Kyle C. Meng, and Mark A. Cane. 2011. “Civil Conflicts Are Associated with the Global Climate.” *Nature* 476 (7361): 438–41.

Keen, Steve. 2020. “The Appallingly Bad Neoclassical Economics of Climate Change.” *Globalizations* 0 (0): 1–29.

Laitner, John A, Stephen J DeCanio, and Irene Peters. 2000. “Incorporating Behavioural, Social, and Organizational Phenomena in the Assessment of Climate Change Mitigation Options.” In *Society, Behaviour, and Climate Change Mitigation*, 1–64. Dordrecht: Kluwer Academic Publishers.

Mielke, Jahel, and Gesine A. Steudle. 2018. “Green Investment and Coordination Failure: An Investors’ Perspective.” *Ecological Economics* 150 (August): 88–95.

Miguel, Edward, and Ahmed Mushfiq Mobarak. 2022. “The Economics of the COVID-19 Pandemic in Poor Countries.” *Annual Review of Economics* 14 (1): 253–85.

Oreskes, Naomi, and Erik M. Conway. 2010. *Merchants of Doubt: How a Handful of Scientists Obscured the Truth on Issues from Tobacco Smoke to Global Warming*. 1 edition. New York: Bloomsbury Press.

Ou-Yang, Chieh, Howard Kunreuther, and Erwann Michel-Kerjan. 2013. “An Economic Analysis of Climate Adaptations to Hurricane Risk in St. Lucia.” *The Geneva Papers on Risk and Insurance - Issues and Practice* 38 (3): 521–46.

Pielke, Roger, Gwyn Prins, Steve Rayner, and Daniel Sarewitz. 2007. “Climate Change 2007: Lifting the Taboo on Adaptation.” *Nature* 445: 597–98.

Pollin, Robert. 2018. “De-Growth Vs a Green New Deal.” *New Left Review*, II, no. 112: 5–25.

Proust, Katrina M, Stephen R Dovers, Barney Foran, Barry Newell, Will Steffen, and Patrick Troy. 2007. “Climate, Energy and Water: Accounting for the Links.” Canberra: Land & Water Australia.

Rolnick, David, Priya L. Donti, Lynn H. Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran, Andrew Slavin Ross, et al. 2019. “Tackling Climate Change with Machine Learning.” *arXiv:1906.05433 [Cs, Stat]*, November.

Schiermeier, Quirin. 2018. “Droughts, Heatwaves and Floods: How to Tell When Climate Change Is to Blame.” News. Nature.

Schlenker, Wolfram, and Charles A Taylor. 2019. “Market Expectations About Climate Change.” Working Paper 25554. National Bureau of Economic Research.

Sterman, John D. 2011. “Communicating Climate Change Risks in a Skeptical World.” *Climatic Change* 108 (4): 811–26.

Sterman, John D., and Linda Booth Sweeney. 2007. “Understanding Public Complacency about Climate Change: Adults’ Mental Models of Climate Change Violate Conservation of Matter.” *Climatic Change* 80 (3-4): 213–38.

Sterman, John D, and Linda B Sweeney. 2002. “Cloudy Skies: Assessing Public Understanding of Global Warming.” *System Dynamics Review* 18: 207–40.

Thiébaux, H. J., and F. W. Zwiers. 1984. “The Interpretation and Estimation of Effective Sample Size.” *Journal of Climate and Applied Meteorology* 23 (5): 800–811.

Weitzman, Martin. 2007. “Structural Uncertainty and the Value of Statistical Life in the Economics of Catastrophic Climate Change.” Working Paper 13490. National Bureau of Economic Research.

Weitzman, Martin L. 2007. “A Review of The Stern Review on the Economics of Climate Change.” *Journal of Economic Literature*, 22.

Weitzman, Martin L. 2011. “Fat-Tailed Uncertainty in the Economics of Catastrophic Climate Change.” *Review of Environmental Economics and Policy* 5 (2): 275–92.

Yglesias, Matthew. 2012. *The Rent Is Too Damn High: What To Do About It, And Why It Matters More Than You Think*. Simon & Schuster.

Some public goods I long for can best be achieved by outsourcing, i.e. I give someone else money to achieve them for us all. That is what charitable donation is. I mention my specific donations here because

- I believe in normalising donations as part of a healthy society. (This is a contingent stance; many of the causes I donate to have the goal of *reducing the need for charitable donation*, which I think is better than having a society which relies upon affluent guilt.)
- I hope that by highlighting the causes I donate to, I will encourage others to donate to them, so the publicity is useful leverage.
- For my own reference, I want a centralised list of whom I am donating to so that I can *stop* my donations if I decide the recipient is no longer the best place to send my money. Maybe if donating to saving the planet becomes the new conspicuous consumption, then not only will I look fancy, I’ll even have a planet to look fancy on.
- Maybe you will have better ideas about whom I should donate to and will engage constructively on that topic to improve my strategy.
- I hope that you will think I am a nice person for giving money to strangers. (Maybe I even *am* a nice guy? You should probably demand more stringent proof if that is a matter of import to you.)

Good.

Listing organisations here should not be taken as my personal endorsement of any individual tactical decision made by any of the organisations or individuals mentioned, nor my blanket support of all positions they may adopt.

Unlike, say, classic marginalist-style Effective Altruist organisations’ donor lists, there is little mention here of mosquito nets. I am more interested in moon-shots and hail-mary punts and other high-variance strategies, which the Open Philanthropy people bill as hits-based giving, whereas the default EA strategy is low-variance.

I am interested especially in*organisations which aim to change the system* to enable us humans to solve problems for ourselves, i.e. disruptive changers.
That is, I mostly give money to lobbyists, builders of capacity, and builders of tools.
This is, IMO, a higher-risk, higher-expected-reward strategy than (important, useful) concrete certainties like mosquito nets, and also, TBH, one that I more directly benefit from personally.
Enlightened mutual interest is kind of my whole thing though.
Further, we are at a point in human history where high-risk high-reward is pretty much the only wise strategy.
Hail-mary bets all the way.

As a side order, I give money to some creators whose work I enjoy. This amount of money is somewhat smaller than the donations to political activity.

For reference, my current donation level is 4% of my income. Unless there is an exceptional burning emergency or matching funding from a donor, I attempt to give my donations as regular recurring payments, to provide budgeting certainty to both the organisation and to myself. Also, organisations that need to raise funding by alarmism learn some unhealthy habits around crying wolf.

Because of my work and relationships, I have private information that suggests that some of these organisations are high leverage.

- Original power: Australian-Indigenous-led campaign for self-determination through clean energy. My research into them suggests that they have high leverage to improve both indigenous self-determination outcomes and the politics of energy in Australia more broadly.
- The Guardian because they are a (relatively) credible dissident media organisation in Murdochistan.
- Effective Altruism Australia, because even though I am not impressed with their low variance portfolio, it’s tax-deductible so…
- Better Renting because land economics in Australia is cooked, and I believe under-addressed.
- RE-Alliance for social licence for renewable energy Australia. Non-tax-deductible.
- Tomorrow Movement for youth climate voices. Tax deductible.
- ~~Richard Boyle, whose persecution by the Australian government for whistleblowing on government activity is one of several test cases in the Australian authoritarian turn.~~ No longer taking donations.

The next few are about private and/or open source computing infrastructure.

- Thunderbird, because it is the closest to being an adequate linux email client.
- Whonix because it is a slightly esoteric bit of harm reduction for avoiding the chilling effects of state surveillance of citizens, so I figure they need it more than the less esoteric but still important tails.
- Manyverse for exactly the same reason.
- Zotero who build amazing infrastructure for my citations.

Various creators on Patreon I can’t work out how to link to *en masse*:

- Michael Betancourt because he is a giant nerd doing useful nerdy things for other giant nerds like me, such as explaining diffeomorphisms.
- End of shi(f)t report, the most harrowing and clever nursing blog I have ever read.
- Oglaf because I like fancy dick jokes.
- Emiliano Heyns who builds extra useful infrastructure for my citations.
- Marie Brennan whose books I love, why not?
- Dave Kellett
- Laszlo Montgomery, China podcaster whose idiosyncratic style and endless fascination I find addictive and cheering. Plus also the history of China is fascinating.

For whom I do work using my specialised skills.

~~Progtech,~~ the Progressive Tech network. I attempted to volunteer teaching data-science-for-campaigning to these folks, but they never allocated me to anyone. Progtech: you folks are welcome to get back in touch.

If your cause is righteous, you are welcome to pitch to me for a slice of my volunteer time. Send me a short paragraph making a case for what it will help (I tend to favour climate- and conflict-risk mitigation causes) and why you think it will be a high-leverage use of my time towards that goal.

A special category of donations, which I do not think is high leverage *directly*, but which might seed discussion.

Australia has a complicated and generally horrifying history regarding the treatment of the first people to live here. Putting that right is going to be long, slow and complicated and I am not expert in the nuances of policy to make it better. The institutions that might better the autonomy of Aboriginal people in Australia are not strong, at least not everywhere. It is not always clear how to act to make the best difference, in the sometimes-dirty power politics.

However, when it comes to paying Aboriginal people for the use of land in Australia, I think things are relatively straightforward. I am not Aboriginal, but I have common cause with the Aboriginal people of Sydney in struggling against the nascent landed caste system that is choking us all, and further, Aboriginal people are on average worse off than I because of it.

In institutional terms, I argue that we need a threat of credible punishment for screwing people over in order to disincentivise future screwing over of people.
I do not have the power to directly institute such a system.
I can make a small positive difference to Aboriginal people by paying some cash back to some of the people whose historical mistreatment has most directly benefited me, and publicising it here.
Also, if I can increase the diversity and accessibility of housing to Aboriginal people in urban Australia it will also help me, personally, by getting me a more interesting, diverse and affordable place to live; we can think of that as a *gentrification levy* if we want.

The Pay the rent movement is all about this kind of idea.

So, to whom should I pay the rent to accomplish these goals of starting-conversations-about-fixing-equity-for-Australians? Ideally there should be an equitable system for collecting and allocating land rents to Aboriginal land custodians, but that institution does not exist at the moment. This is a substantial problem — we do not have good incentive alignment in the institutions we cobble together to cover that gap, and this voluntary payment does not well approximate the redress that I am suggesting would be ideal.

OK, but I am not the guy to build those hypothetical institutions. What do I do?
The governing Aboriginal land council where I live, the Metropolitan Local Aboriginal Land Council, does not take donations.^{1}
For now, the best I have is Original power, an Australian-Indigenous-led campaign for self-determination through clean energy.
I will revisit this choice periodically.

Here is a handy reference to finding the traditional owners of land in Australia.

but *do* consider hiring out their campsite; it is lovely. Also, they are champs.↩︎

Specifically \((\mathrm{K}=\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I})\) where \(\mathrm{K}\in\mathbb{R}^{N\times N}\) and \(\mathrm{Z}\in\mathbb{R}^{N\times D}\) with \(D\ll N\). A workhorse. We might get a cool low-rank decomposition like this from matrix factorisation, but such decompositions arise everywhere. To pick one example, Gaussian processes.

Lots of fun tricks, mostly because of the Woodbury identity. See Ken Tay’s intro on that.
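As a quick sanity check, the Woodbury identity for this \(\mathrm{K}\) can be verified numerically. The following numpy sketch (sizes and names are my own illustrative choices) only ever inverts a small \(D\times D\) matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma2 = 50, 3, 0.5
Z = rng.standard_normal((N, D))

# dense N x N matrix, formed here only so we can check the identity
K = Z @ Z.T + sigma2 * np.eye(N)

# Woodbury: K^{-1} = s^{-2} I - s^{-4} Z (I + s^{-2} Z^T Z)^{-1} Z^T;
# the only linear solve happens against the small D x D "inner" matrix
inner = np.eye(D) + Z.T @ Z / sigma2
K_inv = np.eye(N) / sigma2 - Z @ np.linalg.solve(inner, Z.T) / sigma2**2

assert np.allclose(K_inv, np.linalg.inv(K))
```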

Specifically, solving \(\mathrm{K}\mathrm{X}=\mathrm{B}\) for \(\mathrm{X}\), where \(\mathrm{B}\in\mathbb{R}^{N\times M}\), and in particular solving it efficiently, in the sense that we

- exploit the low-rank structure of \(\mathrm{K}\) so that computing \(\mathrm{K}^{-1}\mathrm{B}\) costs much less than the naive \(\mathcal{O}(N^3 + N^2M)\);
- avoid ever forming the explicit inverse matrix \(\mathrm{K}^{-1}\), which requires \(\mathcal{O}(N^2)\) storage.

This is possible using the following useful trick. Applying the Woodbury identity, \[\begin{align*} \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\sigma^{-4} \mathrm{Z}\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\top} \mathrm{Z}\right)^{-1} \mathrm{Z}^{\top} \end{align*}\] we compute the lower Cholesky decomposition \(\mathrm{L} \mathrm{L}^{\top}=\left(\mathrm{I}+\sigma^{-2}\mathrm{Z}^{\top} \mathrm{Z}\right)^{-1}\) at a cost of \(\mathcal{O}(ND^2+D^3)\), and define \(\mathrm{R}=\sigma^{-2}\mathrm{Z} \mathrm{L}\). We use this to discover \[ \mathrm{K}^{-1}=\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\top}, \] and we may thus compute the solution by matrix multiplication \[\begin{aligned} \mathrm{K}^{-1}\mathrm{B} &=\left(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2 \mathrm{I}\right)^{-1}\mathrm{B}\\ &=\left(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\top}\right)\mathrm{B}\\ &=\underbrace{\sigma^{-2}\mathrm{B}}_{N \times M} - \underbrace{\mathrm{R}}_{N\times D} \underbrace{\mathrm{R}^{\top}\mathrm{B}}_{D\times M} \end{aligned}\]

The solution of the linear system is thus available at a cost of \(\mathcal{O}\left(ND^2 + D^3 + NDM\right)\). Generalising from \(\sigma^2\mathrm{I}\) to an arbitrary diagonal is easy.
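The recipe above might look like the following in numpy (a sketch under my own choice of sizes; the dense \(\mathrm{K}\) is formed only to check the answer):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M, sigma2 = 200, 5, 4, 0.3
Z = rng.standard_normal((N, D))
B = rng.standard_normal((N, M))

# inner = I + s^{-2} Z^T Z is D x D; we want L with L L^T = inner^{-1},
# i.e. L = C^{-T} where C C^T = inner is the Cholesky decomposition
inner = np.eye(D) + Z.T @ Z / sigma2
C = np.linalg.cholesky(inner)
L = np.linalg.inv(C).T            # D x D triangular inverse, cheap for D << N
R = Z @ L / sigma2                # N x D

# K^{-1} B = s^{-2} B - R (R^T B), never forming the N x N inverse
X = B / sigma2 - R @ (R.T @ B)

K = Z @ Z.T + sigma2 * np.eye(N)  # dense version, for checking only
assert np.allclose(K @ X, B)
```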

TODO: discuss positive-definiteness.

TODO: Centered version (Ameli and Shadden 2023; D. Harville 1976; D. A. Harville 1977; Henderson and Searle 1981).

Nakatsukasa (2019) observes that

The nonzero eigenvalues of \(\mathrm{Y} \mathrm{Z}\) are equal to those of \(\mathrm{Z} \mathrm{Y}\): an identity that holds as long as the products are square, even when \(\mathrm{Y}, \mathrm{Z}\) are rectangular. This fact naturally suggests an efficient algorithm for computing eigenvalues and eigenvectors of a low-rank matrix \(\mathrm{K}=\mathrm{Y} \mathrm{Z}\) with \(\mathrm{Y}, \mathrm{Z}^{\top} \in \mathbb{C}^{N \times r}, N \gg r\): form the small \(r \times r\) matrix \(\mathrm{Z} \mathrm{Y}\) and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by \(\mathrm{Y} \mathrm{Z} v=\lambda v \Leftrightarrow \mathrm{Z} \mathrm{Y} w=\lambda w\) with \(w=\mathrm{Z} v\) […]

Converting that to the case of \(\mathrm{K}\), we have that the nonzero eigenvalues of \(\mathrm{Z} \mathrm{Z}^{\top}\) are equal to those of \(\mathrm{Z}^{\top} \mathrm{Z}\); to compute eigenvalues and eigenvectors of a low-rank matrix \(\mathrm{X}=\mathrm{Z} \mathrm{Z}^{\top}\), form the small \(D \times D\) matrix \(\mathrm{Z}^{\top} \mathrm{Z}\) and find its eigenvalues and eigenvectors. For nonzero eigenvalues, the eigenvectors are related by \(\mathrm{Z} \mathrm{Z}^{\top} v=\lambda v \Leftrightarrow \mathrm{Z}^{\top} \mathrm{Z} w=\lambda w\) with \(w=\mathrm{Z}^{\top} v\).
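In numpy, the correspondence looks like this (illustrative sizes; recall that `eigvalsh`/`eigh` return eigenvalues in ascending order):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 40, 3
Z = rng.standard_normal((N, D))

big = np.linalg.eigvalsh(Z @ Z.T)    # N eigenvalues, ascending
small = np.linalg.eigvalsh(Z.T @ Z)  # D eigenvalues, ascending

# the D nonzero eigenvalues of the big Gram matrix match the small one's
assert np.allclose(big[-D:], small)

# eigenvector map: (Z^T Z) w = lam w  implies  (Z Z^T)(Z w) = lam (Z w)
lam, W = np.linalg.eigh(Z.T @ Z)
v = Z @ W[:, -1]
assert np.allclose((Z @ Z.T) @ v, lam[-1] * v)
```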

A classic piece of lore is cheap eigendecomposition of \(\mathrm{K}\) by exploiting the low-rank structure and the SVD. I have no idea who invented this, but here goes. First we calculate the thin SVD of \(\mathrm{Z}\) to obtain \(\mathrm{Z}=\mathrm{U}\mathrm{S}\mathrm{V}^{\top}\), where \(\mathrm{U}\in\mathbb{R}^{N\times D}\) has orthonormal columns, \(\mathrm{V}\in\mathbb{R}^{D\times D}\) is orthogonal and \(\mathrm{S}\in\mathbb{R}^{D\times D}\) is diagonal. Then we may write \[ \begin{aligned} \mathrm{K} &= \mathrm{Z} \mathrm{Z}^{\top} + \sigma^2 \mathrm{I} \\ &= \mathrm{U} \mathrm{S} \mathrm{V}^{\top} \mathrm{V} \mathrm{S} \mathrm{U}^{\top} + \sigma^2 \mathrm{I} \\ &= \mathrm{U} \mathrm{S}^2 \mathrm{U}^{\top} + \sigma^2 \mathrm{I} \end{aligned} \] Thus the top \(D\) eigenvalues of \(\mathrm{K}\) are \(\sigma^2+s_d^2\), and the corresponding eigenvectors are \(\boldsymbol{u}_d\). The remaining eigenvalues are \(\sigma^2\), and the corresponding eigenvectors are an arbitrary orthonormal basis for the complement of the span of \(\mathrm{U}\).
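Here is a numeric check of that eigendecomposition via the thin SVD (numpy sketch; sizes are mine):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, sigma2 = 30, 4, 0.7
Z = rng.standard_normal((N, D))
K = Z @ Z.T + sigma2 * np.eye(N)

# thin SVD: U is N x D with orthonormal columns, s holds the D singular values
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

evals = np.linalg.eigvalsh(K)  # ascending
# top D eigenvalues are sigma^2 + s_d^2; the other N - D are sigma^2
assert np.allclose(evals[-D:], np.sort(sigma2 + s**2))
assert np.allclose(evals[:N - D], sigma2)
# each column of U is an eigenvector of K for the matching eigenvalue
assert np.allclose(K @ U[:, 0], (sigma2 + s[0]**2) * U[:, 0])
```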

Louis Tiao, Efficient Cholesky decomposition of low-rank updates, summarising Seeger (2004).

Specifically, \((\mathrm{Y} \mathrm{Y}^{\top}+\sigma^2\mathrm{I})(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I})\). Are low-rank products cheap?

\[
\begin{aligned}
(\mathrm{Y} \mathrm{Y}^{\top}+\sigma^2\mathrm{I})(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I})
&=\mathrm{Y} \mathrm{Y}^{\top} \mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{Y} \mathrm{Y}^{\top}+\sigma^2\mathrm{Z} \mathrm{Z}^{\top}+\sigma^4\mathrm{I}\\
&=\mathrm{Y} (\mathrm{Y}^{\top} \mathrm{Z} )\mathrm{Z}^{\top}+\sigma^2\mathrm{Y} \mathrm{Y}^{\top}+\sigma^2\mathrm{Z} \mathrm{Z}^{\top}+\sigma^4\mathrm{I}
\end{aligned}
\]
which is still a *sum* of low-rank approximations.
At this point it might be natural to consider a tensor decomposition.
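A numeric check of that expansion (numpy sketch, my own sizes); the point is that the small \(D\times D\) core \(\mathrm{Y}^{\top}\mathrm{Z}\) keeps every factor cheap to form:

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, sigma2 = 25, 3, 0.4
Y = rng.standard_normal((N, D))
Z = rng.standard_normal((N, D))
I = np.eye(N)

lhs = (Y @ Y.T + sigma2 * I) @ (Z @ Z.T + sigma2 * I)
# expanded form, with the small D x D core Y^T Z factored out
rhs = (Y @ (Y.T @ Z) @ Z.T
       + sigma2 * (Y @ Y.T) + sigma2 * (Z @ Z.T) + sigma2**2 * I)
assert np.allclose(lhs, rhs)
```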

Suppose the low-rank inverse factors of \(\mathrm{Y}\) and \(\mathrm{Z}\) are, respectively, \(\mathrm{R}\) and \(\mathrm{C}\). Then we have

\[ \begin{aligned} &(\mathrm{Y} \mathrm{Y}^{\top}+\sigma^2\mathrm{I})^{-1}(\mathrm{Z} \mathrm{Z}^{\top}+\sigma^2\mathrm{I})^{-1}\\ &=(\sigma^{-2}\mathrm{I}-\mathrm{R} \mathrm{R}^{\top})(\sigma^{-2}\mathrm{I}-\mathrm{C} \mathrm{C}^{\top})\\ &=\sigma^{-4}\mathrm{I}-\sigma^{-2}\mathrm{R} \mathrm{R}^{\top}-\sigma^{-2}\mathrm{C} \mathrm{C}^{\top}+\mathrm{R} (\mathrm{R}^{\top}\mathrm{C}) \mathrm{C}^{\top}\\ \end{aligned} \]

Once again, cheap to evaluate, but not so obviously nice.
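To keep myself honest, here is a numeric check of the inverse-product expansion; note that the two single-factor cross terms carry \(\sigma^{-2}\), not \(\sigma^{-4}\). The helper `inv_factor` (my own name) computes the Woodbury factor \(\mathrm{R}\) from the earlier trick:

```python
import numpy as np

def inv_factor(Z, sigma2):
    """Return R such that (Z Z^T + sigma2 I)^{-1} = sigma2^{-1} I - R R^T."""
    D = Z.shape[1]
    inner = np.eye(D) + Z.T @ Z / sigma2
    L = np.linalg.inv(np.linalg.cholesky(inner)).T
    return Z @ L / sigma2

rng = np.random.default_rng(5)
N, D, sigma2 = 30, 3, 0.6
Y = rng.standard_normal((N, D))
Z = rng.standard_normal((N, D))
I = np.eye(N)

R = inv_factor(Y, sigma2)
C = inv_factor(Z, sigma2)

lhs = np.linalg.inv(Y @ Y.T + sigma2 * I) @ np.linalg.inv(Z @ Z.T + sigma2 * I)
rhs = (I / sigma2**2
       - (R @ R.T) / sigma2 - (C @ C.T) / sigma2   # sigma^{-2} cross terms
       + R @ (R.T @ C) @ C.T)
assert np.allclose(lhs, rhs)
```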

Suppose we want to measure the Frobenius distance between \(\mathrm{K}_{\mathrm{U},\sigma^2}\) and \(\mathrm{K}_{\mathrm{R},\gamma^2}\). We recall that we might expect things to be nice if they are exactly low rank because, e.g. \[ \begin{aligned} \|\mathrm{U}\mathrm{U}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{U}\mathrm{U}^{\top}\mathrm{U}\mathrm{U}^{\top}\right) =\|\mathrm{U}^{\top}\mathrm{U}\|_F^2 \end{aligned} \] How does it come out as a distance between two low-rank-plus-diagonal matrices? The answer may be found without forming the full matrices. For compactness, we define \(\delta^2=\sigma^2-\gamma^2\). \[ \begin{aligned} &\|\mathrm{U}\mathrm{U}^{\dagger}+\sigma^{2}\mathrm{I}-\mathrm{R}\mathrm{R}^{\dagger}-\gamma^{2}\mathrm{I}\|_F^2\\ &=\left\|\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger} + \delta^2\mathrm{I}\right\|_{F}^2\\ &=\left\|\mathrm{U}\mathrm{U}^{\dagger}+(i\mathrm{R})(i\mathrm{R})^{\dagger} + \delta^2\mathrm{I} \right\|_{F}^2\\ &=\left\|\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I} \right\|_{F}^2\\ &=\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I},\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger} + \delta^2\mathrm{I} \right\rangle_{F}\\ &=\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger},\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right\rangle_{F} +\left\langle\delta^2\mathrm{I}, \delta^2\mathrm{I} \right\rangle_{F}\\ &\quad+2\operatorname{Re}\left(\left\langle\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}, \delta^2\mathrm{I} \right\rangle_{F}\right)\\ 
&=\operatorname{Tr}\left(\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right) +\delta^4D\\ &\quad+2\delta^2\operatorname{Re}\operatorname{Tr}\left(\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}\begin{bmatrix} \mathrm{U} &i\mathrm{R}\end{bmatrix}^{\dagger}\right)\\ &=\operatorname{Tr}\left(\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\right) +\delta^4D\\ &\quad+2\delta^2\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger} -\mathrm{R}\mathrm{R}^{\dagger}\right)\\ &=\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\right) -2\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\right) + \operatorname{Tr}\left(\mathrm{R}\mathrm{R}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\right) +\delta^4D \\ &\quad+2\delta^2\left(\operatorname{Tr}\left(\mathrm{U}\mathrm{U}^{\dagger}\right) -\operatorname{Tr}\left(\mathrm{R}\mathrm{R}^{\dagger}\right)\right)\\ &=\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{U}\mathrm{U}^{\dagger}\mathrm{U}\right) -2\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\mathrm{U}\right) + \operatorname{Tr}\left(\mathrm{R}^{\dagger}\mathrm{R}\mathrm{R}^{\dagger}\mathrm{R}\right) +\delta^4D \\ &\quad+2\delta^2\left(\operatorname{Tr}\left(\mathrm{U}^{\dagger}\mathrm{U}\right) -\operatorname{Tr}\left(\mathrm{R}^{\dagger}\mathrm{R}\right)\right)\\ &=\left\|\mathrm{U}^{\dagger}\mathrm{U}\right\|^2_F -2\left\|\mathrm{U}^{\dagger}\mathrm{R}\right\|^2_F + \left\|\mathrm{R}^{\dagger}\mathrm{R}\right\|^2_F +\delta^4D +2\delta^2\left(\left\|\mathrm{U}\right\|^2_F -\left\|\mathrm{R}\right\|^2_F\right) \end{aligned} \]
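The final identity is easy to verify numerically for real matrices (where \(\dagger\) is just transpose). In this numpy sketch the \(\delta^4\) term scales with the ambient dimension, here `n`; all names and sizes are mine:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d1, d2 = 20, 3, 4
sigma2, gamma2 = 0.9, 0.4
U = rng.standard_normal((n, d1))
R = rng.standard_normal((n, d2))
delta2 = sigma2 - gamma2

def fro2(A):
    """Squared Frobenius norm."""
    return np.linalg.norm(A, 'fro') ** 2

# direct: form both n x n matrices and take the distance
direct = fro2(U @ U.T + sigma2 * np.eye(n) - R @ R.T - gamma2 * np.eye(n))

# cheap: only small Gram matrices like U^T U, U^T R, R^T R are formed
cheap = (fro2(U.T @ U) - 2 * fro2(U.T @ R) + fro2(R.T @ R)
         + delta2**2 * n + 2 * delta2 * (fro2(U) - fro2(R)))

assert np.allclose(direct, cheap)
```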

Mostly I use pytorch’s linear algebra.


Learning for tabular data, i.e. the stuff you generally store in spreadsheets and relational databases, and maybe even process as structured text.

In R this means data frames. In Julia this means DataFrames.jl. In Python this means pandas DataFrames, Xarrays, Polars DataFrames, or whatever else.

- pandas is more-or-less a dataframe class for Python.

`pandas` plus `statsmodels` look a lot like R. On the minus side, this combination lacks some language features of R (e.g. regression formulae are not first-class language features). On the plus side, it lacks some language misfeatures of R (the object model being a box of turds, copy-by-value semantics, and all those other annoyances).

I am not a huge fan of pandas, personally. The engineering behind it is impressive, but the workflow ends up not fitting my actual problems particularly well. Fun to look at but not to touch. This might be about my workflow being idiosyncratic, or it might be because the original author had an idiosyncratic workflow, and it is he who needed weird features that the rest of us trip over. YMMV.

In comparison to R, one crucial weakness is that R has a rich ecosystem of tools for dataframes. Python’s is a bit thinner.

Also (and I do not know if this was the true process or not) when pandas was designed, Wes McKinney made a bunch of design choices differently than R did, possibly thinking to himself “Why did R not do it this way which is clearly better”, only to discover that in practice R’s way was better, or that the cool hack ended up being awkward in python syntax. Chief among these is the obligatory indexing of rows in the table; I spend a lot of time fighting pandas’ insistence on wanting everything to be named, and then interpreting those arbitrary names as meaningful.
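A toy example of the kind of index-wrangling I mean (my own illustrative workaround, not an official recommendation): aggregation promotes the grouping key into the index, and I immediately flatten it back out.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a"], "v": [1, 2, 3]})

# groupby silently promotes the grouping key to an index...
totals = df.groupby("k")["v"].sum()

# ...which I usually flatten straight back into an ordinary column.
flat = totals.reset_index()
print(list(flat.columns))  # ['k', 'v']
```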

Anyway, this is still very usable and useful.

Lots of nice things are built on pandas, such as …

statsmodels, which is more-or-less a minimalist subset of standard R, but in Python. It implements:

- Linear regression models
- Generalized linear models
- Discrete choice models
- Robust linear models
- Many models and functions for time series analysis
- Nonparametric estimators
- A wide range of statistical tests
- etc
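For flavour, here is the classic workhorse of that list, ordinary least squares, sketched in plain numpy rather than through the statsmodels API, so the only thing assumed is the maths:

```python
import numpy as np

# OLS by hand: fit y = 2 - 3x from noisy observations via least squares.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.standard_normal(100)])
beta_true = np.array([2.0, -3.0])
y = X @ beta_true + 0.01 * rng.standard_normal(100)

# Solve min ||X beta - y||^2; statsmodels' sm.OLS wraps this (plus inference).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

The point of statsmodels is everything this sketch omits: standard errors, diagnostics, and the formula interface.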

patsy implements a formula language for `pandas`. Patsy does lots of things, but most importantly, it

- builds design matrices (i.e. it knows how to represent `z~x^2+x+y^3` as a matrix, which only sounds trivial if you haven’t tried it)
- statefully preconditions data (e.g. constructs data transforms that will correctly normalise the test set as well as the training data.)
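To see why design matrices are non-trivial, here is roughly what patsy assembles behind the scenes for a formula like `z ~ x + I(x**2) + I(y**3)` (a hand-rolled numpy sketch; real patsy output also carries column metadata and the stateful transforms):

```python
import numpy as np

# Toy columns standing in for dataframe variables x and y.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 1.0, 1.5, 2.0])

# An intercept column plus one column per term of the formula.
X = np.column_stack([np.ones_like(x), x, x**2, y**3])
print(X.shape)  # (4, 4)
```

The hard parts patsy handles and this sketch does not: categorical coding, interactions, and remembering centring/scaling parameters from the training data.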

Pandas AI: The Generative AI Python Library makes pandas’s occasionally-abstruse query language a bit more natural.

```
import pandas as pd
from pandasai import PandasAI

# Sample DataFrame
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})

# Instantiate an LLM
from pandasai.llm.openai import OpenAI
llm = OpenAI()

pandas_ai = PandasAI(llm)
pandas_ai.run(df, prompt='Which are the 5 happiest countries?')
```

pandera implements type sanity and validation.

The pandas API is popular; there are a few tools which aim to accelerate calculations by providing backends for it based on alternative data formats or parallelism needs.

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF also provides a pandas-like API that will be familiar to data engineers & data scientists, so they can use it to easily accelerate their workflows without going into the details of CUDA programming.

Scale your pandas workflow by changing a single line of code — The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.

Koalas: pandas API on Apache Spark seems to be Modin but for Spark.

Python dataframe interchange protocol

Python users today have a number of great choices for dataframe libraries. From Pandas and cuDF to Vaex, Koalas, Modin, Ibis, and more. Combining multiple types of dataframes in a larger application or analysis workflow, or developing a library which uses dataframes as a data structure, presents a challenge though. Those libraries all have different APIs, and there is no standard way of converting one type of dataframe into another.
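Pandas itself implements this interchange protocol (in versions 1.5 and later), so we can at least demonstrate a round trip without installing a second dataframe library:

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

src = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# Any producer implementing the protocol exposes __dataframe__();
# any consumer can rebuild its own native object from that view.
roundtrip = from_dataframe(src.__dataframe__())
print(roundtrip["a"].tolist())  # [1, 2, 3]
```

In practice `src` would be, say, a Polars or cuDF frame and the consumer would be pandas, or vice versa.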

Polars is a blazingly fast DataFrames library implemented in Rust using the Apache Arrow Columnar Format as the memory model.

- Lazy | eager execution
- Multi-threaded
- SIMD
- Query optimization
- Powerful expression API
- Hybrid Streaming (larger than RAM datasets)
- Rust | Python | NodeJS | ...

Does not invent a python-specific data format but instead leverages Apache Arrow.
It *looks* like pandas in many ways, but is not 100% compatible; see Polars for pandas users.
This means that the ecosystem is thinner again than the pandas ecosystem; on the other hand, some stuff looks easier than pandas, so maybe it is not too bad in practice.

Welcome to the most popular page on this blog. No joke, this one is a perennial favourite. Please leave a comment if you notice something wrong or outdated; you will help hundreds of people.

Visual Studio Code is for me the overall best way of editing LaTeX. Compared to a special-purpose editor (e.g. TeXShop) the preview and workflow are inferior but adequate. The actual editing experience is superior, better streamlined and more integrated into my workflow than any weird-tin-pot specialist editor maintained by one quirky academic somewhere. However, I had to tweak it a bit to get it to work really well, and I made some mistakes. Successes and failures both are reproduced here for your delectation and amusement. Alexander Zeilmann makes an eloquent argument for this choice in his LaTeX Workflow post, which includes additional tips.

A second-best is Overleaf, which is a web-based editor, and thus gets bonus points for being collaborative.

There is one major decision point here: the choice of which extension to use.

Texlab is more aesthetic and elegant to my mind, but it is not as easy to use or as well-supported by a large community.

Set up a git `.gitignore` to stop VS Code freaking out about all the auxiliary files.
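For concreteness, a minimal `.gitignore` along these lines works for me; the exact extension list depends on your toolchain:

```
*.aux
*.bbl
*.blg
*.fdb_latexmk
*.fls
*.log
*.out
*.synctex.gz
```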

Contra my VS Code spell checking advice, SpellRight is unsupported for LaTeX-Workshop.

If I persevere in using SpellRight, the following exclusions make things tidier:

```
"spellright.ignoreRegExpsByClass": {
    "latex": [
        "/\\\\begin\\{(equation|align)\\}(.*?)\\\\end\\{(equation|align)\\}/mg",
        "/\\\\(autoref|autocites?|cites?)\\{(.*?)\\}/g"
    ]
},
```

But it will still be horrible.
Probably better is to disable SpellRight for `.tex` files in favour of cspell instead.
Supposedly LanguageTool is good for more diverse languages, but I have not tried it.

Alternatively, here is a new entrant, LTeX:

LTeX — Grammar/Spell Checker Using LanguageTool with Support for LaTeX, Markdown, and Others

LTeX provides offline grammar checking of various markup languages using LanguageTool (LT). LTeX can be used standalone as a command-line tool, as a language server using the Language Server Protocol (LSP), or directly in various editors using extensions. LTeX currently supports BibTeX, ConTeXt, LaTeX, Markdown, Org, reStructuredText, R Sweave, and XHTML documents. The difference to regular spell checkers is that LTeX not only detects spelling errors, but also many grammar and stylistic errors… A classic use case of LTeX is checking scientific LaTeX papers, but why not check your next blog post, book chapter, or long e-mail before you send it to someone else?

TBC.

The disruptive entrant in the crowded field of LaTeX extensions for VS Code is Texlab.

A Visual Studio Code extension that provides rich editing support for the LaTeX typesetting system, powered by the TexLab language server. It aims to produce high quality code completion results by indexing your used packages as you type.

Recommended matching syntax highlighter: latex-syntax.

Neat feature: supported configuration uses the modernised and streamlined tectonic distribution instead of classic LaTeX.

This extension makes some radical design choices in the name of simplicity and elegance, which might mean it is less confusing. On the other hand, it probably means that some features, edge cases and/or misfeatures of classic TeX will be unsupported. For an extension which can handle that weird workflow that you have been working on since 1997, see LaTeX Workshop.

`ext install efoerster.texlab`

The build/preview workflow is jarring; you need to start your own previewer then maintain sync, which requires custom setup depending on which unsatisfactory PDF viewer you use.
This means different settings on each OS.
After that… TBH I cannot work out how to ~~invoke the build step from inside VS Code. Or invoke the PDF viewer.~~
I cannot work out how to invoke the PDF viewer *except* via SyncTeX.
This means that, unintuitively, I invoke the viewer by the command `LaTeX: Forward Search`.

For GNOME desktops, evince-synctex works.

```
{
"texlab.forwardSearch.executable": "evince-synctex",
"texlab.forwardSearch.args": ["-f", "%l", "%p", "\"code -g %f:%l\""]
}
```

Installation by one of the following:

```
pip3 install --user https://github.com/efoerster/evince-synctex/archive/master.zip
pipx install https://github.com/efoerster/evince-synctex/archive/master.zip
```

On macOS we had best use Skim, apparently. Skim docs on TeX and PDF Synchronization reveal that Skim must be in `Check for file changes` mode (see app settings).

```
{
"texlab.forwardSearch.executable": "/Applications/Skim.app/Contents/SharedSupport/displayline",
"texlab.forwardSearch.args": ["%l", "%p", "%f"]
}
```

If you, as I, work on multiple OSes, we need to disable sync for these machine-specific settings, or it will constantly be trying to run evince on macOS or some such.

LaTeX-Workshop is the default VS Code LaTeX editing extension.
Originally I was skeptical, despite its many features, because it was confusing when things went wrong (where WAS that syntax error?), which meant it was not good at its main job, because *syntactically invalid* is the normal state of a document I am writing.

However, the new version is much better; I just need to make sure that little TeX sidebar is active and everything becomes apparent. Some things are still fiddly.

LaTeX-workshop comes with an aggressively intrusive set of keybindings that clash with so many other things that for me at least there is no sense fixing them up one-by-one. Instead, I use the following setting to switch to the alternative keymap:

` "latex-workshop.bind.altKeymap.enabled": true,`

This is not well documented; see Pull Request #983 or the FAQ entry.
The above setting changes all `Ctrl-Alt-<key>` shortcuts to `Ctrl-L, Alt-<key>`.
Translate the versions from the manual in your head.

VS Code supposedly can preview individual equations via javascript. This does not work for me on substantial documents, and I cannot work out why, because there is no error dumped anywhere I can see it. I will never find time to work out what is happening. Only simple equations in trivial test documents have preview for me.

It might be something about fancy macros or packages I use? ~~Javascript mathematics are blind to the LaTeX loaded packages, but we can globally fake out some use-cases with global MathJax extensions:~~

```
"latex-workshop.hover.preview.mathjax.extensions": [
"boldsymbol"
]
```

**UPDATE**: At some point this started working for me even with reasonably challenging macros.
I do not know what changed.

LaTeX workshop supportssmart snippets with autocomplete.

Reproduced here for my offline reference are the most useful ones.

| Prefix | Environment name |
|---|---|
| `BEQ` | `equation` |
| `BSEQ` | `equation*` |
| `BAL` | `align` |
| `BSAL` | `align*` |
| `BIT` | `itemize` |
| `BEN` | `enumerate` |
| `BSPL` | `split` |
| `BCAS` | `cases` |
| `BFR` | `frame` |
| `BFI` | `figure` |

| Prefix | Sectioning level |
|---|---|
| `SPA` | part |
| `SCH` | chapter |
| `SSE` | section |
| `SSS` | subsection |
| `SS2` | subsubsection |
| `SPG` | paragraph |
| `SSP` | subparagraph |

The next ones are not *snippets*, apparently, but they look the same to me: @ suggestions:

| Prefix | Command |
|---|---|
| `@(` | `\left( $1 \right)` |
| `@{` | `\left\{ $1 \right\}` |
| `@[` | `\left[ $1 \right]` |
| `__` | `_{$1}` |
| `**` | `^{$1}` |
| `@8` | `\infty` |
| `@6` | `\partial` |
| `@/` | `\frac{$1}{$2}` |
| `@%` | `\frac{$1}{$2}` |
| `@_` | `\bar{$1}` |
| `@I` | `\int_{$1}^{$2}` |
| `@\|` | `\|` |
| `@\` | `\setminus` |
| `@,` | `\nonumber` |

Building using something modern or fancy?
The default build workflow is some `pdflatex` + `BibTeX` system that is OK but outdated.
I would prefer a simpler, more modern workflow that supports e.g. XeTeX.
Out of the box, neither XeTeX nor `latexmk` were available workflows for me. **UPDATE**: These are now supported.

There are two ways to set these up.

Firstly, the old-school way that the developers of LaTeX Workshop seem to dislike, but which was far and away the easiest for me: use the LaTeX-Workshop-style TeX magic comments. This looks like, e.g.

```
% !TEX program = latexmk
% !TEX options = -synctex=1 -pdf -file-line-error -halt-on-error -xelatex -outdir="%OUTDIR%" "%DOC%"
```

I put that at the start of the master document and things behave as expected.
Good.
For this particular build tool, latexmk, to actually have any advantage, one would want to disable auto-clean (a.k.a. delete everything and rebuild from scratch all the time) so that `latexmk` can be smart about rebuilds.
Looks like the following setting does that.

` "latex-workshop.latex.autoBuild.cleanAndRetry.enabled": false,`

The price one pays for this is needing to manually clean up the detritus from time to time.

The developer-preferred way of configuring the LaTeX build is difficult, verbose and error-prone, but supposedly ineffably *better*.
That kind of argument is one that I am curiously susceptible to, in that I occasionally find myself buying biodynamic yoghurt even though biodynamic is definitely nonsense.

Anyway, in the biodynamic configuration method, we make new “recipes” in the configuration JSON.
The process is 😴.
Here, cargo-cult the ones I use, which in practice are all about `latexmk`, which is generally easiest and fastest for me.

**Q:** Are they still necessary, or does modern LaTeX Workshop support `latexmk` out of the box?

```
"latex-workshop.latex.recipes": [
{
"name": "latexmk 🔃",
"tools": [
"latexmk"
]
},
{
"name": "xelatexmk 🔃",
"tools": [
"xelatexmk"
]
},
{
"name": "platexmk 🔃",
"tools": [
"platexmk"
]
}
],
"latex-workshop.latex.tools": [
{
"name": "latexmk",
"command": "latexmk",
"args": [
"-synctex=1",
"-file-line-error",
"-pdf",
"-outdir=%OUTDIR%",
"%DOC%"
],
"env": {}
},
{
"name": "platexmk",
"command": "latexmk",
"args": [
"-synctex=1",
"-file-line-error",
"-pdf",
"-outdir=%OUTDIR%",
"%DOC%"
],
"env": {}
},
{
"name": "xelatexmk",
"command": "latexmk",
"args": [
"-synctex=1",
"-file-line-error",
"-xelatex",
"-pdf", // controversial bit
"-outdir=%OUTDIR%",
"-interaction=nonstopmode",
"%DOC%"
],
"env": {}
},
]
```

Qiita, in VSCode で最高の LaTeX 環境を作る (“Creating the best LaTeX environment in VS Code”), demonstrates how to make the `latexmk` workflow more configurable, so that you can essentially *always* use `latexmk`. Alexander Zeilmann demonstrates a more minimalist one.
In practice, I do not want to learn *another* domain-specific language to configure a build system, so I am sticking with the verbose JSON configuration.

Usually I use the `platexmk` configuration out of necessity, although I prefer `xelatexmk`.
Occasionally I use vanilla `latexmk` when I have to output for some journal who really hates modern character sets.

The disruptive new tectonic build system notionally minimises the amount of configuration we need to do by being smart and defaulting to modern choices. To use tectonic with LaTeX-Workshop:

```
"latex-workshop.latex.tools": [
{
"name": "tectonic",
"command": "tectonic",
"args": [
"--synctex",
"%DOC_EXT%"
],
"env": {}
}
],
"latex-workshop.latex.recipes": [
{
"name": "tectonic",
"tools": [
"tectonic"
]
}
]
```

See also Using Tectonic with VS Code on the tectonic GitHub forum. *Pace* James Yu, it needs `"--synctex"` and `"%DOC_EXT%"` to work properly.

Now we can use the tectonic system from LaTeX-Workshop!
That is, like, *biodynamic and ketogenic at the same time*, or something.

Advanced: compile scientific workbooks such as knitr. See docs.

There are some bonus classic-flavour build recipes at the end.

SyncTeX makes discerning what I am typing somewhat easier, by connecting the cursor location in the source code to the viewing location in the PDF (and, on some systems, vice versa).
The built-in implementation in LaTeX-Workshop is OK; we are somewhat hamstrung because everything is constrained to a single window by VS Code.^{1}
External PDF viewers are not officially supported but more-or-less work for me, and are superior for a dual-monitor setup, although definitely jankier.

Minuses of external PDF viewers:

- Documentation is perfunctory.
- ~~Sync works in AFAICT one direction only — I can sync from VS Code to the PDF viewer, but the reverse does not work for me.~~
- Since they are unsupported, one needs to guess the config to make them go.

NB: I’m no fan of the default PDF viewers on Linux desktops; if I edit more TeX on Linux I will probably move to Sioyek or Zathura.

The default keyboard shortcut for syncing is `⇧⌥⌃J`, or `⌘L ⌥J` with the alternate keymap.

A good example of reverse-engineering the config for Kubuntu’s PDF viewer `okular` is given by Heiko/@miteron.

```
{
"latex-workshop.view.pdf.viewer": "external",
// @sync host=work-pc
"latex-workshop.view.pdf.external.viewer.command": "okular",
// @sync host=work-pc
"latex-workshop.view.pdf.external.viewer.args": [
"--unique",
"%PDF%"
],
// @sync host=work-pc
"latex-workshop.view.pdf.external.synctex.command": "okular",
// @sync host=work-pc
"latex-workshop.view.pdf.external.synctex.args": [
"--unique",
"%PDF%#src:%LINE%%TEX%"
],
}
```

For the GNOME doc viewer `evince`, things are more complicated.
Specifically, I needed a bridging script and special config:

```
{
// @sync host=home-pc
"latex-workshop.view.pdf.external.viewer.command": "evince2",
// @sync host=home-pc
"latex-workshop.view.pdf.external.viewer.args": [
"%PDF%"
],
// @sync host=home-pc
"latex-workshop.view.pdf.external.synctex.command": "evince_forward_search",
// @sync host=home-pc
"latex-workshop.view.pdf.external.synctex.args": [
"%PDF%",
"%LINE%",
"%TEX%"
],
}
```

For macOS, Skim.app is reliable and popular.

Backward (i.e. viewer-to-LaTeX) sync is by `⇧⌘-Click`. Documentation here.
To make it work I had to change the defaults.
Skim comes with its own VS Code preset, which uses the command `open` to open `vscode://file"%urlfile":%line`.
We can force VS Code Insiders by having `open` open `vscode-insiders://file"%urlfile":%line`.
This did not work for me on one of my machines for no discernible reason.
The slightly slower method of having Skim use `code` to open `--goto "%file":%line` worked in that case.
The following was the necessary setup from the VS Code side:

```
"latex-workshop.view.pdf.viewer": "external",
"latex-workshop.view.pdf.external.viewer.command": "/Applications/Skim.app/Contents/SharedSupport/displayline",
"latex-workshop.view.pdf.external.viewer.args": [
"0",
"%PDF%"
],
"latex-workshop.view.pdf.external.synctex.command": "/Applications/Skim.app/Contents/SharedSupport/displayline",
"latex-workshop.view.pdf.external.synctex.args": [
"-revert", // reload
"-readingbar", // highlight line
"-background", // do not steal focus
"%LINE%",
"%PDF%",
"%TEX%",
],
```

The `background` setting is contentious.
Skim.app steals input focus if it comes to the foreground, which is often not what we want.
But having it come to the foreground is also a plus, because then we can see it.

Gotcha: if there are weird “provider errors” in the VS Code `exthost` log, the problem might be that `latexindent` is not installed; it was missing for me because I installed a minimalist TeX distribution. Install it with:

`tlmgr install latexindent`

If I want to switch TeX installations, I can do so by environment variables, so I whack something like the following in my `settings.json`:

```
{
"name": "latexmk",
"command": "latexmk",
"args": [
"-synctex=1",
"-interaction=nonstopmode",
"-file-line-error",
"-pdf",
"-outdir=%OUTDIR%",
"%DOC%"
],
"env": {"TEXMFHOME": "c:/texlive/2019"} // That's how you do it
}
```

Some LaTeX-Workshop build configurations I no longer use (because latexmk is so good), but which might be useful to others.

```
"latex-workshop.latex.recipes": [
{
"name": "pdflatex ➞ bibtex ➞ pdflatex ×2",
"tools": [
"pdflatex",
"bibtex",
"pdflatex",
"pdflatex"
]
},
{
"name": "xelatex ➞ biber ➞ xelatex",
"tools": [
"xelatex",
"biber",
"xelatex"
]
}
],
"latex-workshop.latex.tools": [
{
"name": "pdflatex",
"command": "pdflatex",
"args": [
"-synctex=1",
"-interaction=nonstopmode",
"-halt-on-error",
"-file-line-error",
"%DOC%"
],
"env": {}
},
{
"name": "xelatex",
"command": "xelatex",
"args": [
"-synctex=1",
"-interaction=nonstopmode",
"-halt-on-error",
"-file-line-error",
"%DOC%"
],
"env": {}
},
{
"name": "bibtex",
"command": "bibtex",
"args": [
"%DOCFILE%"
],
"env": {}
},
{
"name": "biber",
"command": "biber",
"args": [
"%DOCFILE%"
],
"env": {}
}
]
```

Although see VS Code’s dual window hack.↩︎

Matrix norms! Ways of measuring the “size” of a matrix, or, more typically for me, the “distance” between matrices, which is the size of one minus the other. Useful as optimisation objectives, or as metrics for comparing matrices.

I write the singular value decomposition of an \(m\times n\) matrix \(\mathrm{B}\)\[ \mathrm{B} = \mathrm{U}\boldsymbol{\Sigma}\mathrm{V}^{\top} \] in terms of unitary matrices \(\mathrm{U},\, \mathrm{V}\) and a matrix \(\boldsymbol{\Sigma}\) with non-negative diagonal entries, of respective dimensions \(m\times m,\,n\times n,\,m\times n\).

The diagonal entries of \(\boldsymbol{\Sigma}\), written \(\sigma_i(\mathrm{B})\), are the *singular values* of \(\mathrm{B}\), and the entries of \(\mathrm{B}\) itself are written \(b_{jk}\).

The absolutely classic one, defined in terms of norms on two other spaces, using the matrix as an operator mapping between them. Define two norms, one on \(\mathbb{R}^m\) and one on \(\mathbb{R}^n\). These norms induce an operator norm for matrices \(\mathrm{B}: \mathbb{R}^n \to \mathbb{R}^m\),\[ \|\mathrm{B}\|_{\text{op}}=\sup _{\|v\|\leq 1}\|\mathrm{B} v\|=\sup _{v \neq 0} \frac{\|\mathrm{B} v\|}{\|v\|}, \] where \(v \in \mathbb{R}^n\). Note that the operator norm depends on the underlying norms. When the norm on both spaces is \(\ell_p\) for some \(1\leq p \leq \infty\), i.e. \(\|v\|_p=\left(\sum_i |v_i|^p\right)^{1 / p}\), we conventionally write \(\|\mathrm{B}\|_p\). The definition of the operator norm gives rise to the following useful inequality, which was the workhorse of my functional analysis classes for proving convergence and thus equivalence of norms:\[ \|\mathrm{B} v\| \leq\|\mathrm{B}\|_{\text{op}}\|v\| \] for any \(v \in \mathbb{R}^n\).

Operator norm with \(p=2\), i.e. the largest singular value.

Famous useful relation to the Frobenius norm:\[\|\mathrm{B}\|_2 \leqslant\|\mathrm{B}\|_F \leqslant \sqrt{n}\|\mathrm{B}\|_2.\] There are more relations, though.
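
As a quick numerical sanity check of that sandwich, here is a pure-Python sketch on a hypothetical diagonal matrix, whose singular values are just the absolute diagonal entries:

```python
import math

# Hypothetical example: a 2x2 diagonal matrix diag(3, 4), whose
# singular values can be read off directly.
sigma = [3.0, 4.0]
n = 2

spectral = max(sigma)                       # ||B||_2: largest singular value
frob = math.sqrt(sum(s**2 for s in sigma))  # ||B||_F: l2 norm of the singular values

# check ||B||_2 <= ||B||_F <= sqrt(n) ||B||_2
assert spectral <= frob <= math.sqrt(n) * spectral
print(spectral, frob)  # 4.0 5.0
```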

Coincides with the \(\ell_2\) norm when the matrix happens to be a column vector.

We can define the Frobenius norm in several equivalent ways,\[ \begin{aligned} \|\mathrm{B}\|_F^2 &=\sum_{j=1}^{m}\sum_{k=1}^{n}|b_{jk}|^2\\ &=\operatorname{tr}\left(\mathrm{B}\mathrm{B}^{\top}\right)\\ &=\operatorname{tr}\left(\mathrm{B}^{\top}\mathrm{B}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B})^2 &(*)\\ &=\langle\mathrm{B},\mathrm{B}\rangle_F \end{aligned} \]
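
A minimal numerical check that the entrywise and trace definitions agree, in pure Python with a made-up \(2\times 3\) matrix:

```python
B = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # made-up 2x3 example
m, n = len(B), len(B[0])

# entrywise definition: sum of squared entries
frob_sq_entries = sum(b**2 for row in B for b in row)

# trace definition: tr(B B^T)
BBt = [[sum(B[i][k] * B[j][k] for k in range(n)) for j in range(m)]
       for i in range(m)]
frob_sq_trace = sum(BBt[i][i] for i in range(m))

assert frob_sq_entries == frob_sq_trace  # both give 91.0 here
```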

It is useful to define the Frobenius norm in terms of the Frobenius inner product that we slyly introduced above:\[ \begin{aligned} \langle\mathrm{A}, \mathrm{B}\rangle_{\mathrm{F}}=\sum_{i, j} \overline{A_{i j}} B_{i j}=\operatorname{Tr}\left(\overline{\mathrm{A}^{\top}} \mathrm{B}\right) \equiv \operatorname{Tr}\left(\mathrm{A}^{\dagger} \mathrm{B}\right). \end{aligned} \]

Handy property for partitioned matrices:\[ \begin{aligned} \left\|\left[\begin{array}{c|c} \mathrm{A}_{11} & \mathrm{A}_{12} \\ \hline \mathrm{A}_{21} & \mathrm{A}_{22} \end{array}\right]\right\|_F^2 &=\left\|\mathrm{A}_{11} \right\|_F^2 + \left\|\mathrm{A}_{12}\right\|_F^2 + \left\|\mathrm{A}_{21}\right\|_F^2 + \left\|\mathrm{A}_{22}\right\|_F^2 \end{aligned} \]

Handy property for low-rank-style symmetric products of tall, skinny matrices\[ \begin{aligned} \|\mathrm{A}\mathrm{A}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{A}\mathrm{A}^{\top}\mathrm{A}\mathrm{A}^{\top}\right) =\operatorname{tr}\left(\mathrm{A}^{\top}\mathrm{A}\mathrm{A}^{\top}\mathrm{A}\right) =\|\mathrm{A}^{\top}\mathrm{A}\|_F^2 \end{aligned} \]

Handy property for low-rank-style products of non-symmetric tall-skinny matrices:\[ \begin{aligned} \|\mathrm{A}\mathrm{B}^{\top}\|_F^2 =\operatorname{tr}\left(\mathrm{A}\mathrm{B}^{\top}\mathrm{B}\mathrm{A}^{\top}\right) =\operatorname{tr}\left((\mathrm{A}^{\top}\mathrm{A})(\mathrm{B}^{\top}\mathrm{B})\right) \end{aligned} \] This latter form does not require us to form the gigantic product \(\mathrm{A}\mathrm{B}^{\top}\), which is tall *and* wide.
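
A sketch of why this matters computationally: with tall-skinny \(\mathrm{A}\) (here \(4\times 2\)) and \(\mathrm{B}\) (here \(5\times 2\)), we can evaluate \(\|\mathrm{A}\mathrm{B}^{\top}\|_F^2\) from the small Gram matrices alone. Pure Python with made-up numbers:

```python
def matmul(X, Y):
    """Naive matrix product, adequate for a toy example."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frob_sq(X):
    return sum(x * x for row in X for x in row)

def trace(X):
    return sum(X[i][i] for i in range(len(X)))

A = [[1., 0.], [0., 1.], [1., 1.], [2., 0.]]            # 4x2
B = [[1., 2.], [0., 1.], [1., 0.], [3., 1.], [0., 2.]]  # 5x2

# direct route: form the 4x5 product A B^T and sum its squared entries
direct = frob_sq(matmul(A, transpose(B)))

# Gram route: only ever touches 2x2 matrices
gram = trace(matmul(matmul(transpose(A), A), matmul(transpose(B), B)))

assert direct == gram  # both 96.0 for these numbers
```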

This norm seems to be mostly useful because it is tractable; squared Frobenius norm is simple, and minimising squared Frobenius minimises plain Frobenius. And since it has a representation as a simple trace, we can differentiate it easily.

It is not completely silly as a norm in its own right, though. Note that line \((*)\) shows us that the Frobenius norm is equivalently the \(\ell_2\) norm of the singular value vector. This is surprising to me, since this otherwise would feel like a kinda “dumb” norm. “Pretending the matrix is a vector” feels to me like it shouldn’t work, but look! there is some kind of statistic that we might care about there. Also, the Frobenius norm bounds some other “more interpretable” norms, so it is indirectly useful.

Frobenius also induces a computationally expedient distance for low-rank-plus-diagonal matrices.

A.k.a. trace norm, Schatten 1-norm. The sum of a matrix’s singular values.

\[ \begin{aligned} \|\mathrm{B}\|_* &=\operatorname{tr}\left(\sqrt{\mathrm{B}^{\top}\mathrm{B}}\right)\\ &=\sum_{j=1}^{\min(m,n)}\sigma_{j}(\mathrm{B}) \end{aligned} \]

Generalising the nuclear and Frobenius norms. The Schatten \(p\)-norm is defined by\[ \|\mathrm{B}\|_{p}=\left(\sum _{i=1}^{\min\{m,\,n\}}\sigma _{i}^{p}(\mathrm{B})\right)^{1/p}. \]

The case \(p = 2\) yields the Frobenius norm. The case \(p =\infty\) yields the spectral norm. Finally, \(p = 1\) yields the nuclear norm, also known as the Ky Fan norm.
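
Again on a hypothetical diagonal matrix, where the singular values can be read off directly, the special cases line up. A pure-Python sketch:

```python
import math

sigma = [3.0, 4.0, 12.0]  # singular values of a hypothetical diagonal matrix

def schatten(p):
    return sum(s**p for s in sigma) ** (1.0 / p)

nuclear = schatten(1)   # p = 1: sum of singular values = 19
frob = schatten(2)      # p = 2: sqrt(9 + 16 + 144) = 13
spectral = max(sigma)   # the p -> infinity limit

assert abs(nuclear - 19.0) < 1e-9
assert abs(frob - 13.0) < 1e-9
assert abs(schatten(200) - spectral) < 0.2  # large p approaches the spectral norm
```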

🏗 Relation to exponential family and maximum likelihood.

Mark Reid: Meet the Bregman divergences:

If you have some abstract way of measuring the “distance” between any two points and, for any choice of distribution over points the mean point minimises the average distance to all the others, then your distance measure must be a Bregman divergence.

TBC

Mazumder, Hastie, and Tibshirani (2010) has a very nice little lemma (Lemma 6 in the paper) that links the nuclear norm of a matrix (i.e. the sum of its singular values) to the solution of an optimization problem involving Frobenius norms. Here is the lemma: for any matrix \(Z \in \mathbb{R}^{m \times n}\), we have\[ \|Z\|_*=\min _{U, V: Z=U V^{\top}} \frac{1}{2}\left(\|U\|_F^2+\|V\|_F^2\right). \] If \(\operatorname{rank}(Z)=k \leq \min (m, n)\), then the minimum above is attained at a factor decomposition \(Z=U_{m \times k} V_{n \times k}^{\top}\).
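
For a rank-1 matrix \(Z = u v^{\top}\) the lemma can be eyeballed numerically: the only nonzero singular value is \(\|u\|\|v\|\), and scanning over the rescaled factorizations \(U = \alpha u,\ V = v/\alpha\) of the same \(Z\) recovers it as the minimum. Pure Python with made-up vectors:

```python
import math

u = [1.0, 2.0, 2.0]  # ||u|| = 3
v = [3.0, 4.0]       # ||v|| = 5
nu = math.sqrt(sum(x * x for x in u))
nv = math.sqrt(sum(x * x for x in v))
nuclear = nu * nv    # rank-1 case: single singular value ||u|| ||v|| = 15

# scan rescaled factorizations U = a*u, V = v/a; all represent the same Z = u v^T
best = min(0.5 * ((a * nu) ** 2 + (nv / a) ** 2)
           for a in (i / 1000 for i in range(1, 5000)))

assert abs(best - nuclear) < 1e-3  # the scanned minimum matches the nuclear norm
```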

From Petersen and Pedersen (2012):

E. H. Rasmussen has, in yet unpublished material, derived and collected the following inequalities. They are collected in a table as below, assuming \(\mathrm{B}\) is \(m \times n\) and \(d=\operatorname{rank}(\mathrm{B})\):

|  | \(\|\mathrm{B}\|_{max}\) | \(\|\mathrm{B}\|_{1}\) | \(\|\mathrm{B}\|_{\infty}\) | \(\|\mathrm{B}\|_{2}\) | \(\|\mathrm{B}\|_{F}\) | \(\|\mathrm{B}\|_{KF}\) |
|---|---|---|---|---|---|---|
| \(\|\mathrm{B}\|_{max}\) |  | \(1\) | \(1\) | \(1\) | \(1\) | \(1\) |
| \(\|\mathrm{B}\|_{1}\) | \(m\) |  | \(m\) | \(\sqrt{m}\) | \(\sqrt{m}\) | \(\sqrt{m}\) |
| \(\|\mathrm{B}\|_{\infty}\) | \(n\) | \(n\) |  | \(\sqrt{n}\) | \(\sqrt{n}\) | \(\sqrt{n}\) |
| \(\|\mathrm{B}\|_{2}\) | \(\sqrt{mn}\) | \(\sqrt{n}\) | \(\sqrt{m}\) |  | \(1\) | \(1\) |
| \(\|\mathrm{B}\|_{F}\) | \(\sqrt{mn}\) | \(\sqrt{n}\) | \(\sqrt{m}\) | \(\sqrt{d}\) |  | \(1\) |
| \(\|\mathrm{B}\|_{KF}\) | \(\sqrt{mnd}\) | \(\sqrt{nd}\) | \(\sqrt{md}\) | \(d\) | \(\sqrt{d}\) |  |

which are to be read as, e.g.

\[ \|\mathrm{B}\|_2 \leq \sqrt{m} \cdot\|\mathrm{B}\|_{\infty} \]
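
A pure-Python spot check of that particular entry, using a made-up \(2\times 2\) matrix (the spectral norm comes from the closed-form eigenvalues of the symmetric matrix \(\mathrm{B}^{\top}\mathrm{B}\)):

```python
import math

B = [[1.0, 2.0],
     [3.0, 4.0]]  # made-up example
m = len(B)

# induced infinity-norm: maximum absolute row sum
inf_norm = max(sum(abs(x) for x in row) for row in B)

# spectral norm via the closed form for eigenvalues of the symmetric 2x2 B^T B
BtB = [[sum(B[k][i] * B[k][j] for k in range(2)) for j in range(2)]
       for i in range(2)]
a, b, c = BtB[0][0], BtB[0][1], BtB[1][1]
lam_max = (a + c + math.sqrt((a - c) ** 2 + 4 * b * b)) / 2
spectral = math.sqrt(lam_max)

assert spectral <= math.sqrt(m) * inf_norm  # ||B||_2 <= sqrt(m) ||B||_inf
```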


Chen, Yudong, and Yuejie Chi. 2018. “Harnessing Structures in Big Data via Guaranteed Low-Rank Matrix Estimation: Recent Theory and Fast Algorithms via Convex and Nonconvex Optimization.” *IEEE Signal Processing Magazine* 35 (4): 14–31.

Dokmanić, Ivan, and Rémi Gribonval. 2017. “Beyond Moore-Penrose Part II: The Sparse Pseudoinverse.” *arXiv:1706.08701 [Cs, Math]*, June.

Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. 2010. “Spectral Regularization Algorithms for Learning Large Incomplete Matrices.” *Journal of Machine Learning Research* 11 (80): 2287–2322.

Mercer, A. McD. 2000. “Bounds for A–G, A–H, G–H, and a Family of Inequalities of Ky Fan’s Type, Using a General Method.” *Journal of Mathematical Analysis and Applications* 243 (1): 163–73.

Minka, Thomas P. 2000. *Old and New Matrix Algebra Useful for Statistics*.

Moakher, Maher, and Philipp G. Batchelor. 2006. “Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization.” In *Visualization and Processing of Tensor Fields*, edited by Joachim Weickert and Hans Hagen, 285–98. Berlin, Heidelberg: Springer Berlin Heidelberg.

Omladič, Matjaž, and Peter Šemrl. 1990. “On the Distance Between Normal Matrices.” *Proceedings of the American Mathematical Society* 110 (3): 591–96.

Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012. “The Matrix Cookbook.”

Sharma, Rajesh. 2008. “Some More Inequalities for Arithmetic Mean, Harmonic Mean and Variance.” *Journal of Mathematical Inequalities*, no. 1: 109–14.

Portable Document Format, the abstruse and inconvenient format beloved of academics, bureaucracies and Adobe. It has the notable feature of being a better format than Microsoft Word, in much the same way that sticking your hand in a blender is better than sticking your hand in a woodchipper.

Look, they can include video games.

See PDF readers.

Tabula is a tool for liberating data tables locked inside PDF files.

pdfplumber also exists but I have not used it.

Camelot is an OpenCV-backed table extractor. It has a browser-based GUI, Excalibur.

`pip install excalibur-py`

There are both open (Tabula, pdfplumber) and closed-source tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.

**tl;dr** Tabula if you want the easiest possible experience
at the cost of some power; otherwise Camelot/Excalibur.

`camelot -o table.csv -f csv lattice file.pdf`

Commercial online PDF tool smallpdf claims to do this (USD 12/month with free trial).

A classic, very useful but kind of ugly and bizarre to use. See Ghostscript for more info; I won’t document it here because I prefer easier tools. I keep Ghostscript around because some other software I use depends upon it. 🏗️

Exception: it has one really handy thing: a command to downgrade high-version PDFs to v1.4, which is safely old:

`ps2pdf14 modern_doc.pdf old_doc.pdf`

Very helpful for version incompatibility.

Less powerful, but simpler and less fragile than ghostscript.

- QPDF: A Content-Preserving PDF Transformation System
- qpdf/qpdf: Primary QPDF source code and documentation
- QPDF Manual

QPDF is a command-line program that does structural, content-preserving transformations on PDF files. It could have been called something like pdf-to-pdf. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work. … QPDF includes support for merging and splitting PDFs through the ability to copy objects from one PDF file into another and to manipulate the list of pages in a PDF file. …

QPDF is not a PDF content creation library, a PDF viewer, or a program capable of converting PDF into other formats. In particular, QPDF knows nothing about the semantics of PDF content streams. If you are looking for something that can do that, you should look elsewhere. However, once you have a valid PDF file, QPDF can be used to transform that file in ways perhaps your original PDF creation can’t handle. For example, programs generate simple PDF files but can’t password-protect them, web-optimize them, or perform other transformations of that type.

I need to do this so often that I may yet get around to making a keyboard shortcut for it.

**tl;dr** To concatenate PDFs ~~on macOS I use `join.py`; on other~~ on POSIX systems I use qpdf.
To *split* PDFs on macOS I use Preview.

There are many ways to do this. Concatenating PDFs is where qpdf excels over ghostscript, although ghostscript is older and so has more HOWTOs.

A classic that I see around is this ghostscript command:

```
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress -sOutputFile=output.pdf input*.pdf
```

This sometimes works, and sometimes behaves badly in ways that I have not investigated — the output file can be massive, much larger than the sum of its parts. Sometimes it is lossy and the fonts are mangled.

The QPDF version is not*much* more intuitive but seems to mangle the PDF less often:

`qpdf --empty --pages *.pdf -- out.pdf`

On macOS there ~~is~~ was system PDF concatenation. **tl;dr**

```
"/System/Library/Automator/Combine PDF Pages.action/Contents/Resources/join.py" \
-o PATH/TO/YOUR/MERGED/FILE.pdf \
/PATH/TO/PDFS/*.pdf
```

It is ~~easy and~~ robust, and seems to work on any PDF that macOS understands without damaging the contents.
Also the PDFs it creates have a size approximately equal to the sum of the parts, and not mysteriously much, much larger.

**UPDATE**: now that Python no longer ships with macOS, this Python 2 script is still available but tedious to execute.
There is a right-click PDF merge in the Finder, though.

pdfunite is the poppler concatenation command, e.g. `pdfunite first.pdf second.pdf merged.pdf`. Poppler is fairly ubiquitous, and odds are good it is already installed.

You can *split* PDFs with ghostscript too, but usually you want a GUI to see what you are splitting, no?
On macOS I use Preview.app.
I do not have a favourite yet on other systems.

Antoine Chambert-Loir points out that the underdocumented command `texexec` has many PDF-editing sub-commands.
The following command extracts pages 1 to 5, page 7 and pages 8 to 12 from `file.pdf` and puts them in `outputfile.pdf`:
`texexec --pdfselect --select=1:5,7,8:12 --result=outputfile.pdf file.pdf`

There is also PDFtk, which has a handy command line. This has been ported to Java as pdftk-java and also has a GUI called PDF Chain. Might be good? Special power: extracting PDF attachments.

PDFMix and PDF Shuffler have both been recommended to me, but I have not tried them.

- oxplot/pdftilecut: pdftilecut lets you sub-divide a PDF page(s) into smaller pages so you can print them on small form printers (clever qpdf front-end)
- Printing god damn A0 poster as set of A4’s — manual CLI option
- Scaffolded Math and Science: How to Enlarge a PDF into a Multi-Page Poster for FREE! 3 Simple Steps
- PosteRazor - Make your own poster! (no longer seems to work well)

Shawn Graham suggests a `pdftotext` hack.
First, install `poppler` using your choice of package manager.
Now,

```
find /MLR -name '*.pdf' -exec sh \
-c 'pdftotext "{}" - | grep --with-filename --label="{}" \
--color "SEARCHTERM"' \;
```

PDFs have a lot of ways of storing data and many of them leave you keeping lots of crap there that you do not need for the current purpose, in the form of, presumably, inefficiently compressed images, excessively high resolution images, or miscellaneous other crap. Slimming them down to the essentials is in general complicated and context-dependent, and I do not know of a general solution. Here are some that work in some contexts.

```
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=output.pdf input.pdf
```

Wrapped up into a nice little script, ShrinkPDF (`90` is the DPI here):

`./shrinkpdf.sh in.pdf out.pdf 90`

There is also cpdf and the GUI version Densify.

Commercial online PDF tool smallpdf claims to shrink PDFs also (USD 12/month with free trial).

A peer recommendation: PDF Resizer.

Many businesses demand “wet ink” signatures on signed PDFs, i.e. that we fill the digital form in, then print it, sign it and then scan it in again. This wastes time, money and paper, and adds no discernible security to the truly shitty security mechanism that is signing things with ink. In the hypothetical case that it makes absolutely no legal difference, it could be automated.

I expect that the following methods might be much easier, cheaper, no less secure and for all purposes indistinguishable from printing and scanning a wet ink signature: We can automate fake-printing so that it at least saves time and money. Here are many scripts to simulate a round trip via the printer:

I quite like

`convert -density 100 input.pdf -rotate "$([ $((RANDOM % 2)) -eq 1 ] && echo -)0.$(($RANDOM % 4 + 5))" -attenuate 0.4 +noise Multiplicative -attenuate 0.03 +noise Multiplicative -sharpen 0x1.0 -colorspace Gray output.pdf`

The scripts get much more elaborate than that though. Please do not use them to do anything illegal. Also, if you have a serious contractual obligation that hinges upon the “wet ink” quality of the signature rather than a more substantive mechanism for verifying your identity and consent (such as cryptographic signing, and witnesses) then perhaps you should reconsider the wisdom of the entire project.

OCRMyPDF makes a scanned PDF possibly-searchable and also optimizes the size, optionally aggressively. This will not downsample, but it will get better monochrome compression than normal.

Dangerzone is more extreme still; it deliberately rasterizes the PDF and deletes all metadata to make it anonymous and safe before OCRing it to get the text out:

Dangerzone works like this: You give it a document that you don’t know if you can trust (for example, an email attachment). Inside of a sandbox, dangerzone converts the document to a PDF (if it isn’t already one), and then converts the PDF into raw pixel data: a huge list of RGB color values for each page. Then, in a separate sandbox, dangerzone takes this pixel data and converts it back into a PDF.

Read the blog post for more information.

The PDF that results is less capable and has errors etc., but is also safe(r).

pdf2svg generates editable vector diagrams from the PDF.

wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.

Note also that Weasyprint does this out of the box.

The use case is, they claim, a (presumably scientific) review: “You reviewed version A of a paper, and receive version B, and wonder what the changes are.” The tool is pdfdiff.

The professional word for laying out the book correctly for the type of binding you intend is *imposition*, apparently.
There are many expensive professional tools to do this and some scrappy free alternatives.

pdfbooklet (Linux/Windows) is one:

PdfBooklet is a Python script whose first purpose was to create booklet(s) from existing pdf files. It has been extended to many other functions in pdf pages manipulation.

Featuring

- Multiple booklets
- Add blank pages at the beginning and the end
- Adjust scale and margin.

jPDF is also recommended in some circles.

Honourable mention also to impositioner, which seems to be full-featured despite being a short command-line script. bookletimpose/pdfimposer is a Python GTK GUI/command-line combo imposer. pdfimpose is a multiple-PDF script whatsit.

Commercial options include Devalipi Imposition Studio and Montax Imposer.

There are at least two options:

None makes it clear which of `TrimBox`, `BleedBox`, `CropBox` or `ArtBox` is what I truly want. This might clarify it slightly, but I lost focus around here.

I can add crop marks to a PDF document with different PDF tools, e.g. `pdftk`:

- Export the first page with crop marks to a PDF file (your_cropmark.pdf)
- Join it with your PDF document (your_document.pdf) in the command line:

`pdftk your_document.pdf multistamp your_cropmark.pdf output result.pdf`

Or I can set PDF cropping values with Ghostscript for printing.

Create a plain text file with the right cropping values — e.g. this is 5mm crop of A4:

`[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark`

Alternatively, use the command line

`gs -c "[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark" \`

Now, convert `your_document.pdf` using the previous file (which I called `pdfmark.txt`):

```
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
$OPTIONS \
-c .setpdfwrite \
-sOutputFile=result.pdf \
-f your_document.pdf
pdfmark.txt
```

Nightmares. Colour management is generally complicated. Ghostscript colour management specifically is complicated,
and has many moving parts, specifically *rapidly* moving ones — e.g. the `-dUseCIEColor` option was removed in Ghostscript 9 because it is apparently a broken noob feature.
Its replacement is broken documentation.

I am aware this is a complicated and nuanced area with much special labour involved. But I do not care. If I am working on a project with a graphic designer then they can do this with their skill and training, but for me, I just want a document which prints adequately with some vague approximation of the colours of the screen and no errors. That means, changing to CMYK, or not. No other alterations considered.

CMYK Color conversion of RGB PDF with GhostScript:

```
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sColorConversionStrategyForImages=CMYK \
-sDEVICE=pdfwrite \
-dProcessColorModel=/DeviceCMYK \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_cmyk.pdf \
your_document.pdf
```

See also a PDF to TIFF example. Greyscale conversion is similar:

```
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_gray.pdf \
your_document.pdf
```

So many of these. There are even more that I have not reviewed, e.g. PyPDF2.

Weasyprint seems the cleanest. It converts HTML+CSS into PDF, and is written in pure Python. It can be used from the command line or programmatically.

It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.

```
pip install weasyprint
weasyprint https://weasyprint.org/ weasyprint.pdf
```

svglib provides a pure Python library that can convert SVG to PDF, and a command-line utility for same, `svg2pdf`.
Thus one can, e.g., add SVGs to PDFs in reportlab.

Apropos that, reportlab is the famed, monstrous, classic way of programmatically generating PDFs from code. It includes a modicum of typesetting. It doesn’t edit PDFs so much, but it generates them pretty well. Its integration with other things is often weak — if you thought that inserting LaTeX equations or HTML snippets would be simple, it is not. On the other hand it has fancy features such as its own chart generation library. On the third hand, there are better, more widely supported charting libraries that it doesn’t use. Litmus test: use it if the following feels to you like a natural way to print two columns:

```
from reportlab.platypus import (
BaseDocTemplate,
Frame,
Paragraph,
PageBreak,
PageTemplate )
from reportlab.lib.styles import getSampleStyleSheet
import random
words = (
"lorem ipsum dolor sit amet consetetur "
"sadipscing elitr sed diam nonumy eirmod "
"tempor invidunt ut labore et").split()
styles=getSampleStyleSheet()
Elements=[]
doc = BaseDocTemplate(
'basedoc.pdf',
showBoundary=1)
#Two Columns
frame1 = Frame(
doc.leftMargin,
doc.bottomMargin,
doc.width/2-6,
doc.height,
id='col1')
frame2 = Frame(
doc.leftMargin+doc.width/2+6,
doc.bottomMargin,
doc.width/2-6,
doc.height,
id='col2')
Elements.append(
Paragraph(
" ".join([random.choice(words) for i in range(1000)]),
styles['Normal']))
doc.addPageTemplates([
PageTemplate(id='TwoCol',frames=[frame1,frame2]),
])
#start the construction of the pdf
doc.build(Elements)
```

pdfrw is a Python library and utility that reads and writes PDF files:

- Operations include subsetting, merging, rotating, modifying metadata, etc. […]
- Can be used either standalone, or in conjunction with reportlab to reuse existing PDFs in new ones

Here is a gentle HOWTO. You can use it to put matplotlib plots in reportlab PDFs, getting the best of two bad worlds.

scribus is a reasonable open-source desktop publishing tool. If your content is not amenable to automatic layout, it is a good choice, for e.g. posters. It includes a Python API, albeit a reputedly quirky one, which is AFAICT Python 2. For all that, it’s a simple and interactive way of generating PDFs programmatically, so might be worth it.

My current smart pen is a *Livescribe*, a ball-point pen that remembers what I write if I use special paper.
Livescribe is a company which uses Anoto dot patterns.
There were a few other pens which used this technology; I believe Livescribe is the last one standing.

NB: The company is still in operation. Sometimes it is hard to tell, because they seem to be bad at communicating this idea; the US site is currently operational (although it seemed to have no stock for a while there), but international sites, last time I checked, were ghost towns, had no information about what was going on, and would refuse to ship products to you.

The Livescribe is a good stylus input. Some models have some bonus features I don’t use, such as audio recording. You can find more about such features on the internet, but I will say nothing further on that due to not caring. I want to write stuff and digitize.

The experience of writing itself is smooth. The pen has a big memory, so I do not need to sync to smartphone often.

The major weakness is that the smartphone app is *obnoxiously* bad: awful and clunky, it crashes regularly and is infrequently updated. ~~A real mean sting is that it plays mandatory advertisements for other Livescribe products before it lets me use the one I just now paid a lot of money for.~~ ~~The former problems I could attribute to incompetence, but playing mandatory advertising in a way that degrades the usefulness of the product is malice, and substantially increases the likelihood that I will switch brands to a competitor.~~
The app no longer plays mandatory advertising for other Livescribe products before it lets me use the one I just now paid a lot of money for.

It transfers data to a special app (iOS/Android) via Bluetooth, then via a cloud service such as Dropbox, Google Drive, Evernote or OneNote to other devices. Another annoyance: there are no end-to-end encrypted sync services on the list of cloud sync services, so everything I write using the pen can be assumed to be subpoena-vulnerable and spook-readable in some jurisdiction or other.

There is *no* option on that list that simply transfers files to your hard drive.
It has to go via the hard drive of some unaccountable third party.
As such, do not use it to write notes on anything if you do not want police and spooks reading it.

~~People have been complaining about the app for 7+ years now and it has remained largely unchanged apart from fixes for crashing bugs and compatibility updates, so do not buy the pen in the hope that the app will improve.
If this were a priority for the company they would have demonstrated that before now.~~
There is a “share” option which allows manual, (more?) secure export via other apps.

A real frustration is that the writing experience for this device is so good, so simple and easy and intuitive, that it makes the unnecessary pain points introduced by their terrible software even more vexing by contrast.

This is*so close* to an amazing, luxurious experience, but they dropped the ball just before the goal.
It is as if you are an aspirational cook, and a Michelin star chef lends you their staff and kitchen and you can use it all you want, but it is on fire.

**UPDATE:** There is a desktop app called *Livescribe+* which circumvents some of my criticism (e.g. going via the cloud for no reason) and works on the more recent models. I have not tried it yet.

Aside: what is with the app naming? If I bought a laptop and it came with an app called “Laptop+” that was essential to executing basic functions, I would be suspicious.

**tl;dr**
Great hardware, awful software, the company doesn’t seem to care about non-USA customers sufficiently to reliably supply them.

Here are some links I’ve needed for the Livescribe; they are probably obsolete: one of Livescribe’s many irritating habits is suddenly migrating content to new places without redirects.

- ~~Software downloads~~ deleted without forwarding link.
- Un-pairing Livescribe 3 from a device or multiple devices (e.g. because your phone died).
- ~~New dot paper for self-printing.~~ They deleted that and left no forwarding address. Here it is: Printing free Livescribe dot paper and controls online.
- Ink cartridges are somewhat expensive and take a long time to arrive, so I have learnt how to unclog ballpoints in order to keep them going. Soaking them in isopropyl alcohol seems to have the best success rate.

Question: how bad a security breach is it to lose one of these pens? I suspect very bad, since it will remember a LOT and has essentially no authentication.

I really love the Anoto dot pen technology. Their hardware is wonderful. I wonder if there is a company which is better at deploying it than Livescribe? Apparently Anoto have made a few mis-steps in their commercialisation and generally do not have a better spin-off technology licensee, or at least, searching for the alternatives I found on Wikipedia led only to discontinued products.

Various ways of writing to get input into the computer. A fuzzy category. I have divided it into single-purpose graphics pads (e.g. Wacom pens), tablet computers (e.g. iPads) and smartpens, which write on paper but then record what happened using some clever tricks.

Various input pads! Wacom is the most recognisable brand, but they are crazy expensive and their drivers are awful on macOS, working unreliably and aggressively enforcing obsolescence. Their competitors’ drivers can surely be no worse, and their tablets cost less. Which competitor is least worst?

- DesignModo’s Wacom Alternatives showdown (**tl;dr** Huion seems OK)
- Of course Reddit has an opinion
- The Huion tablets seem to work on Linux in 2020 at least (1, 2). If you exist in a different year, good luck to you.

Richard Zach recommends Styluslabs Write as the input app, a minimalist sketchpad app that looks convenient for maths and other textual-type stuff. Windows/Mac/Linux.

A stylus pad which is also a screen. Some of these are popular amongst my colleagues, since they allow you to see what you are writing as you write it, and to annotate existing documents. They have become more affordable recently. Cheap entry-level options which I have not used myself:

Ones that require multiple plugs are suspect: USB+HDMI+power is a lot of cables to manage. A single USB-C cable should suffice these days, right?

Or possibly you already own a touchscreen if you own a tablet computer. See next.

Anyway, I use the Huion Kamvas 13-inch 2022, which is a high-accuracy stylus input.

Pros: cheap, adequate.

Cons: display could be brighter. The single-cable USB-C connection doesn’t work under macOS, so I need to use the “3-in-1” cable, which requires additionally plugging in elsewhere for power and occupying an HDMI port.

I am fond of Android-tablet-as-screen, since it is a cheap device with touch interface and visual feedback. In particular, many eReaders are Android devices which could notionally share their touchscreen with a laptop.

There are various apps that share Android tablets to Mac/Windows desktops if you want to link to a desktop app. None for Linux desktop though, AFAICS.

As stand-alone devices I quite like the workflow of these computers.

I have had poor luck using these tablet computers for real-time input to a desktop machine, as one might use a graphics pad. In the absence of a native app, one might hope to point the internal browser at some webapp for strategic functionality, e.g. an online whiteboard. This sounds like a good idea, although in practice it has not worked well for me. I have an Android tablet right here and I am logged in to Excalidraw via the tablet browser. It is not great. Without keyboard shortcuts the workflow is clunky, and the responsiveness via the Android browser is poor. Maybe if there were a native app this would be better? Or if it were not roundtripping via servers in Singapore or something? Or if there were fancier handling of concurrency?

Slightly left-field. The most hip device for tablet-computer-as-graphics-tablet input seems to be the reMarkable, which has many plugins doing clever things with its open API, so it occupies an intermediate niche, sharing much with classic graphics pads as well as tablet computers.

Richard Zach is once again helpful, settling on a hacky solution that gets output from his reMarkable tablet via srvfb. That particular solution is discontinued, but there seem to be several like it.

How can I combine a real paper notebook with digitization? i.e. can I have a pen that writes on paper but also digitises the pen strokes? Great questions, self. Yes, yes I can.

Previously the Livescribe was the only prominent option; now there are many. See various listicle reviews, e.g. Best smart pen 2021: Tools for smarter note-taking.

I am not a massive fan of ballpoints, but AFAICT that is the only game in town. (A smart fountain pen seems too niche, but if you are in the marketing department for one of these companies…)

I use this one a lot. The fact that it is functional despite its crappy software is a testament to the engineering triumph it represents. See Livescribe.

TBD. Moleskine makes smart pens that look a lot like the Livescribe. They are apparently available globally. Tech looks very similar to the Livescribe in practice. Although Livescribe had a decade-long head start they did not spend it wisely in developing good software.

TBD. Portable wacom input device meets notepad. Cheapest option amongst the smartpens, it seems. The input tech uses a special wacom tablet backing but any old paper, so the cost of the consumables is presumably low.

Dictation or eye tracking systems.

Turning my scribbles on normal paper into markup, a.k.a. “Reverse LaTeX”.

**tl;dr** Riding, as compared to driving a car, makes me rich and happy. YMMV.

I ride to work, and in fact most places.^{1}
This is a good deal.
Let me run the numbers.

My bike costs about AUD500 in maintenance per year, maybe up to AUD1000 if I want to factor in an amortized replacement cost.

Owning a small car costs about AUD6200 per year, plus AUD0.16/km driven, plus parking fees. For my modest commuting needs, the fuel costs would be something like AUD500 at that price. Parking where I work is astronomically expensive; I won’t even include it here because it would bias the numbers against cars too much.

But wait!
The decisive cost for me in driving a car is time.
By riding everywhere I log about 2-6 hours of cardio every week (call it 3), and generally feel great.
If I drove places instead in Sydney traffic, I would save some (but not much) time in transit, but every moment of that transit would be dead time, doing an activity I do not enjoy instead of one that I both enjoy and need.
I would arrive at work sleepy, and I would *still* want to do the cardio I missed so I could feel alive.
Realistically I would save maybe 1 hour of travel time, but spend 3 more hours in the gym, for a net loss of 2 hours every week from my precious brief life, not even counting that gym cardio is joyless compared to doing a real activity in the world.
Assuming median Australian wages, the value of the lost time driving a car everywhere is about AUD4000 annually.

Thus, my choice to ride a bike is worth around AUD9500 per year to me, compared to driving a car: around 26 bucks per day. That is before I factor in the vitamin D and sleep-regulation benefits of getting outdoors, the sense of wellbeing that comes from moving my body, and the tight buns. Let’s assume that if we lump those positive factors in with some negative ones (the increased risk of injury in traffic accidents, say), then in aggregate those factors are a wash. Then we are done. Riding a bike is worth AUD26/day, paid out to me in the coin of avoided cost and happiness.
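
For the terminally spreadsheet-averse, the same arithmetic as a script. The figures are the rough ones quoted above; with the upper-bound bike cost the total comes out a touch above my in-text AUD9500, which tells you how approximate all of this is.

```python
# Back-of-envelope annual value of riding vs driving (all figures AUD/year,
# taken from the text above; everything is rough).
bike_cost = 1000    # maintenance plus amortized replacement, upper bound
car_fixed = 6200    # fixed cost of owning a small car
car_fuel = 500      # fuel for a modest commute
time_value = 4000   # ~2 net lost hours/week valued at median Australian wages

annual_value = car_fixed + car_fuel + time_value - bike_cost
per_day = annual_value / 365
print(f"riding is worth about AUD{annual_value}/year, AUD{per_day:.0f}/day")
```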

The collected jwz bicycle wisdom is one guide to no-bullshit biking. Top tips include

“City bikes” and “road bikes” are designed for some Jetsons-slick hypothetical future city that I’ve never seen. Or maybe for the bike paths in Los Altos or something. Here in real cities, roads are shit, and if you want your wheels and tires to survive curbs and potholes, you need a hybrid. They’re a little heavier and a little slower. Are you racing? No? Then you don’t care.

Safety: I follow the Zodiac approach: always assume the cars can see you perfectly, and are trying to kill you. If an intersection seems iffy, use the sidewalk and crosswalks. If big streets like Market and Van Ness freak you out, there are always less trafficky ways to go, or just stay on the sidewalks.

Do whatever you need to do to feel safe. You have nobody to impress.

- Ren Willis’s 5 Laws Of Bicycling Survival
- Don't Salmon, Don't Shoal: Learning The Lingo Of Safe Cycling

Fixing your bike? The 90s-tastic bike resource, useful, lucid and guilelessly messy, is the marvellous Sheldon Brown’s how-to guide (e.g. How do I work with cantilever brakes?). As bike technology moves forward it is getting less useful.

These look like some fun links:

- Freshly out: Scout Bike Alarm & Finder by Knog is an iOS-integrated one.
- Cycling Weekly, Best GPS Bike Trackers: find and follow your stolen bike
- Sherlock: The ultimate GPS anti-theft device for bikes
- The 4 Best Bike Trackers for Catching Thieves Red Handed
- Cyclingnews, Best bike GPS trackers: Give yourself the best chance of a stolen-bike reunion
- road.cc, Review: Vodafone Curve bike light & GPS tracker
- Vodafone’s new bike light says a lot about how much cycling has grown

Ding lights are Australian-designed lights that include an integrated downlight to illuminate both the road and the cyclist, assisting the cyclist and approaching drivers. AUD170. Review: lighting is excellent; I wish they needed less frequent recharging.

Knog does bike lights with well-designed removable clips.

Omafiets has pretty much everything I could want in the realm of bike bits. They tend towards the boutique hipster accessories which can get quite pricey. Their rates for parts are good by Sydney standards though.

~~Linus bikes and their attempt at non-ugly bike luggage.~~ No longer a thing in Australia; the American site claims to ship to Australia but won’t actually let you choose it as a destination on the shipping page.

Wiggle is a classic cycling stuff vendor which is not cheap but at least cheaper than Australia.

Essential in these desertified, windy times. See goggles.

- Railtrails
- Bikemap Apps
- Mountain Biking & Cycling NSW
- Trailforks
- The Omafiets mob are blogging their favourite routes around Sydney
- Prologue 500 is a 500km race near Canberra

A.k.a. *Those original 80s/90s solar-powered computer bikes that I always forget the names of*.

The names I’m looking for are… Winnebiko and BEHEMOTH, created by the Technomad, Steven K. Roberts, a.k.a. microship. He went on to make boats that are even more ridiculously geeked-out than his bikes were.

Borrell, Brendan. 2016.“The Bicycle Problem That Nearly Broke Mathematics.”*Nature News* 535 (7612): 338.

Wilson, David Gordon, Theodor Schmidt, and Jim Papadopoulos. 2020.*Bicycling Science*. Fourth edition. Cambridge, Massachusetts ; London, England: The MIT Press.

I do sometimes drive, mind, but usually when I move furniture or music gear. For that, I rent a van. That cost does not factor into my calculations here, because if I owned a car it would not be a van, so I would *still* be renting a van to move furniture. Good options for renting vehicles in Sydney include GoGet and Car Next Door.↩︎

Placeholder.

AFAICS, generative models using score matching to learn and Langevin MCMC to sample, with various tricks needed to do it with successive denoising steps, and an interpretation in terms of SDEs. I am vaguely aware that this oversimplifies a rich and interesting history in which many useful techniques converged, but I am not invested enough to reconstruct the details.

Denoising score matching builds on Hyvärinen (2005). See score matching or McAllester (2023) for an introduction.
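
As a concrete toy, here is a minimal NumPy sketch of a denoising score matching objective in the style of Vincent (2011). The linear score model and closed-form minimiser are my illustrative choices, not anything from the cited papers; real models use neural networks and SGD.

```python
import numpy as np

# Fit a linear score model s_theta(x) = theta * x to standard normal data.
# DSM target: s_theta(x + sigma*eps) should match -eps/sigma on average.
rng = np.random.default_rng(0)
n, sigma = 1_000_000, 0.1
x = rng.standard_normal(n)        # "data"
eps = rng.standard_normal(n)      # noise
x_noisy = x + sigma * eps

# The DSM loss E||theta * x_noisy + eps/sigma||^2 is quadratic in theta,
# so the minimiser is available in closed form (no SGD needed here).
theta = -np.mean(x_noisy * eps / sigma) / np.mean(x_noisy**2)

# The true score of the noisy marginal N(0, 1 + sigma^2) is -x/(1+sigma^2),
# so theta should land near -1/(1+sigma^2) ≈ -0.99.
print(theta)
```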

- Lilian Weng, What are Diffusion Models?
- Yang Song, Generative Modeling by Estimating Gradients of the Data Distribution
- Sander Dieleman, Diffusion models are autoencoders
- Denoising Diffusion-based Generative Modeling: Foundations and Applications
- Tutorial on Denoising Diffusion-based Generative Modeling: Foundations and Applications
- What’s the score? (Review of latest Score Based Generative Modeling papers.)
- Anil Ananthaswamy, The Physics Principle That Inspired Modern AI Art

Suggestive connection to thermodynamics (Sohl-Dickstein et al. 2015).

Anderson, Brian D. O. 1982.“Reverse-Time Diffusion Equation Models.”*Stochastic Processes and Their Applications* 12 (3): 313–26.

Dhariwal, Prafulla, and Alex Nichol. 2021.“Diffusion Models Beat GANs on Image Synthesis.”*arXiv:2105.05233 [Cs, Stat]*, June.

Dutordoir, Vincent, Alan Saul, Zoubin Ghahramani, and Fergus Simpson. 2022.“Neural Diffusion Processes.” arXiv.

Han, Xizewen, Huangjie Zheng, and Mingyuan Zhou. 2022.“CARD: Classification and Regression Diffusion Models.” arXiv.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020.“Denoising Diffusion Probabilistic Models.”*arXiv:2006.11239 [Cs, Stat]*, December.

Hoogeboom, Emiel, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, and Tim Salimans. 2021.“Autoregressive Diffusion Models.”*arXiv:2110.02037 [Cs, Stat]*, October.

Hyvärinen, Aapo. 2005.“Estimation of Non-Normalized Statistical Models by Score Matching.”*The Journal of Machine Learning Research* 6 (December): 695–709.

Jalal, Ajil, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G Dimakis, and Jon Tamir. 2021.“Robust Compressed Sensing MRI with Deep Generative Priors.” In*Advances in Neural Information Processing Systems*, 34:14938–54. Curran Associates, Inc.

Jolicoeur-Martineau, Alexia, Rémi Piché-Taillefer, Ioannis Mitliagkas, and Remi Tachet des Combes. 2022.“Adversarial Score Matching and Improved Sampling for Image Generation.” In.

McAllester, David. 2023.“On the Mathematics of Diffusion Models.” arXiv.

Nichol, Alex, and Prafulla Dhariwal. 2021.“Improved Denoising Diffusion Probabilistic Models.”*arXiv:2102.09672 [Cs, Stat]*, February.

Pascual, Santiago, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, and Joan Serrà. 2022.“Full-Band General Audio Synthesis with Score-Based Diffusion.” arXiv.

Sharrock, Louis, Jack Simons, Song Liu, and Mark Beaumont. 2022.“Sequential Neural Score Estimation: Likelihood-Free Inference with Conditional Score Based Diffusion Models.” arXiv.

Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015.“Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.”*arXiv:1503.03585 [Cond-Mat, q-Bio, Stat]*, November.

Song, Jiaming, Chenlin Meng, and Stefano Ermon. 2021.“Denoising Diffusion Implicit Models.”*arXiv:2010.02502 [Cs]*, November.

Song, Yang, Conor Durkan, Iain Murray, and Stefano Ermon. 2021.“Maximum Likelihood Training of Score-Based Diffusion Models.” In*Advances in Neural Information Processing Systems*.

Song, Yang, and Stefano Ermon. 2020a.“Generative Modeling by Estimating Gradients of the Data Distribution.” In*Advances In Neural Information Processing Systems*. arXiv.

———. 2020b.“Improved Techniques for Training Score-Based Generative Models.” In*Advances In Neural Information Processing Systems*. arXiv.

Song, Yang, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. 2019.“Sliced Score Matching: A Scalable Approach to Density and Score Estimation.” arXiv.

Song, Yang, Liyue Shen, Lei Xing, and Stefano Ermon. 2022.“Solving Inverse Problems in Medical Imaging with Score-Based Generative Models.” In. arXiv.

Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2022.“Score-Based Generative Modeling Through Stochastic Differential Equations.” In.

Swersky, Kevin, Marc’Aurelio Ranzato, David Buchman, Nando D. Freitas, and Benjamin M. Marlin. 2011.“On Autoencoders and Score Matching for Energy Based Models.” In*Proceedings of the 28th International Conference on Machine Learning (ICML-11)*, 1201–8.

Torres, Susana Vázquez, Philip J. Y. Leung, Isaac D. Lutz, Preetham Venkatesh, Joseph L. Watson, Fabian Hink, Huu-Hien Huynh, et al. 2022.“De Novo Design of High-Affinity Protein Binders to Bioactive Helical Peptides.” bioRxiv.

Vincent, Pascal. 2011.“A connection between score matching and denoising autoencoders.”*Neural Computation* 23 (7): 1661–74.

Watson, Joseph L., David Juergens, Nathaniel R. Bennett, Brian L. Trippe, Jason Yim, Helen E. Eisenach, Woody Ahern, et al. 2022.“Broadly Applicable and Accurate Protein Design by Integrating Structure Prediction Networks and Diffusion Generative Models.” bioRxiv.

Yang, Ling, Zhilong Zhang, Shenda Hong, Runsheng Xu, Yue Zhao, Yingxia Shao, Wentao Zhang, Ming-Hsuan Yang, and Bin Cui. 2022.“Diffusion Models: A Comprehensive Survey of Methods and Applications.” arXiv.

Junction for various Bayesian methods where the estimands are functions over some continuous argument space.

I would like to read Terenin on GPs on Manifolds, who also makes a suggestive connection to SDEs, which is the filtering GPs trick again.

🏗

See neural processes.

Alexanderian, Alen. 2021.“Optimal Experimental Design for Infinite-Dimensional Bayesian Inverse Problems Governed by PDEs: A Review.”*arXiv:2005.12998 [Math]*, January.

Bostan, E., U. S. Kamilov, M. Nilchian, and M. Unser. 2013.“Sparse Stochastic Processes and Discretization of Linear Inverse Problems.”*IEEE Transactions on Image Processing* 22 (7): 2699–2710.

Bui-Thanh, Tan, Omar Ghattas, James Martin, and Georg Stadler. 2013.“A Computational Framework for Infinite-Dimensional Bayesian Inverse Problems Part I: The Linearized Case, with Application to Global Seismic Inversion.”*SIAM Journal on Scientific Computing* 35 (6): A2494–2523.

Dashti, Masoumeh, Stephen Harris, and Andrew Stuart. 2011.“Besov Priors for Bayesian Inverse Problems.” arXiv.

Dashti, Masoumeh, and Andrew M. Stuart. 2015.“The Bayesian Approach To Inverse Problems.”*arXiv:1302.6989 [Math]*, July.

Dubrule, Olivier. 2018.“Kriging, Splines, Conditional Simulation, Bayesian Inversion and Ensemble Kalman Filtering.” In*Handbook of Mathematical Geosciences: Fifty Years of IAMG*, edited by B.S. Daya Sagar, Qiuming Cheng, and Frits Agterberg, 3–24. Cham: Springer International Publishing.

Grigorievskiy, Alexander, Neil Lawrence, and Simo Särkkä. 2017.“Parallelizable Sparse Inverse Formulation Gaussian Processes (SpInGP).” In*arXiv:1610.08035 [Stat]*.

Jo, Hyeontae, Hwijae Son, Hyung Ju Hwang, and Eun Heui Kim. 2020.“Deep Neural Network Approach to Forward-Inverse Problems.”*Networks & Heterogeneous Media* 15 (2): 247.

Knapik, B. T., A. W. van der Vaart, and J. H. van Zanten. 2011.“Bayesian Inverse Problems with Gaussian Priors.”*The Annals of Statistics* 39 (5).

Lasanen, S, and L Roininen. 2005.“Statistical Inversion with Green’s Priors.” In*Proceedings of the 5th International Conference on Inverse Problems in Engineering: Theory and Practice, Cambridge, UK*, 11.

Lassas, Matti, and Samuli Siltanen. 2004.“Can One Use Total Variation Prior for Edge-Preserving Bayesian Inversion?”*Inverse Problems* 20 (5): 1537–63.

Liu, Xiao, Kyongmin Yeo, and Siyuan Lu. 2020.“Statistical Modeling for Spatio-Temporal Data From Stochastic Convection-Diffusion Processes.”*Journal of the American Statistical Association* 0 (0): 1–18.

Louizos, Christos, Xiahan Shi, Klamer Schutte, and Max Welling. 2019.“The Functional Neural Process.” In*Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Magnani, Emilia, Nicholas Krämer, Runa Eschenhagen, Lorenzo Rosasco, and Philipp Hennig. 2022.“Approximate Bayesian Neural Operators: Uncertainty Quantification for Parametric PDEs.” arXiv.

Mosegaard, Klaus, and Albert Tarantola. 2002.“Probabilistic Approach to Inverse Problems.” In*International Geophysics*, 81:237–65. Elsevier.

Perdikaris, Paris, and George Em Karniadakis. 2016.“Model inversion via multi-fidelity Bayesian optimization: a new paradigm for parameter estimation in haemodynamics, and beyond.”*Journal of the Royal Society, Interface* 13 (118): 20151107.

Petra, Noemi, James Martin, Georg Stadler, and Omar Ghattas. 2014.“A Computational Framework for Infinite-Dimensional Bayesian Inverse Problems, Part II: Stochastic Newton MCMC with Application to Ice Sheet Flow Inverse Problems.”*SIAM Journal on Scientific Computing* 36 (4): A1525–55.

Phillips, Angus, Thomas Seror, Michael John Hutchinson, Valentin De Bortoli, Arnaud Doucet, and Emile Mathieu. 2022.“Spectral Diffusion Processes.” In.

Pielok, Tobias, Bernd Bischl, and David Rügamer. 2023.“Approximate Bayesian Inference with Stein Functional Variational Gradient Descent.” In.

Pikkarainen, Hanna Katriina. 2006.“State Estimation Approach to Nonstationary Inverse Problems: Discretization Error and Filtering Problem.”*Inverse Problems* 22 (1): 365–79.

Pinski, F. J., G. Simpson, A. M. Stuart, and H. Weber. 2015.“Kullback-Leibler Approximation for Probability Measures on Infinite Dimensional Spaces.”*SIAM Journal on Mathematical Analysis* 47 (6): 4091–4122.

Sigrist, Fabio, Hans R. Künsch, and Werner A. Stahel. 2015.“Spate : An R Package for Spatio-Temporal Modeling with a Stochastic Advection-Diffusion Process.” Application/pdf.*Journal of Statistical Software* 63 (14).

Singh, Gautam, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. 2019.“Sequential Neural Processes.”*arXiv:1906.10264 [Cs, Stat]*, June.

Song, Yang, Liyue Shen, Lei Xing, and Stefano Ermon. 2022.“Solving Inverse Problems in Medical Imaging with Score-Based Generative Models.” In. arXiv.

Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2022.“Score-Based Generative Modeling Through Stochastic Differential Equations.” In.

Sun, Shengyang, Guodong Zhang, Jiaxin Shi, and Roger Grosse. 2019.“Functional Variational Bayesian Neural Networks.” In.

Tran, Ba-Hien, Simone Rossi, Dimitrios Milios, and Maurizio Filippone. 2022.“All You Need Is a Good Functional Prior for Bayesian Deep Learning.”*Journal of Machine Learning Research* 23 (74): 1–56.

Unser, M. 2015.“Sampling and (Sparse) Stochastic Processes: A Tale of Splines and Innovation.” In*2015 International Conference on Sampling Theory and Applications (SampTA)*, 221–25.

Unser, Michael A., and Pouya Tafti. 2014.*An Introduction to Sparse Stochastic Processes*. New York: Cambridge University Press.

Unser, M., P. D. Tafti, A. Amini, and H. Kirshner. 2014.“A Unified Formulation of Gaussian Vs Sparse Stochastic Processes - Part II: Discrete-Domain Theory.”*IEEE Transactions on Information Theory* 60 (5): 3036–51.

Unser, M., P. D. Tafti, and Q. Sun. 2014.“A Unified Formulation of Gaussian Vs Sparse Stochastic Processes—Part I: Continuous-Domain Theory.”*IEEE Transactions on Information Theory* 60 (3): 1945–62.

Valentine, Andrew P, and Malcolm Sambridge. 2020a.“Gaussian Process Models—I. A Framework for Probabilistic Continuous Inverse Theory.”*Geophysical Journal International* 220 (3): 1632–47.

———. 2020b.“Gaussian Process Models—II. Lessons for Discrete Inversion.”*Geophysical Journal International* 220 (3): 1648–56.

Wang, Ziyu, Tongzheng Ren, Jun Zhu, and Bo Zhang. 2018.“Function Space Particle Optimization for Bayesian Neural Networks.” In.

Watson, Joe, Jihao Andreas Lin, Pascal Klink, and Jan Peters. 2020.“Neural Linear Models with Functional Gaussian Process Priors.” In, 10.

Yang, Liu, Xuhui Meng, and George Em Karniadakis. 2021.“B-PINNs: Bayesian Physics-Informed Neural Networks for Forward and Inverse PDE Problems with Noisy Data.”*Journal of Computational Physics* 425 (January): 109913.

Yang, Liu, Dongkun Zhang, and George Em Karniadakis. 2020.“Physics-Informed Generative Adversarial Networks for Stochastic Differential Equations.”*SIAM Journal on Scientific Computing* 42 (1): A292–317.

Zammit-Mangion, Andrew, Michael Bertolacci, Jenny Fisher, Ann Stavert, Matthew L. Rigby, Yi Cao, and Noel Cressie. 2021.“WOMBAT v1.0: A fully Bayesian global flux-inversion framework.”*Geoscientific Model Development Discussions*, July, 1–51.

Zhang, Dongkun, Ling Guo, and George Em Karniadakis. 2020.“Learning in Modal Space: Solving Time-Dependent Stochastic PDEs Using Physics-Informed Neural Networks.”*SIAM Journal on Scientific Computing* 42 (2): A639–65.

Regression using non-Gaussian random fields. Generalised Gaussian process regression.

Is there ever an actual need for this?
Or can we just use a mostly-Gaussian process with some non-Gaussian marginal distribution and pretend, via GP quantile regression, some variational GP approximation, or a non-Gaussian likelihood over Gaussian latents?
Presumably if we suspect that moments higher than the second are important, or that there is some actual stochastic process that we *know* matches our phenomenon, we might bother with this, but oh my, it can get complicated.

TODO: example, maybe using sparse stochastic process priors. Neural process regression; is that distinct (Singh et al. 2019)?

Bostan, E., U. S. Kamilov, M. Nilchian, and M. Unser. 2013.“Sparse Stochastic Processes and Discretization of Linear Inverse Problems.”*IEEE Transactions on Image Processing* 22 (7): 2699–2710.

Louizos, Christos, Xiahan Shi, Klamer Schutte, and Max Welling. 2019.“The Functional Neural Process.” In*Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Singh, Gautam, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. 2019.“Sequential Neural Processes.”*arXiv:1906.10264 [Cs, Stat]*, June.

Unser, M. 2015.“Sampling and (Sparse) Stochastic Processes: A Tale of Splines and Innovation.” In*2015 International Conference on Sampling Theory and Applications (SampTA)*, 221–25.

Unser, Michael A., and Pouya Tafti. 2014.*An Introduction to Sparse Stochastic Processes*. New York: Cambridge University Press.

Unser, M., P. D. Tafti, A. Amini, and H. Kirshner. 2014.“A Unified Formulation of Gaussian Vs Sparse Stochastic Processes - Part II: Discrete-Domain Theory.”*IEEE Transactions on Information Theory* 60 (5): 3036–51.

Unser, M., P. D. Tafti, and Q. Sun. 2014.“A Unified Formulation of Gaussian Vs Sparse Stochastic Processes—Part I: Continuous-Domain Theory.”*IEEE Transactions on Information Theory* 60 (3): 1945–62.

Gaussian process regression, but with neural nets approximating the kernel function and some other tricky bits. This has been pitched to me specifically as a meta-learning technique.

Jha et al. (2022):

The uncertainty-aware Neural Process Family (NPF) (Garnelo, Rosenbaum, et al. 2018; Garnelo, Schwarz, et al. 2018) aims to address the aforementioned limitations of the Bayesian paradigm by exploiting the function approximation capabilities of deep neural networks to learn a family of real-world data-generating processes, a.k.a., stochastic Gaussian processes (GPs) (Rasmussen and Williams 2006). Neural processes (NPs) define uncertainties in predictions in terms of a conditional distribution over functions given the context (observations) \(C\) drawn from a distribution of functions. Here, each function \(f\) is parameterized using neural networks and can be thought of as capturing an underlying data-generating stochastic process.

To model the variability of \(f\) based on the variability of the generated data, NPs concurrently train and test their learned parameters on multiple datasets. This endows them with the capability to meta-learn their predictive distributions over functions. The meta-learning setup makes NPs fundamentally distinguished from other non-Bayesian uncertainty-aware learning frameworks like stochastic GPs. NPF members thus combine the best of meta-learners, GPs and neural networks. Like GPs, NPs learn a distribution of functions, quickly adapt to new observations, and provide uncertainty measures given test-time observations. Like neural networks, NPs learn function approximation from data directly besides being efficient at inference. To learn \(f\), NPs incorporate the encoder-decoder architecture that comprises a functional encoding of each observation point followed by the learning of a decoder function whose parameters are capable of unraveling the unobserved function realizations to approximate the outputs of \(f\)…. Despite their resemblance to NPs, the vanilla encoder-decoder networks traditionally based on CNNs, RNNs, and Transformers operate merely on pointwise inputs and clearly lack the incentive to meta-learn representations for dynamically changing functions (imagine \(f\) changing over a continuum such as time) and their families. The NPF members not only improve upon these architectures to model functional input spaces and provide uncertainty-aware estimates but also offer natural benefits to a number of challenging real-world tasks. Our study brings to light the potential of NPF models for several such tasks, including but not limited to the handling of missing data, handling off-the-grid data, allowing continual and active learning out-of-the-box, and superior interpretation capabilities, all the while leveraging a diverse range of task-specific inductive biases.
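
To make the encoder-decoder description above concrete, here is a shape-level sketch of a Conditional Neural Process forward pass with untrained random MLPs. The layer sizes, mean-pooling and Gaussian output head are my illustrative choices, not the exact architecture of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # A tiny random MLP: tanh hidden layers, linear output.
    Ws = [rng.standard_normal((a, b)) / np.sqrt(a)
          for a, b in zip(sizes, sizes[1:])]
    def f(h):
        for W in Ws[:-1]:
            h = np.tanh(h @ W)
        return h @ Ws[-1]
    return f

encoder = mlp([2, 64, 64])   # (x_c, y_c) -> pointwise representation
decoder = mlp([65, 64, 2])   # (x_t, r)   -> (mean, log-variance)

x_c = rng.uniform(-2, 2, (10, 1))          # context inputs
y_c = rng.standard_normal((10, 1))         # context outputs
x_t = np.linspace(-2, 2, 50)[:, None]      # target inputs

# Permutation-invariant pooling of the encoded context into one vector r.
r = encoder(np.concatenate([x_c, y_c], axis=1)).mean(axis=0)
r = np.tile(r, (len(x_t), 1))

# Decode each target input, conditioned on r, into a Gaussian prediction.
out = decoder(np.concatenate([x_t, r], axis=1))
mean, var = out[:, 0], np.exp(out[:, 1])
print(mean.shape, var.shape)   # (50,) (50,)
```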

Not sure if this is different again (Louizos et al. 2019).

Not sure if different. See stochastic process regression.

Fortuin, Vincent. 2022.“Priors in Bayesian Deep Learning: A Review.”*International Statistical Review* 90 (3): 563–91.

Garnelo, Marta, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. 2018.“Conditional Neural Processes.”*arXiv:1807.01613 [Cs, Stat]*, July, 10.

Garnelo, Marta, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. 2018.“Neural Processes,” July.

Holderrieth, Peter, Michael J. Hutchinson, and Yee Whye Teh. 2021.“Equivariant Learning of Stochastic Fields: Gaussian Processes and Steerable Conditional Neural Processes.” In*Proceedings of the 38th International Conference on Machine Learning*, 4297–307. PMLR.

Jha, Saurav, Dong Gong, Xuesong Wang, Richard E. Turner, and Lina Yao. 2022.“The Neural Process Family: Survey, Applications and Perspectives.” arXiv.

Lin, Xixun, Jia Wu, Chuan Zhou, Shirui Pan, Yanan Cao, and Bin Wang. 2021.“Task-Adaptive Neural Process for User Cold-Start Recommendation.” In*Proceedings of the Web Conference 2021*, 1306–16. WWW ’21. New York, NY, USA: Association for Computing Machinery.

Louizos, Christos, Xiahan Shi, Klamer Schutte, and Max Welling. 2019.“The Functional Neural Process.” In*Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006.*Gaussian Processes for Machine Learning*. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press.

Singh, Gautam, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. 2019.“Sequential Neural Processes.”*arXiv:1906.10264 [Cs, Stat]*, June.

A meeting point for some related ideas from different fields.
Perspectives on analysing systems in terms of a latent, noisy state, and/or their history of noisy observations.
This notebook is dedicated to the possibly-surprising fact that we can move between *hidden-state*-type representations and *observed-state-only* representations, and indeed mix them together conveniently.
I have many thoughts about this, but nothing concrete to write down at the moment.

See linear feedback systems and linear filter design for stuff about FIR vs IIR filters.
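
A toy version of the hidden-state vs observed-state-only move mentioned above, in the simplest setting I could think of: the same first-order low-pass filter written recursively (IIR, one hidden state) and as a truncated convolution over past inputs only (FIR).

```python
import numpy as np

rng = np.random.default_rng(0)
a, T = 0.9, 200
u = rng.standard_normal(T)

# Hidden-state / recursive form: y_t = a * y_{t-1} + u_t
y_iir = np.zeros(T)
y_iir[0] = u[0]
for t in range(1, T):
    y_iir[t] = a * y_iir[t - 1] + u[t]

# Input-only form: y_t = sum_k a^k u_{t-k}, truncated at K taps
K = 64
kernel = a ** np.arange(K)
y_fir = np.convolve(u, kernel)[:T]

# They agree up to the geometrically small truncation error ~ a^K.
print(np.max(np.abs(y_iir - y_fir)))
```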

Let us talk about Fourier transforms and spectral properties.

Learning state is pointless! Infer directly from observations! See Koopmania.

Hochreiter et al. (2001); Hochreiter (1998); Lamb et al. (2016); Hardt, Ma, and Recht (2018) etc.

Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks and state-space models. See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying, because I spent 2 years of my PhD trying to solve the same problem, and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid. Gu et al. (2021):

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence \(u \mapsto y\) by simply simulating a linear continuous-time state-space representation \(\dot{x} = Ax + Bu, y = Cx + Du\). Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices \(A\) that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100× shorter sequences.

Gu, Goel, and Ré (2021):

A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of 10000 or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM)\(x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t)\), and showed that for appropriate choices of the state matrix\(A\), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning\(A\) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation 60× faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.
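
A minimal sketch of the core trick in both abstracts: discretize \(\dot{x} = Ax + Bu, y = Cx + Du\), then compute the same sequence map either as a recurrence or as a convolution. The forward-Euler step and random matrices here are my simplifications; the papers use structured \(A\) matrices and fancier discretizations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, dt = 4, 100, 0.1
A = -np.eye(d) + 0.1 * rng.standard_normal((d, d))  # arbitrary stable-ish A
B = rng.standard_normal((d, 1))
C = rng.standard_normal((1, d))
Ad, Bd = np.eye(d) + dt * A, dt * B                 # Euler discretization
u = rng.standard_normal(T)

# (1) Recurrent mode: O(1) state per step, good for online inference.
x, y_rec = np.zeros((d, 1)), []
for t in range(T):
    x = Ad @ x + Bd * u[t]
    y_rec.append((C @ x).item())

# (2) Convolutional mode: precompute the impulse-response kernel
# K[k] = C Ad^k Bd, then convolve; good for parallel training.
K = [(C @ np.linalg.matrix_power(Ad, k) @ Bd).item() for k in range(T)]
y_conv = [sum(K[k] * u[t - k] for k in range(t + 1)) for t in range(T)]

print(np.allclose(y_rec, y_conv))  # same map, two computation orders
```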

Related? Li et al. (2022).

Interesting parallel to the recursive/non-recursive duality in how the RWKV language models work. Question: can they do the jobs of transformers? Nearly (Vardasbi et al. 2023).
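That duality is easy to verify for a linear SSM: unrolling the recurrence \(x_k = Ax_{k-1} + Bu_k,\ y_k = Cx_k\) gives a convolution of the input with the kernel \(K_i = CA^iB\). A toy numpy check (my own illustration, not code from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 3, 20
A = 0.3 * rng.normal(size=(n, n))   # scaled down so powers of A stay tame
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))
u = rng.normal(size=L)

# Recurrent view: step the hidden state through the sequence.
x = np.zeros((n, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# Convolutional view: the same map is convolution with kernel K_i = C A^i B.
K = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(L)])
y_conv = np.convolve(u, K)[:L]

assert np.allclose(y_rec, y_conv)
```

The recurrent form is cheap at inference time; the convolutional form parallelizes over the sequence at training time. S4-style models exploit exactly this switch.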

Arjovsky, Martin, Amar Shah, and Yoshua Bengio. 2016.“Unitary Evolution Recurrent Neural Networks.” In*Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48*, 1120–28. ICML’16. New York, NY, USA: JMLR.org.

Atal, B. S. 2006.“The History of Linear Prediction.”*IEEE Signal Processing Magazine* 23 (2): 154–61.

Ben Taieb, Souhaib, and Amir F. Atiya. 2016.“A Bias and Variance Analysis for Multistep-Ahead Time Series Forecasting.”*IEEE transactions on neural networks and learning systems* 27 (1): 62–76.

Bengio, Y., P. Simard, and P. Frasconi. 1994.“Learning Long-Term Dependencies with Gradient Descent Is Difficult.”*IEEE Transactions on Neural Networks* 5 (2): 157–66.

Bordes, Antoine, Léon Bottou, and Patrick Gallinari. 2009.“SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent.”*Journal of Machine Learning Research* 10 (December): 1737–54.

Cakir, Emre, Ezgi Can Ozan, and Tuomas Virtanen. 2016.“Filterbank Learning for Deep Neural Network Based Polyphonic Sound Event Detection.” In*Neural Networks (IJCNN), 2016 International Joint Conference on*, 3399–3406. IEEE.

Chang, Bo, Minmin Chen, Eldad Haber, and Ed H. Chi. 2019.“AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.” In*Proceedings of ICLR*.

Chang, Bo, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. 2018.“Reversible Architectures for Arbitrarily Deep Residual Neural Networks.” In*arXiv:1709.03698 [Cs, Stat]*.

Chang, Bo, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. 2018.“Multi-Level Residual Networks from Dynamical Systems View.” In*Proceedings of ICLR*.

Chung, Junyoung, Sungjin Ahn, and Yoshua Bengio. 2016.“Hierarchical Multiscale Recurrent Neural Networks.”*arXiv:1609.01704 [Cs]*, September.

Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015.“A Recurrent Latent Variable Model for Sequential Data.” In*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc.

Collins, Jasmine, Jascha Sohl-Dickstein, and David Sussillo. 2016.“Capacity and Trainability in Recurrent Neural Networks.” In*arXiv:1611.09913 [Cs, Stat]*.

Cooijmans, Tim, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2016.“Recurrent Batch Normalization.”*arXiv Preprint arXiv:1603.09025*.

Doucet, Arnaud, Nando Freitas, and Neil Gordon. 2001.*Sequential Monte Carlo Methods in Practice*. New York, NY: Springer New York.

Fraccaro, Marco, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016.“Sequential Neural Models with Stochastic Layers.” In*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc.

Goodwin, M M, and M Vetterli. 1999.“Matching Pursuit and Atomic Signal Models Based on Recursive Filter Banks.”*IEEE Transactions on Signal Processing* 47 (7): 1890–1902.

Grosse, Roger, Rajat Raina, Helen Kwong, and Andrew Y. Ng. 2007.“Shift-Invariant Sparse Coding for Audio Classification.” In*The Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI2007)*, 9:8.

Gu, Albert, Karan Goel, and Christopher Ré. 2021.“Efficiently Modeling Long Sequences with Structured State Spaces.”

Gu, Albert, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. 2021.“Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In*Advances in Neural Information Processing Systems*, 34:572–85. Curran Associates, Inc.

Haber, Eldad, and Lars Ruthotto. 2018.“Stable Architectures for Deep Neural Networks.”*Inverse Problems* 34 (1): 014004.

Hardt, Moritz, Tengyu Ma, and Benjamin Recht. 2018.“Gradient Descent Learns Linear Dynamical Systems.”*The Journal of Machine Learning Research* 19 (1): 1025–68.

Haykin, Simon S., ed. 2001.*Kalman Filtering and Neural Networks*. Adaptive and Learning Systems for Signal Processing, Communications, and Control. New York: Wiley.

Hazan, Elad, Karan Singh, and Cyril Zhang. 2017.“Learning Linear Dynamical Systems via Spectral Filtering.” In*NIPS*.

Heaps, Sarah E. 2020.“Enforcing Stationarity Through the Prior in Vector Autoregressions.”*arXiv:2004.09455 [Stat]*, April.

Hochreiter, Sepp. 1998.“The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions.”*International Journal of Uncertainty Fuzziness and Knowledge Based Systems* 6: 107–15.

Hochreiter, Sepp, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001.“Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies.” In*A Field Guide to Dynamical Recurrent Neural Networks*. IEEE Press.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997.“Long Short-Term Memory.”*Neural Computation* 9 (8): 1735–80.

Hürzeler, Markus, and Hans R. Künsch. 2001.“Approximating and Maximising the Likelihood for a General State-Space Model.” In*Sequential Monte Carlo Methods in Practice*, 159–75. Statistics for Engineering and Information Science. Springer, New York, NY.

Ionides, E. L., C. Bretó, and A. A. King. 2006.“Inference for Nonlinear Dynamical Systems.”*Proceedings of the National Academy of Sciences* 103 (49): 18438–43.

Ionides, Edward L., Anindya Bhadra, Yves Atchadé, and Aaron King. 2011.“Iterated Filtering.”*The Annals of Statistics* 39 (3): 1776–1802.

Jaeger, Herbert. 2002.*Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the “Echo State Network” Approach*. Vol. 5. GMD-Forschungszentrum Informationstechnik.

Jing, Li, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Soljačić. 2017.“Tunable Efficient Unitary Neural Networks (EUNN) and Their Application to RNNs.” In*PMLR*, 1733–41.

Kailath, Thomas. 1980.*Linear Systems*. Prentice-Hall Information and System Science Series. Englewood Cliffs, N.J: Prentice-Hall.

Kailath, Thomas, Ali H. Sayed, and Babak Hassibi. 2000.*Linear Estimation*. Prentice Hall Information and System Sciences Series. Upper Saddle River, N.J: Prentice Hall.

Kaul, Shiva. 2020.“Linear Dynamical Systems as a Core Computational Primitive.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016.“Improving Variational Inference with Inverse Autoregressive Flow.” In*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc.

Krishnan, Rahul G., Uri Shalit, and David Sontag. 2015.“Deep Kalman Filters.”*arXiv Preprint arXiv:1511.05121*.

———. 2017.“Structured Inference Networks for Nonlinear State Space Models.” In*Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence*, 2101–9.

Kutschireiter, Anna, Simone Carlo Surace, Henning Sprekeler, and Jean-Pascal Pfister. 2015a.“A Neural Implementation for Nonlinear Filtering.”*arXiv Preprint arXiv:1508.06818*.

Kutschireiter, Anna, Simone C Surace, Henning Sprekeler, and Jean-Pascal Pfister. 2015b.“Approximate Nonlinear Filtering with a Recurrent Neural Network.”*BMC Neuroscience* 16 (Suppl 1): P196.

Lamb, Alex, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. 2016.“Professor Forcing: A New Algorithm for Training Recurrent Networks.” In*Advances In Neural Information Processing Systems*.

Laurent, Thomas, and James von Brecht. 2016.“A Recurrent Neural Network Without Chaos.”*arXiv:1612.06212 [Cs]*, December.

Li, Yuhong, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. 2022.“What Makes Convolutional Models Great on Long Sequence Modeling?” arXiv.

Lipton, Zachary C. 2016.“Stuck in a What? Adventures in Weight Space.”*arXiv:1602.07320 [Cs]*, February.

Ljung, Lennart. 1999.*System Identification: Theory for the User*. 2nd ed. Prentice Hall Information and System Sciences Series. Upper Saddle River, NJ: Prentice Hall PTR.

Ljung, Lennart, and Torsten Söderström. 1983.*Theory and Practice of Recursive Identification*. The MIT Press Series in Signal Processing, Optimization, and Control 4. Cambridge, Mass: MIT Press.

MacKay, Matthew, Paul Vicol, Jimmy Ba, and Roger Grosse. 2018.“Reversible Recurrent Neural Networks.” In*Advances In Neural Information Processing Systems*.

Marelli, D., and Minyue Fu. 2010.“A Recursive Method for the Approximation of LTI Systems Using Subband Processing.”*IEEE Transactions on Signal Processing* 58 (3): 1025–34.

Martens, James, and Ilya Sutskever. 2011.“Learning Recurrent Neural Networks with Hessian-Free Optimization.” In*Proceedings of the 28th International Conference on International Conference on Machine Learning*, 1033–40. ICML’11. USA: Omnipress.

Mattingley, J., and S. Boyd. 2010.“Real-Time Convex Optimization in Signal Processing.”*IEEE Signal Processing Magazine* 27 (3): 50–61.

Megretski, A. 2003.“Positivity of Trigonometric Polynomials.” In*42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475)*, 4:3814–3817 vol.4.

Mehri, Soroush, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, and Yoshua Bengio. 2017.“SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.” In*Proceedings of International Conference on Learning Representations (ICLR) 2017*.

Mhammedi, Zakaria, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. 2017.“Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In*PMLR*, 2401–9.

Miller, John, and Moritz Hardt. 2018.“When Recurrent Models Don’t Need To Be Recurrent.”*arXiv:1805.10369 [Cs, Stat]*, May.

Moradkhani, Hamid, Soroosh Sorooshian, Hoshin V. Gupta, and Paul R. Houser. 2005.“Dual State–Parameter Estimation of Hydrological Models Using Ensemble Kalman Filter.”*Advances in Water Resources* 28 (2): 135–47.

Nerrand, O., P. Roussel-Ragot, L. Personnaz, G. Dreyfus, and S. Marcos. 1993.“Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms.”*Neural Computation* 5 (2): 165–99.

Oliveira, Maurício C. de, and Robert E. Skelton. 2001.“Stability Tests for Constrained Linear Systems.” In*Perspectives in Robust Control*, 241–57. Lecture Notes in Control and Information Sciences. Springer, London.

Roberts, Adam, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018.“A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music.”*arXiv:1803.05428 [Cs, Eess, Stat]*, March.

Routtenberg, Tirza, and Joseph Tabrikian. 2010.“Blind MIMO-AR System Identification and Source Separation with Finite-Alphabet.”*IEEE Transactions on Signal Processing* 58 (3): 990–1000.

Seuret, Alexandre, and Frédéric Gouaisbaut. 2013.“Wirtinger-Based Integral Inequality: Application to Time-Delay Systems.”*Automatica* 49 (9): 2860–66.

Sjöberg, Jonas, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. 1995.“Nonlinear Black-Box Modeling in System Identification: A Unified Overview.”*Automatica*, Trends in System Identification, 31 (12): 1691–1724.

Smith, Leonard A. 2000.“Disentangling Uncertainty and Error: On the Predictability of Nonlinear Systems.” In*Nonlinear Dynamics and Statistics*.

Söderström, T., and P. Stoica, eds. 1988.*System Identification*. Upper Saddle River, NJ, USA: Prentice-Hall, Inc.

Stepleton, Thomas, Razvan Pascanu, Will Dabney, Siddhant M. Jayakumar, Hubert Soyer, and Remi Munos. 2018.“Low-Pass Recurrent Neural Networks - A Memory Architecture for Longer-Term Correlation Discovery.”*arXiv:1805.04955 [Cs, Stat]*, May.

Sutskever, Ilya. 2013.“Training Recurrent Neural Networks.” PhD Thesis, Toronto, Ont., Canada, Canada: University of Toronto.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015.“Going Deeper with Convolutions.” In*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*, 1–9.

Telgarsky, Matus. 2017.“Neural Networks and Rational Functions.” In*PMLR*, 3387–93.

Thickstun, John, Zaid Harchaoui, and Sham Kakade. 2017.“Learning Features of Music from Scratch.” In*Proceedings of International Conference on Learning Representations (ICLR) 2017*.

Vardasbi, Ali, Telmo Pessoa Pires, Robin M. Schmidt, and Stephan Peitz. 2023.“State Spaces Aren’t Enough: Machine Translation Needs Attention.” arXiv.

Welch, Peter D. 1967.“The Use of Fast Fourier Transform for the Estimation of Power Spectra: A Method Based on Time Averaging over Short, Modified Periodograms.”*IEEE Transactions on Audio and Electroacoustics* 15 (2): 70–73.

Werbos, Paul J. 1988.“Generalization of Backpropagation with Application to a Recurrent Gas Market Model.”*Neural Networks* 1 (4): 339–56.

———. 1990.“Backpropagation Through Time: What It Does and How to Do It.”*Proceedings of the IEEE* 78 (10): 1550–60.

Wiatowski, Thomas, Philipp Grohs, and Helmut Bölcskei. 2018.“Energy Propagation in Deep Convolutional Neural Networks.”*IEEE Transactions on Information Theory* 64 (7): 1–1.

Williams, Ronald J., and Jing Peng. 1990.“An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories.”*Neural Computation* 2 (4): 490–501.

Wisdom, Scott, Thomas Powers, James Pitton, and Les Atlas. 2016.“Interpretable Recurrent Neural Networks Using Sequential Sparse Recovery.” In*Advances in Neural Information Processing Systems 29*.

Yu, D., and L. Deng. 2011.“Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP].”*IEEE Signal Processing Magazine* 28 (1): 145–54.

Zinkevich, Martin. 2003.“Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In*Proceedings of the Twentieth International Conference on International Conference on Machine Learning*, 928–35. ICML’03. Washington, DC, USA: AAAI Press.

Kalman-Bucy filter and variants, recursive estimation, predictive state models, data assimilation. A particular sub-field of signal processing for models with hidden state.

In statistics terms, the state filters are a kind of online-updating hierarchical model for sequential observations of a dynamical system, where the random state is unobserved but you can get an optimal estimate of it from incoming measurements and known parameters.

A unifying feature of all of these is that, by assuming a sparse influence graph between observations and dynamics, you can estimate behaviour using efficient message passing.

This is a twin problem to optimal control.
If I wish to tackle this problem from the perspective of *observations* rather than true state, perhaps I could do it from the perspective of Koopman operators.

In Kalman filters *per se* the default problem usually concerns multivariate real vector signals representing different axes of some telemetry data.
In the degenerate case where there is no observation noise, we can just design a linear filter which solves the target problem.

The classic Kalman filter (R. E. Kalman 1960) assumes a linear model with Gaussian noise, although it might work with not-quite-Gaussian, not-quite-linear models if you prod it. You can extend this flavour to somewhat more general dynamics; see below.
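For concreteness, here is the textbook linear-Gaussian predict/correct cycle as a minimal numpy sketch (the random-walk tracking example is my own illustrative choice):

```python
import numpy as np

def kalman_step(m, P, y, A, Q, H, R):
    """One cycle of the classic Kalman filter: prior N(m, P),
    dynamics x_t = A x_{t-1} + N(0, Q), observation y_t = H x_t + N(0, R)."""
    # Prediction
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Correction
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = (np.eye(len(m)) - K @ H) @ P_pred
    return m_new, P_new

# Track a scalar random walk observed in heavy noise.
rng = np.random.default_rng(1)
A = H = np.eye(1)
Q = np.eye(1) * 0.01
R = np.eye(1) * 1.0
x, m, P = np.zeros(1), np.zeros(1), np.eye(1)
for _ in range(100):
    x = x + rng.normal(scale=0.1, size=1)
    y_obs = x + rng.normal(scale=1.0, size=1)
    m, P = kalman_step(m, P, y_obs, A, Q, H, R)
```

The posterior variance `P` settles well below the observation noise `R`, which is the whole point of filtering rather than trusting each measurement.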

NB I’m conflating linear observation and linear process models, for now. We can relax that when there are some concrete examples in play.

There are a large number of equivalent formulations of the Kalman filter. The notation of Fearnhead and Künsch (2018) is representative. They start from the usual state filter setting: the state process \(\left(\mathbf{X}_{t}\right)\) is assumed to be Markovian, and the \(i\)-th observation, \(\mathbf{Y}_{i}\), depends only on the state at time \(i\), \(\mathbf{X}_{i}\), so that the evolution and observation variates are defined by \[ \begin{aligned} \mathbf{X}_{t} \mid\left(\mathbf{x}_{0: t-1}, \mathbf{y}_{1: t-1}\right) & \sim P\left(d \mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right), \quad \mathbf{X}_{0} \sim \pi_{0}\left(d \mathbf{x}_{0}\right) \\ \mathbf{Y}_{t} \mid\left(\mathbf{x}_{0: t}, \mathbf{y}_{1: t-1}\right) & \sim g\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right) \nu\left(d \mathbf{y}_{t}\right) \end{aligned} \] with joint distribution \[ \left(\mathbf{X}_{0: s}, \mathbf{Y}_{1: t}\right) \sim \pi_{0}\left(d \mathbf{x}_{0}\right) \prod_{i=1}^{s} P\left(d \mathbf{x}_{i} \mid \mathbf{x}_{i-1}\right) \prod_{j=1}^{t} g\left(\mathbf{y}_{j} \mid \mathbf{x}_{j}\right) \nu\left(d \mathbf{y}_{j}\right), \quad s \geq t. \]

Integrating out the path of the state process, we obtain \[\begin{aligned} \mathbf{Y}_{1: t} &\sim p\left(\mathbf{y}_{1: t}\right) \prod_{j} \nu\left(d \mathbf{y}_{j}\right)\text{, where}\\ p\left(\mathbf{y}_{1: t}\right) &=\int \pi_{0}\left(d \mathbf{x}_{0}\right) \prod_{i=1}^{t} P\left(d \mathbf{x}_{i} \mid \mathbf{x}_{i-1}\right) \prod_{j=1}^{t} g\left(\mathbf{y}_{j} \mid \mathbf{x}_{j}\right). \end{aligned} \] We wish to find the distribution \(\pi_{0: s \mid t}=\frac{p(\mathbf{y}_{1: t},\mathbf{x}_{0:s})}{p(\mathbf{y}_{1: t})}\) (by Bayes’ rule). We deduce the recursion \[ \begin{aligned} \pi_{0: t \mid t-1}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t-1}\right) &=\pi_{0: t-1 \mid t-1}\left(d \mathbf{x}_{0: t-1} \mid \mathbf{y}_{1: t-1}\right) P\left(d \mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right) &\text{ prediction}\\ \pi_{0: t \mid t}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t}\right) &=\pi_{0: t \mid t-1}\left(d \mathbf{x}_{0: t} \mid \mathbf{y}_{1: t-1}\right) \frac{g\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right)}{p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1: t-1}\right)} &\text{ correction} \end{aligned} \] where \[ p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1: t-1}\right)=\frac{p\left(\mathbf{y}_{1: t}\right)}{p\left(\mathbf{y}_{1: t-1}\right)}=\int \pi_{t \mid t-1}\left(d \mathbf{x}_{t} \mid \mathbf{y}_{1: t-1}\right) g\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right).
\] Integrating out all but the latest state, i.e. marginalising over \(\mathbf{x}_{0: t-1}\), gives us the one-step recursion \[ \begin{aligned} \pi_{t \mid t-1}\left(d \mathbf{x}_{t} \mid \mathbf{y}_{1: t-1}\right) &=\int \pi_{t-1}\left(d \mathbf{x}_{t-1} \mid \mathbf{y}_{1: t-1}\right) P\left(d \mathbf{x}_{t} \mid \mathbf{x}_{t-1}\right) &\text{ prediction}\\ \pi_{t}\left(d \mathbf{x}_{t} \mid \mathbf{y}_{1: t}\right) &=\pi_{t \mid t-1}\left(d \mathbf{x}_{t} \mid \mathbf{y}_{1: t-1}\right) \frac{g\left(\mathbf{y}_{t} \mid \mathbf{x}_{t}\right)}{p\left(\mathbf{y}_{t} \mid \mathbf{y}_{1: t-1}\right)}&\text{ correction} \end{aligned} \]

If we approximate the filter distribution \(\pi_t\) with a Monte Carlo sample, we are doing particle filtering, which Fearnhead and Künsch (2018) refer to as *bootstrap filtering*.
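The bootstrap filter implements the recursion above with a weighted sample: propagate particles through \(P(d\mathbf{x}_t \mid \mathbf{x}_{t-1})\), reweight by \(g(\mathbf{y}_t \mid \mathbf{x}_t)\), resample. A sketch for a scalar random-walk model (the model and all parameters are illustrative choices of mine):

```python
import numpy as np

def bootstrap_filter(ys, n_particles=1000, q=0.1, r=0.5, rng=None):
    """Bootstrap particle filter for x_t = x_{t-1} + N(0, q^2),
    y_t = x_t + N(0, r^2). Returns posterior-mean estimates of x_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(scale=1.0, size=n_particles)        # sample from pi_0
    means = []
    for y in ys:
        x = x + rng.normal(scale=q, size=n_particles)  # prediction step
        logw = -0.5 * ((y - x) / r) ** 2               # correction: g(y | x)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means.append(np.sum(w * x))
        x = rng.choice(x, size=n_particles, p=w)       # multinomial resampling
    return np.array(means)

# The filter mean should track the hidden state better than raw observations do.
rng = np.random.default_rng(2)
xs = np.cumsum(rng.normal(scale=0.1, size=50))
ys = xs + rng.normal(scale=0.5, size=50)
est = bootstrap_filter(ys, rng=np.random.default_rng(3))
```

Nothing here uses linearity or Gaussianity of the model; swap in any transition sampler and any observation density and the same loop works, which is the appeal.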

TODO: implied Kalman gain etc.

Cute exercise: you can derive an analytic Kalman-type filter for any noise and process dynamics with a Bayesian conjugate structure, and this leads to filters with nonlinear behaviour. Multivariate distributions are a bit of a mess for non-Gaussians, though, and a beta-Kalman filter feels contrived.
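For instance, a gamma prior on a Poisson rate stays gamma under the conjugate update, and discounting the pseudo-counts between steps inflates the uncertainty, standing in for process noise (in the spirit of discount dynamic models; the specifics here are my own toy example):

```python
import numpy as np

def gamma_poisson_filter(ys, a0=1.0, b0=1.0, discount=0.9):
    """Analytic non-Gaussian filter via conjugacy: gamma prior on a Poisson
    rate. Discounting (a, b) between steps plays the role of process noise."""
    a, b = a0, b0
    means = []
    for y in ys:
        a, b = discount * a, discount * b  # "prediction": forget a little
        a, b = a + y, b + 1.0              # "correction": conjugate update
        means.append(a / b)                # posterior mean of the rate
    return np.array(means)

# Counts jump partway through; the rate estimate follows with a lag.
rates = gamma_poisson_filter(np.array([2, 3, 2, 8, 9, 10]))
```

The update is exact and costs a couple of additions per step; the contrivance is in finding a conjugate pair that matches your actual dynamics.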

The upshot is that the non-linear extensions don’t usually rely on non-Gaussian conjugate distributions and analytic forms, but rather make some Gaussian/linear approximation, or use randomised methods such as particle filters.

For some examples in Stan see Sinhrks’ stan-statespace.

See, e.g., Bagge Carlson (2018).

i.e. using the unscented transform.
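The unscented transform propagates a Gaussian through a nonlinearity by pushing \(2n+1\) deterministically chosen sigma points through it and recombining, instead of linearizing. A sketch with the standard scaled sigma-point weights (the parameter choices are illustrative):

```python
import numpy as np

def unscented_transform(f, m, P, alpha=0.1, beta=2.0, kappa=0.0):
    """Approximate the mean and covariance of f(X) for X ~ N(m, P)
    by propagating 2n+1 sigma points through f."""
    n = len(m)
    lam = alpha ** 2 * (n + kappa) - n
    L = np.linalg.cholesky((n + lam) * P)
    sigmas = np.vstack([m, m + L.T, m - L.T])   # rows are the sigma points
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wc = wm.copy()
    wm[0] = lam / (n + lam)
    wc[0] = wm[0] + (1 - alpha ** 2 + beta)
    fx = np.array([f(s) for s in sigmas])
    mean = wm @ fx
    cov = (fx - mean).T @ np.diag(wc) @ (fx - mean)
    return mean, cov

# Sanity check: for a linear map the transform recovers the exact moments.
A = np.array([[2.0, 0.0], [1.0, 1.0]])
m = np.array([1.0, 2.0])
P = np.array([[1.0, 0.5], [0.5, 2.0]])
mu, S = unscented_transform(lambda x: A @ x, m, P)
```

Plugging this moment-matching step into the Kalman predict/correct cycle in place of the exact linear propagation gives the unscented Kalman filter.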

How about learning the *parameters* of the model generating your states?
Ways you can do this in dynamical systems include basic linear system identification and general system identification.

Aasnaes, H., and T. Kailath. 1973.“An Innovations Approach to Least-Squares Estimation–Part VII: Some Applications of Vector Autoregressive-Moving Average Models.”*IEEE Transactions on Automatic Control* 18 (6): 601–7.

Alliney, S. 1992.“Digital Filters as Absolute Norm Regularizers.”*IEEE Transactions on Signal Processing* 40 (6): 1548–62.

Ansley, Craig F., and Robert Kohn. 1985.“Estimation, Filtering, and Smoothing in State Space Models with Incompletely Specified Initial Conditions.”*The Annals of Statistics* 13 (4): 1286–316.

Arulampalam, M. S., S. Maskell, N. Gordon, and T. Clapp. 2002.“A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking.”*IEEE Transactions on Signal Processing* 50 (2): 174–88.

Bagge Carlson, Fredrik. 2018.“Machine Learning and System Identification for Estimation in Physical Systems.” PhD thesis, Lund University.

Battey, Heather, and Alessio Sancetta. 2013.“Conditional Estimation for Dependent Functional Data.”*Journal of Multivariate Analysis* 120 (September): 1–17.

Batz, Philipp, Andreas Ruttor, and Manfred Opper. 2017.“Approximate Bayes Learning of Stochastic Differential Equations.”*arXiv:1702.05390 [Physics, Stat]*, February.

Becker, Philipp, Harit Pandya, Gregor Gebhardt, Cheng Zhao, C. James Taylor, and Gerhard Neumann. 2019.“Recurrent Kalman Networks: Factorized Inference in High-Dimensional Deep Feature Spaces.” In*International Conference on Machine Learning*, 544–52.

Berkhout, A. J., and P. R. Zaanen. 1976.“A Comparison Between Wiener Filtering, Kalman Filtering, and Deterministic Least Squares Estimation.”*Geophysical Prospecting* 24 (1): 141–97.

Bilmes, Jeff A. 1998.“A Gentle Tutorial of the EM Algorithm and Its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models.”*International Computer Science Institute* 4 (510): 126.

Bishop, Adrian N., and Pierre Del Moral. 2016.“On the Stability of Kalman-Bucy Diffusion Processes.”*SIAM Journal on Control and Optimization* 55 (6): 4015–47.

———. 2023.“On the Mathematical Theory of Ensemble (Linear-Gaussian) Kalman-Bucy Filtering.” arXiv.

Bishop, Adrian N., Pierre Del Moral, and Sahani D. Pathiraja. 2017.“Perturbations and Projections of Kalman-Bucy Semigroups Motivated by Methods in Data Assimilation.”*arXiv:1701.05978 [Math]*, January.

Bretó, Carles, Daihai He, Edward L. Ionides, and Aaron A. King. 2009.“Time Series Analysis via Mechanistic Models.”*The Annals of Applied Statistics* 3 (1): 319–48.

Brunton, Steven L., Joshua L. Proctor, and J. Nathan Kutz. 2016.“Discovering Governing Equations from Data by Sparse Identification of Nonlinear Dynamical Systems.”*Proceedings of the National Academy of Sciences* 113 (15): 3932–37.

Campbell, Andrew, Yuyang Shi, Tom Rainforth, and Arnaud Doucet. 2021.“Online Variational Filtering and Parameter Learning.”

Carmi, Avishy Y. 2013.“Compressive System Identification: Sequential Methods and Entropy Bounds.”*Digital Signal Processing* 23 (3): 751–70.

———. 2014.“Compressive System Identification.” In*Compressed Sensing & Sparse Filtering*, edited by Avishy Y. Carmi, Lyudmila Mihaylova, and Simon J. Godsill, 281–324. Signals and Communication Technology. Springer Berlin Heidelberg.

Cassidy, Ben, Caroline Rae, and Victor Solo. 2015.“Brain Activity: Connectivity, Sparsity, and Mutual Information.”*IEEE Transactions on Medical Imaging* 34 (4): 846–60.

Cauchemez, Simon, and Neil M. Ferguson. 2008.“Likelihood-Based Estimation of Continuous-Time Epidemic Models from Time-Series Data: Application to Measles Transmission in London.”*Journal of The Royal Society Interface* 5 (25): 885–97.

Charles, Adam, Aurele Balavoine, and Christopher Rozell. 2016.“Dynamic Filtering of Time-Varying Sparse Signals via L1 Minimization.”*IEEE Transactions on Signal Processing* 64 (21): 5644–56.

Chen, Bin, and Yongmiao Hong. 2012.“Testing for the Markov Property in Time Series.”*Econometric Theory* 28 (01): 130–78.

Chen, Y., and A. O. Hero. 2012.“Recursive ℓ1,∞ Group Lasso.”*IEEE Transactions on Signal Processing* 60 (8): 3978–87.

Chung, Junyoung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. 2015.“A Recurrent Latent Variable Model for Sequential Data.” In*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2980–88. Curran Associates, Inc.

Clark, James S., and Ottar N. Bjørnstad. 2004.“Population Time Series: Process Variability, Observation Errors, Missing Values, Lags, and Hidden States.”*Ecology* 85 (11): 3140–50.

Commandeur, Jacques J. F., and Siem Jan Koopman. 2007.*An Introduction to State Space Time Series Analysis*. 1 edition. Oxford ; New York: Oxford University Press.

Cox, Marco, Thijs van de Laar, and Bert de Vries. 2019.“A Factor Graph Approach to Automated Design of Bayesian Signal Processing Algorithms.”*International Journal of Approximate Reasoning* 104 (January): 185–204.

Cressie, Noel, and Hsin-Cheng Huang. 1999.“Classes of Nonseparable, Spatio-Temporal Stationary Covariance Functions.”*Journal of the American Statistical Association* 94 (448): 1330–39.

Cressie, Noel, Tao Shi, and Emily L. Kang. 2010.“Fixed Rank Filtering for Spatio-Temporal Data.”*Journal of Computational and Graphical Statistics* 19 (3): 724–45.

Cressie, Noel, and Christopher K. Wikle. 2011.*Statistics for Spatio-Temporal Data*. Wiley Series in Probability and Statistics 2.0. John Wiley and Sons.

Deisenroth, Marc Peter, and Shakir Mohamed. 2012.“Expectation Propagation in Gaussian Process Dynamical Systems.” In*Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2*, 25:2609–17. NIPS’12. Red Hook, NY, USA: Curran Associates Inc.

Del Moral, P., A. Kurtzmann, and J. Tugaut. 2017.“On the Stability and the Uniform Propagation of Chaos of a Class of Extended Ensemble Kalman-Bucy Filters.”*SIAM Journal on Control and Optimization* 55 (1): 119–55.

Doucet, Arnaud, Pierre E. Jacob, and Sylvain Rubenthaler. 2013.“Derivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models.”*arXiv:1304.5768 [Stat]*, April.

Durbin, J., and S. J. Koopman. 1997.“Monte Carlo Maximum Likelihood Estimation for Non-Gaussian State Space Models.”*Biometrika* 84 (3): 669–84.

———. 2012.*Time Series Analysis by State Space Methods*. 2nd ed. Oxford Statistical Science Series 38. Oxford: Oxford University Press.

Duttweiler, D., and T. Kailath. 1973a.“RKHS Approach to Detection and Estimation Problems–IV: Non-Gaussian Detection.”*IEEE Transactions on Information Theory* 19 (1): 19–28.

———. 1973b.“RKHS Approach to Detection and Estimation Problems–V: Parameter Estimation.”*IEEE Transactions on Information Theory* 19 (1): 29–37.

Easley, Deanna, and Tyrus Berry. 2020.“A Higher Order Unscented Transform.”*arXiv:2006.13429 [Cs, Math]*, June.

Eddy, Sean R. 1996.“Hidden Markov Models.”*Current Opinion in Structural Biology* 6 (3): 361–65.

Eden, U, L Frank, R Barbieri, V Solo, and E Brown. 2004.“Dynamic Analysis of Neural Encoding by Point Process Adaptive Filtering.”*Neural Computation* 16 (5): 971–98.

Edwards, David, and Smitha Ankinakatte. 2015.“Context-Specific Graphical Models for Discrete Longitudinal Data.”*Statistical Modelling* 15 (4): 301–25.

Eleftheriadis, Stefanos, Tom Nicholson, Marc Deisenroth, and James Hensman. 2017.“Identification of Gaussian Process State Space Models.” In*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 5309–19. Curran Associates, Inc.

Fearnhead, Paul, and Hans R. Künsch. 2018.“Particle Filters and Data Assimilation.”*Annual Review of Statistics and Its Application* 5 (1): 421–49.

Finke, Axel, and Sumeetpal S. Singh. 2016.“Approximate Smoothing and Parameter Estimation in High-Dimensional State-Space Models.”*arXiv:1606.08650 [Stat]*, June.

Föll, Roman, Bernard Haasdonk, Markus Hanselmann, and Holger Ulmer. 2017.“Deep Recurrent Gaussian Process with Variational Sparse Spectrum Approximation.”*arXiv:1711.00799 [Stat]*, November.

Fraccaro, Marco, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. 2016.“Sequential Neural Models with Stochastic Layers.” In*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2199–2207. Curran Associates, Inc.

Fraser, Andrew M. 2008.*Hidden Markov Models and Dynamical Systems*. Philadelphia, PA: Society for Industrial and Applied Mathematics.

Freitas, J. F. G. de, Mahesan Niranjan, A. H. Gee, and Arnaud Doucet. 1998.“Sequential Monte Carlo Methods for Optimisation of Neural Network Models.”*Cambridge University Engineering Department, Cambridge, England, Technical Report TR-328*.

Freitas, João FG de, Arnaud Doucet, Mahesan Niranjan, and Andrew H. Gee. 1998.“Global Optimisation of Neural Network Models via Sequential Sampling.” In*Proceedings of the 11th International Conference on Neural Information Processing Systems*, 410–16. NIPS’98. Cambridge, MA, USA: MIT Press.

Friedlander, B., T. Kailath, and L. Ljung. 1975.“Scattering Theory and Linear Least Squares Estimation: Part II: Discrete-Time Problems.” In*1975 IEEE Conference on Decision and Control Including the 14th Symposium on Adaptive Processes*, 57–58.

Frigola, Roger, Yutian Chen, and Carl Edward Rasmussen. 2014.“Variational Gaussian Process State-Space Models.” In*Advances in Neural Information Processing Systems 27*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 3680–88. Curran Associates, Inc.

Frigola, Roger, Fredrik Lindsten, Thomas B Schön, and Carl Edward Rasmussen. 2013.“Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC.” In*Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 3156–64. Curran Associates, Inc.

Friston, K. J. 2008.“Variational Filtering.”*NeuroImage* 41 (3): 747–66.

A term coined by Language Log to describe what happens when specialists use jargon in public, oblivious to the fact that normal humans will have no way of understanding it.

This page exists so I can document cute examples of nerdview.

An important matrix factorisation. TBC.

(Brand 2006, 2002; Bunch and Nielsen 1978; Gu and Eisenstat 1995, 1993; Sarwar et al. 2002; Zhang 2022).

TODO.

Carlo Tomasi, elegantly pedagogic.

Avrim Blum on SVD
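Pending the TBC above, a minimal numpy sketch of the factorisation itself and of the Eckart–Young property that the incremental-update algorithms cited here go to such lengths to maintain:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Eckart-Young: truncating to rank k gives the best rank-k approximation
# in Frobenius norm, with error equal to the norm of the discarded
# singular values.
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A2, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```

The incremental methods (Brand, Bunch–Nielsen, Gu–Eisenstat) update `U`, `s`, `Vt` in place when rows or columns are appended, rather than recomputing the full decomposition as above.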

Bach, C, D. Ceglia, L. Song, and F. Duddeck. 2019.“Randomized Low-Rank Approximation Methods for Projection-Based Model Order Reduction of Large Nonlinear Dynamical Problems.”*International Journal for Numerical Methods in Engineering* 118 (4): 209–41.

Brand, Matthew. 2002.“Incremental Singular Value Decomposition of Uncertain Data with Missing Values.” In*Computer Vision — ECCV 2002*, edited by Anders Heyden, Gunnar Sparr, Mads Nielsen, and Peter Johansen, 2350:707–20. Berlin, Heidelberg: Springer Berlin Heidelberg.

———. 2006.“Fast Low-Rank Modifications of the Thin Singular Value Decomposition.”*Linear Algebra and Its Applications*, Special Issue on Large Scale Linear and Nonlinear Eigenvalue Problems, 415 (1): 20–30.

Bunch, James R., and Christopher P. Nielsen. 1978.“Updating the Singular Value Decomposition.”*Numerische Mathematik* 31 (2): 111–29.

Gu, Ming, and Stanley C. Eisenstat. 1993.“A Stable and Fast Algorithm for Updating the Singular Value Decomposition.” Citeseer.

———. 1995.“Downdating the Singular Value Decomposition.”*SIAM Journal on Matrix Analysis and Applications* 16 (3): 793–810.

Halko, Nathan, Per-Gunnar Martinsson, and Joel A. Tropp. 2010.“Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions.” arXiv.

Hastie, Trevor, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. 2015.“Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.” In*Journal of Machine Learning Research*, 16:3367–3402.

Rabani, Eran, and Sivan Toledo. 2001.“Out-of-Core SVD and QR Decompositions.” In*PPSC*.

Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. 2002.“Incremental Singular Value Decomposition Algorithms for Highly Scalable Recommender Systems.”

Tropp, Joel A., Alp Yurtsever, Madeleine Udell, and Volkan Cevher. 2016.“Randomized Single-View Algorithms for Low-Rank Matrix Approximation.”*arXiv:1609.00048 [Cs, Math, Stat]*, August.

Zhang, Yangwen. 2022.“An Answer to an Open Question in the Incremental SVD.” arXiv.

A cousin to neural automata: writing machines to code for us. We might also want to write code to speak for us, which ends up involving similar technology, i.e. large language models.

I am vaguely concerned about how much of the world is uploading their source code for everything to these code servers. The potential for abuse is huge.

GitHub Copilot uses OpenAI Codex to suggest code completions.

**Pro tip.** Behind a firewall, Copilot requires at least the following whitelist exceptions:

`vscode-auth.github.com`

`api.github.com`

`copilot-proxy.githubusercontent.com`

See Networked VS Code for some more whitelist rules we need for VS Code generally.

Codeium has been developed by the team at Exafunction to build on the industry-wide momentum on foundational models. We realized that the combination of recent advances in generative models and our world-class optimized deep learning serving software could provide users with top quality AI-based products at the lowest possible costs (or ideally, free!).

AI Code Generator - Amazon CodeWhisperer - AWS

Available as part of the AWS Toolkit for Visual Studio (VS) Code and JetBrains, CodeWhisperer currently supports Python, Java, JavaScript, TypeScript, C#, Go, Rust, PHP, Ruby, Kotlin, C, C++, Shell scripting, SQL and Scala. In addition to VS Code and the JetBrains family of IDEs—including IntelliJ, PyCharm, GoLand, CLion, PhpStorm, RubyMine, Rider, WebStorm, and DataGrip—CodeWhisperer is also available for AWS Cloud9, AWS Lambda console, JupyterLab and Amazon SageMaker Studio.

Free for individual use.

openai/openai-cookbook: Examples and guides for using the OpenAI API

LMQL: Programming Large Language Models: “LMQL is a programming language for language model interaction.”

LMQL generalizes natural language prompting, making it more expressive while remaining accessible. For this, LMQL builds on top of Python, allowing users to express natural language prompts that also contain code. The resulting queries can be directly executed on language models like OpenAI's GPT models. Fixed answer templates and intermediate instructions allow the user to steer the LLM's reasoning process.

AI assisted learning: Learning Rust with ChatGPT, Copilot and Advent of Code

Mitchell Hashimoto on the mysterious ease of ChatGPT plugins

Glean is a system for working with facts about source code. It is designed for collecting and storing detailed information about code structure, and providing access to the data to power tools and experiences from online IDE features to offline code analysis.

For example, Glean could answer all the questions you’d expect your IDE to answer, accurately and efficiently on a large-scale codebase. Things like:

- Where is the definition of this method?
- Where are all the callers of this function?
- Who inherits from this class?
- What are all the declarations in this file?

Beurer-Kellner, Luca, Marc Fischer, and Martin Vechev. 2022.“Prompting Is Programming: A Query Language For Large Language Models.” arXiv.

Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, et al. 2023.“Sparks of Artificial General Intelligence: Early Experiments with GPT-4.” arXiv.

Din, Alexander Yom, Taelin Karidi, Leshem Choshen, and Mor Geva. 2023.“Jump to Conclusions: Short-Cutting Transformers With Linear Transformations.”

Suzgun, Mirac, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, et al. 2022.“Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them.” arXiv.

Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023.“Self-Consistency Improves Chain of Thought Reasoning in Language Models.” arXiv.

Learning to approximate differential equations and other interpretable physical dynamics with neural nets.
Related: analysing a neural net itself *as* a dynamical system, which is not quite the same but crosses over, or learning general recurrent dynamics. See also variational state filters.
Where the parameters are meaningful, not just weights, we tend to think about system identification.

A deterministic version of this problem is what e.g. the famous Vector Institute Neural ODE paper (T. Q. Chen et al. 2018) did. Author Duvenaud argues that in some ways the hype ran away with the Neural ODE paper, and credits CasADi with the innovations.
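The core idea is small enough to sketch. The forward pass of a neural ODE just integrates a learned vector field; the sketch below uses a random stand-in MLP for the vector field and forward Euler instead of the adaptive solvers and adjoint-method gradients of Chen et al. (2018), so it is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny random MLP f(z), standing in for the learned vector field
# dz/dt = f(z). In a real neural ODE, W1 and W2 are trained.
W1 = 0.5 * rng.standard_normal((16, 2))
W2 = 0.5 * rng.standard_normal((2, 16))

def vector_field(z):
    return W2 @ np.tanh(W1 @ z)

def odeint_euler(z0, t0=0.0, t1=1.0, steps=100):
    """Forward-Euler integration of dz/dt = f(z) from t0 to t1.

    Real implementations use adaptive solvers and the adjoint method
    to get gradients without storing the whole trajectory."""
    z, dt = z0.copy(), (t1 - t0) / steps
    for _ in range(steps):
        z = z + dt * vector_field(z)
    return z

z1 = odeint_euler(np.array([1.0, 0.0]))
```

The "layer depth" of the network is now the solver's step count, which is what makes the architecture continuous-depth.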

There are various laypersons’ introductions/tutorials in this area, including the simple and practical magical take in Julia. See also the CasADi example.

Learning an ODE, in particular a purely deterministic process, feels unsatisfying; we want a model which encodes responses and effects of interactions. It is not ideal to have time series models which need to encode everything in an initial state.

Also, we would prefer models to be stochastic.
Learnable *SDEs* are probably what we want.
I’m particularly interested in jump ODE regression.
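A hedged sketch of what "learnable SDE" means mechanically: replace the deterministic vector field with a drift plus diffusion and integrate with Euler–Maruyama. The drift parameters here are random stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Euler-Maruyama simulation of dz = f_theta(z) dt + sigma dW,
# the stochastic analogue of the neural-ODE forward pass.
theta = 0.1 * rng.standard_normal((2, 2))  # stand-in learnable drift params

def drift(z):
    return theta @ z

def euler_maruyama(z0, sigma=0.1, t1=1.0, steps=200):
    z, dt = z0.copy(), t1 / steps
    for _ in range(steps):
        noise = rng.standard_normal(z.shape)
        z = z + dt * drift(z) + sigma * np.sqrt(dt) * noise
    return z

z1 = euler_maruyama(np.array([1.0, 0.0]))
```

Each call produces a different trajectory, which is the point: the model encodes a distribution over paths rather than a single deterministic flow from the initial state.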

Homework: Duvenaud again, tweeting some explanatory animations.

Note the connection to reparameterization tricks, in that neural ODEs give you cheap differentiable reparameterizations.

Gu et al. (2021) unify neural ODEs with RNNs.

How do you do ensemble training for posterior predictives in neural ODEs? How do you guarantee stability in the learned dynamics?

See recursive identification for generic theory of learning under the distribution shift induced by a moving parameter vector.

Interesting package of tools from Christopher Ré’s lab, at the intersection of recurrent networks and linear feedback systems. See HazyResearch/state-spaces: Sequence Modeling with Structured State Spaces. I find these aesthetically satisfying, because I spent two years of my PhD trying to solve the same problem, and failed. These folks did a better job, so I find it slightly validating that the idea was not stupid.
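The basic move in that line of work can be sketched without any of the clever parameterisation: take a continuous-time linear state-space model, discretize it (the bilinear/Tustin transform is one standard choice used in these papers), and run it as a linear recurrence over the input sequence. The matrices below are random placeholders, not the structured HiPPO-style matrices the real models use:

```python
import numpy as np

rng = np.random.default_rng(0)

def discretize(A, B, dt):
    """Bilinear (Tustin) discretization of x'(t) = A x(t) + B u(t)."""
    n = A.shape[0]
    I = np.eye(n)
    Ad = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
    Bd = np.linalg.solve(I - dt / 2 * A, dt * B)
    return Ad, Bd

n = 4
A = -np.eye(n) + 0.1 * rng.standard_normal((n, n))  # roughly stable
B = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))

Ad, Bd = discretize(A, B, dt=0.1)

# Run the discretized SSM as an RNN-style linear scan over an input signal.
x = np.zeros((n, 1))
ys = []
for u in np.sin(np.linspace(0, 2 * np.pi, 50)):
    x = Ad @ x + Bd * u
    ys.append((C @ x).item())
```

Because the recurrence is linear, it can equivalently be unrolled into a long convolution, which is what makes these layers trainable at sequence lengths where ordinary RNNs struggle.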

- google-research/torchsde: Differentiable SDE solvers with GPU support and efficient sensitivity analysis (Kidger et al. 2021; X. Li et al. 2020).
- Patrick Kidger’s thesis is the current canonical textbook on ODE learning (Kidger 2022).
- Corenflos et al. (2021) describe an optimal transport method.
- Campbell et al. (2021) describe variational inference that factors out the unknown parameters.

Andersson, Joel A. E., Joris Gillis, Greg Horn, James B. Rawlings, and Moritz Diehl. 2019.“CasADi: A Software Framework for Nonlinear Optimization and Optimal Control.”*Mathematical Programming Computation* 11 (1): 1–36.

Anil, Cem, James Lucas, and Roger Grosse. 2018.“Sorting Out Lipschitz Function Approximation,” November.

Arridge, Simon, Peter Maass, Ozan Öktem, and Carola-Bibiane Schönlieb. 2019.“Solving Inverse Problems Using Data-Driven Models.”*Acta Numerica* 28 (May): 1–174.

Babtie, Ann C., Paul Kirk, and Michael P. H. Stumpf. 2014.“Topological Sensitivity Analysis for Systems Biology.”*Proceedings of the National Academy of Sciences* 111 (52): 18507–12.

Bachouch, Achref, Côme Huré, Nicolas Langrené, and Huyen Pham. 2020.“Deep Neural Networks Algorithms for Stochastic Control Problems on Finite Horizon: Numerical Applications.”*arXiv:1812.05916 [Math, q-Fin, Stat]*, January.

Campbell, Andrew, Yuyang Shi, Tom Rainforth, and Arnaud Doucet. 2021.“Online Variational Filtering and Parameter Learning.” In.

Chang, Bo, Minmin Chen, Eldad Haber, and Ed H. Chi. 2019.“AntisymmetricRNN: A Dynamical System View on Recurrent Neural Networks.” In*Proceedings of ICLR*.

Chang, Bo, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. 2018.“Multi-Level Residual Networks from Dynamical Systems View.” In*Proceedings of ICLR*.

Chen, Boyuan, Kuang Huang, Sunand Raghupathi, Ishaan Chandratreya, Qiang Du, and Hod Lipson. 2022.“Automated Discovery of Fundamental Variables Hidden in Experimental Data.”*Nature Computational Science* 2 (7): 433–42.

Chen, Tian Qi, and David K Duvenaud. n.d.“Neural Networks with Cheap Differential Operators,” 11.

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018.“Neural Ordinary Differential Equations.” In*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc.

Choromanski, Krzysztof, Jared Quincy Davis, Valerii Likhosherstov, Xingyou Song, Jean-Jacques Slotine, Jacob Varley, Honglak Lee, Adrian Weller, and Vikas Sindhwani. 2020.“An Ode to an ODE.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Corenflos, Adrien, James Thornton, George Deligiannidis, and Arnaud Doucet. 2021.“Differentiable Particle Filtering via Entropy-Regularized Optimal Transport.”*arXiv:2102.07850 [Cs, Stat]*, June.

Course, Kevin, Trefor Evans, and Prasanth Nair. 2020.“Weak Form Generalized Hamiltonian Learning.” In*Advances in Neural Information Processing Systems*. Vol. 33.

De Brouwer, Edward, Jaak Simm, Adam Arany, and Yves Moreau. 2019.“GRU-ODE-Bayes: Continuous Modeling of Sporadically-Observed Time Series.” In*Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Dupont, Emilien, Arnaud Doucet, and Yee Whye Teh. 2019.“Augmented Neural ODEs.”*arXiv:1904.01681 [Cs, Stat]*, April.

E, Weinan. 2017.“A Proposal on Machine Learning via Dynamical Systems.”*Communications in Mathematics and Statistics* 5 (1): 1–11.

———. 2021.“The Dawning of a New Era in Applied Mathematics.”*Notices of the American Mathematical Society* 68 (04): 1.

E, Weinan, Jiequn Han, and Qianxiao Li. 2018.“A Mean-Field Optimal Control Formulation of Deep Learning.”*arXiv:1807.01083 [Cs, Math]*, July.

Eguchi, Shoichi, and Yuma Uehara. n.d.“Schwartz-Type Model Selection for Ergodic Stochastic Differential Equation Models.”*Scandinavian Journal of Statistics* n/a (n/a).

Finlay, Chris, Jörn-Henrik Jacobsen, Levon Nurbekyan, and Adam M Oberman. n.d.“How to Train Your Neural ODE: The World of Jacobian and Kinetic Regularization.” In*ICML*, 14.

Finzi, Marc, Ke Alexander Wang, and Andrew G. Wilson. 2020.“Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Garnelo, Marta, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. 2018.“Conditional Neural Processes.”*arXiv:1807.01613 [Cs, Stat]*, July, 10.

Garnelo, Marta, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. 2018.“Neural Processes,” July.

Gholami, Amir, Kurt Keutzer, and George Biros. 2019.“ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs.”*arXiv:1902.10298 [Cs]*, February.

Ghosh, Arnab, Harkirat Behl, Emilien Dupont, Philip Torr, and Vinay Namboodiri. 2020.“STEER : Simple Temporal Regularization For Neural ODE.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Gierjatowicz, Patryk, Marc Sabate-Vidales, David Šiška, Lukasz Szpruch, and Žan Žurič. 2020.“Robust Pricing and Hedging via Neural SDEs.”*arXiv:2007.04154 [Cs, q-Fin, Stat]*, July.

Grathwohl, Will, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. 2018.“FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models.”*arXiv:1810.01367 [Cs, Stat]*, October.

Gu, Albert, Karan Goel, and Christopher Ré. 2021.“Efficiently Modeling Long Sequences with Structured State Spaces.”

Gu, Albert, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. 2021.“Combining Recurrent, Convolutional, and Continuous-Time Models with Linear State Space Layers.” In*Advances in Neural Information Processing Systems*, 34:572–85. Curran Associates, Inc.

Haber, Eldad, Felix Lucka, and Lars Ruthotto. 2018.“Never Look Back - A Modified EnKF Method and Its Application to the Training of Neural Networks Without Back Propagation.”*arXiv:1805.08034 [Cs, Math]*, May.

Han, Jiequn, Arnulf Jentzen, and Weinan E. 2018.“Solving High-Dimensional Partial Differential Equations Using Deep Learning.”*Proceedings of the National Academy of Sciences* 115 (34): 8505–10.

Haro, A. 2008.“Automatic Differentiation Methods in Computational Dynamical Systems: Invariant Manifolds and Normal Forms of Vector Fields at Fixed Points.”*IMA Note*.

Hasani, Ramin, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, and Daniela Rus. 2022.“Closed-Form Continuous-Time Neural Networks.”*Nature Machine Intelligence* 4 (11): 992–1003.

Hasani, Ramin, Mathias Lechner, Alexander Amini, Daniela Rus, and Radu Grosu. 2020.“Liquid Time-Constant Networks.”*arXiv:2006.04439 [Cs, Stat]*, December.

He, Junxian, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019.“Lagging Inference Networks and Posterior Collapse in Variational Autoencoders.” In*Proceedings of ICLR*.

Holzschuh, Benjamin, Simona Vegetti, and Nils Thuerey. 2022.“Score Matching via Differentiable Physics,” 7.

Huh, In, Eunho Yang, Sung Ju Hwang, and Jinwoo Shin. 2020.“Time-Reversal Symmetric ODE Network.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Huré, Côme, Huyên Pham, Achref Bachouch, and Nicolas Langrené. 2018.“Deep Neural Networks Algorithms for Stochastic Control Problems on Finite Horizon, Part I: Convergence Analysis.”*arXiv:1812.04300 [Math, Stat]*, December.

Jia, Junteng, and Austin R Benson. 2019.“Neural Jump Stochastic Differential Equations.” In*Advances in Neural Information Processing Systems 32*, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, 9847–58. Curran Associates, Inc.

Kaul, Shiva. 2020.“Linear Dynamical Systems as a Core Computational Primitive.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Kelly, Jacob, Jesse Bettencourt, Matthew James Johnson, and David Duvenaud. 2020.“Learning Differential Equations That Are Easy to Solve.” In.

Kidger, Patrick. 2022.“On Neural Differential Equations.” Oxford.

Kidger, Patrick, Ricky T. Q. Chen, and Terry J. Lyons. 2021.“‘Hey, That’s Not an ODE’: Faster ODE Adjoints via Seminorms.” In*Proceedings of the 38th International Conference on Machine Learning*, 5443–52. PMLR.

Kidger, Patrick, James Foster, Xuechen Li, and Terry J. Lyons. 2021.“Neural SDEs as Infinite-Dimensional GANs.” In*Proceedings of the 38th International Conference on Machine Learning*, 5453–63. PMLR.

Kidger, Patrick, James Morrill, James Foster, and Terry Lyons. 2020.“Neural Controlled Differential Equations for Irregular Time Series.”*arXiv:2005.08926 [Cs, Stat]*, November.

Kochkov, Dmitrii, Alvaro Sanchez-Gonzalez, Jamie Smith, Tobias Pfaff, Peter Battaglia, and Michael P Brenner. 2020.“Learning Latent FIeld Dynamics of PDEs.” In*Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS)*, 7.

Kolter, J Zico, and Gaurav Manek. 2019.“Learning Stable Deep Dynamics Models.” In*Advances in Neural Information Processing Systems*, 9.

Krishnamurthy, Kamesh, Tankut Can, and David J. Schwab. 2020.“Theory of Gating in Recurrent Neural Networks.” In*arXiv:2007.14823 [Cond-Mat, Physics:nlin, q-Bio]*.

Lawrence, Nathan, Philip Loewen, Michael Forbes, Johan Backstrom, and Bhushan Gopaluni. 2020.“Almost Surely Stable Deep Dynamics.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Li, Xuechen, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. 2020.“Scalable Gradients for Stochastic Differential Equations.” In*International Conference on Artificial Intelligence and Statistics*, 3870–82. PMLR.

Li, Yuhong, Tianle Cai, Yi Zhang, Deming Chen, and Debadeepta Dey. 2022.“What Makes Convolutional Models Great on Long Sequence Modeling?” arXiv.

Lou, Aaron, Derek Lim, Isay Katsman, Leo Huang, Qingxuan Jiang, Ser Nam Lim, and Christopher M. De Sa. 2020.“Neural Manifold Ordinary Differential Equations.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Lu, Lu, Pengzhan Jin, and George Em Karniadakis. 2020.“DeepONet: Learning Nonlinear Operators for Identifying Differential Equations Based on the Universal Approximation Theorem of Operators.”*arXiv:1910.03193 [Cs, Stat]*, April.

Lu, Yulong, and Jianfeng Lu. 2020.“A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Massaroli, Stefano, Michael Poli, Michelangelo Bin, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. 2020.“Stable Neural Flows.”*arXiv:2003.08063 [Cs, Math, Stat]*, March.

Massaroli, Stefano, Michael Poli, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. 2020.“Dissecting Neural ODEs.” In*arXiv:2002.08071 [Cs, Stat]*.

Mhammedi, Zakaria, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. 2017.“Efficient Orthogonal Parametrisation of Recurrent Neural Networks Using Householder Reflections.” In*PMLR*, 2401–9.

Mishler, Alan, and Edward Kennedy. 2021.“FADE: FAir Double Ensemble Learning for Observable and Counterfactual Outcomes.”*arXiv:2109.00173 [Cs, Stat]*, August.

Morrill, James, Patrick Kidger, Cristopher Salvi, James Foster, and Terry Lyons. 2020.“Neural CDEs for Long Time Series via the Log-ODE Method.” In, 5.

Nguyen, Long, and Andy Malinsky. 2020.“Exploration and Implementation of Neural Ordinary Diﬀerential Equations,” 34.

Niu, Murphy Yuezhen, Lior Horesh, and Isaac Chuang. 2019.“Recurrent Neural Networks in the Eye of Differential Equations.”*arXiv:1904.12933 [Quant-Ph, Stat]*, April.

Norcliffe, Alexander, Cristian Bodnar, Ben Day, Jacob Moss, and Pietro Liò. 2020.“Neural ODE Processes.” In.

Oreshkin, Boris N., Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020.“N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting.”*arXiv:1905.10437 [Cs, Stat]*, February.

Palis, J. 1974.“Vector Fields Generate Few Diffeomorphisms.”*Bulletin of the American Mathematical Society* 80 (3): 503–5.

Peluchetti, Stefano, and Stefano Favaro. 2019.“Neural SDE - Information Propagation Through the Lens of Diffusion Processes.” In*Workshop on Bayesian Deep LEarning*, 7.

———. 2020.“Infinitely Deep Neural Networks as Diffusion Processes.” In*International Conference on Artificial Intelligence and Statistics*, 1126–36. PMLR.

Pfau, David, and Danilo Rezende. 2020.“Integrable Nonparametric Flows.” In, 7.

Poli, Michael, Stefano Massaroli, Atsushi Yamashita, Hajime Asama, and Jinkyoo Park. 2020a.“Hypersolvers: Toward Fast Continuous-Depth Models.” In*Advances in Neural Information Processing Systems*. Vol. 33.

———. 2020b.“TorchDyn: A Neural Differential Equations Library.”*arXiv:2009.09346 [Cs]*, September.

Rackauckas, Christopher. 2019.“The Essential Tools of Scientific Machine Learning (Scientific ML).”

Rackauckas, Christopher, Yingbo Ma, Vaibhav Dixit, Xingjian Guo, Mike Innes, Jarrett Revels, Joakim Nyberg, and Vijay Ivaturi. 2018.“A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions.”*arXiv:1812.01892 [Cs]*, December.

Rackauckas, Christopher, Yingbo Ma, Julius Martensen, Collin Warner, Kirill Zubov, Rohit Supekar, Dominic Skinner, Ali Ramadhan, and Alan Edelman. 2020.“Universal Differential Equations for Scientific Machine Learning.”*arXiv:2001.04385 [Cs, Math, q-Bio, Stat]*, August.

Ray, Deep, Orazio Pinti, and Assad A. Oberai. 2023.“Deep Learning and Computational Physics (Lecture Notes).”

Revach, Guy, Nir Shlezinger, Ruud J. G. van Sloun, and Yonina C. Eldar. 2021.“Kalmannet: Data-Driven Kalman Filtering.” In*ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, 3905–9.

Roeder, Geoffrey, Paul K. Grant, Andrew Phillips, Neil Dalchau, and Edward Meeds. 2019.“Efficient Amortised Bayesian Inference for Hierarchical and Nonlinear Dynamical Systems.”*arXiv:1905.12090 [Cs, Stat]*, May.

Ruthotto, Lars, and Eldad Haber. 2020.“Deep Neural Networks Motivated by Partial Differential Equations.”*Journal of Mathematical Imaging and Vision* 62 (3): 352–64.

Saemundsson, Steindor, Alexander Terenin, Katja Hofmann, and Marc Peter Deisenroth. 2020.“Variational Integrator Networks for Physically Structured Embeddings.”*arXiv:1910.09349 [Cs, Stat]*, March.

Sanchez-Gonzalez, Alvaro, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter Battaglia. 2020.“Learning to Simulate Complex Physics with Graph Networks.” In*Proceedings of the 37th International Conference on Machine Learning*, 8459–68. PMLR.

Schirmer, Mona, Mazin Eltayeb, Stefan Lessmann, and Maja Rudolph. 2022.“Modeling Irregular Time Series with Continuous Recurrent Units.” arXiv.

Schmidt, Jonathan, Nicholas Krämer, and Philipp Hennig. 2021.“A Probabilistic State Space Model for Joint Inference from Differential Equations and Data.”*arXiv:2103.10153 [Cs, Stat]*, June.

Shlezinger, Nir, Jay Whang, Yonina C. Eldar, and Alexandros G. Dimakis. 2020.“Model-Based Deep Learning.”*arXiv:2012.08405 [Cs, Eess]*, December.

Şimşekli, Umut, Ozan Sener, George Deligiannidis, and Murat A. Erdogdu. 2020.“Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks.”*CoRR* abs/2006.09313.

Stapor, Paul, Fabian Fröhlich, and Jan Hasenauer. 2018.“Optimization and Uncertainty Analysis of ODE Models Using 2nd Order Adjoint Sensitivity Analysis.”*bioRxiv*, February, 272005.

Thuerey, Nils, Philipp Holl, Maximilian Mueller, Patrick Schnell, Felix Trost, and Kiwon Um. 2021.*Physics-Based Deep Learning*. WWW.

Tran, Alasdair, Alexander Mathews, Cheng Soon Ong, and Lexing Xie. 2021.“Radflow: A Recurrent, Aggregated, and Decomposable Model for Networks of Time Series.” In*Proceedings of the Web Conference 2021*, 730–42. Ljubljana Slovenia: ACM.

Tzen, Belinda, and Maxim Raginsky. 2019.“Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit.”*arXiv:1905.09883 [Cs, Stat]*, October.

Vardasbi, Ali, Telmo Pessoa Pires, Robin M. Schmidt, and Stephan Peitz. 2023.“State Spaces Aren’t Enough: Machine Translation Needs Attention.” arXiv.

Vorontsov, Eugene, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. 2017.“On Orthogonality and Learning Recurrent Networks with Long Term Dependencies.” In*PMLR*, 3570–78.

Wang, Chuang, Hong Hu, and Yue M. Lu. 2019.“A Solvable High-Dimensional Model of GAN.”*arXiv:1805.08349 [Cond-Mat, Stat]*, October.

Wang, Rui, Robin Walters, and Rose Yu. 2022.“Data Augmentation Vs. Equivariant Networks: A Theory of Generalization on Dynamics Forecasting.” arXiv.

Wang, Sifan, Xinling Yu, and Paris Perdikaris. 2020.“When and Why PINNs Fail to Train: A Neural Tangent Kernel Perspective,” July.

Yang, Liu, Dongkun Zhang, and George Em Karniadakis. 2020.“Physics-Informed Generative Adversarial Networks for Stochastic Differential Equations.”*SIAM Journal on Scientific Computing* 42 (1): A292–317.

Yıldız, Çağatay, Markus Heinonen, and Harri Lähdesmäki. 2019.“ODE\(^2\)VAE: Deep Generative Second Order ODEs with Bayesian Neural Networks.”*arXiv:1905.10994 [Cs, Stat]*, October.

Zammit-Mangion, Andrew, and Christopher K. Wikle. 2020.“Deep Integro-Difference Equation Models for Spatio-Temporal Forecasting.”*Spatial Statistics* 37 (June): 100408.

Zhang, Han, Xi Gao, Jacob Unterman, and Tom Arodz. 2020.“Approximation Capabilities of Neural ODEs and Invertible Residual Networks.”*arXiv:1907.12998 [Cs, Stat]*, February.

Zhi, Weiming, Tin Lai, Lionel Ott, Edwin V. Bonilla, and Fabio Ramos. 2022.“Learning Efficient and Robust Ordinary Differential Equations via Invertible Neural Networks.” In*International Conference on Machine Learning*, 27060–74. PMLR.

I’m not a fan of Dropbox as a file sync option, but sometimes you need to use it to communicate with a colleague. I talk about Dropbox here, but GDrive and OneDrive also fit in this category: things that we sometimes need to work with but which I fundamentally do not trust.

The upshot of `rclone` is that I can pull changes from Dropbox into my git repository thus:

`rclone sync --exclude=".git/" --update dropbox:ProjectForGitHaters/ ./ProjectForGitHaters/`

and push changes into Dropbox from the git repo like *so*:

`rclone sync --exclude=".git/" --update ./ProjectForGitHaters/ dropbox:ProjectForGitHaters/`

My colleagues need never know that I am using modern version control, change-tracking, merging, diffing and so on.

In practice, to exclude a lot of files at once, I recycle a standard exclude list from `syncthing` and replace `--exclude=".git/"` with `--exclude-from=".stignore"` in those commands.
And to make *sure* I am not accidentally syncing git repo stuff, I use the `--dry-run` option to verify that the expected files are getting copied/deleted/whatever.
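Putting those pieces together, a cautious pull from Dropbox looks something like this (project path illustrative):

```
rclone sync --exclude-from=".stignore" --update --dry-run dropbox:ProjectForGitHaters/ ./ProjectForGitHaters/
```

Once the preview looks right, drop `--dry-run` and run it for real.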

Maestral is an open-source Dropbox client written in Python. The project’s main goal is to provide a client for platforms and file systems that are no longer directly supported by Dropbox. This was motivated by Dropbox temporarily dropping support for many Linux file systems, but extends to systems that no longer meet Dropbox’s minimum requirement of glibc >= 2.19, such as CentOS 6 and 7.

Limitations:

Currently, Maestral does not support Dropbox Paper, the management of Dropbox teams and the management of shared folder settings. If you need any of this functionality, please use the Dropbox website or the official client.

Maestral uses the public Dropbox API which, unlike the official client, does not support transferring only those parts of a file which changed (“binary diff”). Maestral may therefore use more bandwidth than the official client. However, it will avoid uploading […]

There are tools to turn even my awful unencrypted untrustworthy system into an encrypted one. Cryptomator is one cuddly, friendly option. So is the more austere rclone, as mentioned earlier. These encrypt everything inside a certain sync folder, in particular stopping your snooping corporate sync provider from reading it. Even a crappy spying provider such as Google, Microsoft or Dropbox can be made safer. Both those options are free and simple.

The drawbacks that immediately occur to me are

- this does not help with sharing files with peers, who still need to decrypt stuff somehow (although that’s a challenge for any encrypted service)
- you still have to run the provider’s sync software on your computer, which means trusting their client code if not their server code.
- files are encrypted individually so you are still leaking some information about what kind of files they are in their size and usage patterns.
- There are several solutions to do this, but AFAICT they use mutually incompatible encryption, so I can’t benefit from sharing files across many apps securely

NB I could do this anyway by manually encrypting everything, but would I? No, because it’s slow and tedious. I want a nice GUI so that this option is easy and lazy.

If you don’t mind whether the files are local or not, you could use rclone’s encryption mode, which talks directly to the remote file store and also encrypts the content. Rclone can do everything.

This works pretty well. I was running Dropbox and syncthing on a spare computer I had lying around campus to automatically synchronise the Dropbox stuff I need.
One minus was that occasionally I got logged out of that machine when I was away, causing syncing to break.
These days I use `rclone` to communicate with dropboxers, which is both cheaper and more reliable.

dbxfs, ff3d or rclone (above) allow you to mount the *remote* Dropbox file system without installing Dropbox’s suspect client software.
This seems slow and clunky; I think I would only do this if I needed to coordinate on some Dropbox thing *in realtime* but mistrusted the client.
That sounds like hell to me.
Can I not just use git and handle coordination in my own sweet time?
For my offline collaboration style, manual-and-asynchronous syncing with rclone is better.

If I *must* use Dropbox, I could perhaps sandbox it.
Dropbox itself doesn’t seem to ship with any sandboxing natively (aside: why not? Does some part of their business model depend on intrusive access to everything I do?).
I did try a containerized version using Docker.
However, in practice it was fragile and RAM-heavy, difficult to debug, and overall not recommended.
Possibly other sandboxes would be better? But meh.

How to install the right versions of everything for some python code I am developing?
How to deploy that sustainably? How to share it with others?
There are two problems: installing the right package dependencies, and keeping the right dependency versions for *this* project.
In python there are various integrated solutions that solve these two problems at once, with varying degrees of success.
It is confusing and chaotic, unpleasant and bad.
Python is bad at this.
To use a language you should ideally not have to develop opinions about many long-running disputes, some of which have lately resolved and many of which will probably stay with us forever.

In the before-times there were many python packaging standards: distutils and what-not. AFAICT, unless I am migrating extremely old code, I should ignore everything about these.

**tl;dr**: only pip and conda support hardware specification in practice.
Users of GPUs must ignore any other options, no matter how attractive all the other options might seem at first glance.

Many packages specify *local versions* for particular architectures as a part of their functionality.
For example, pytorch comes in various flavours, which when using `pip` are selected like so:

```
# CPU flavour
pip install torch==1.10.0+cpu -f https://download.pytorch.org/whl/cpu/torch_stable.html
# GPU flavour
pip install torch==1.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
```

The local version is given by the `+cpu` or `+cu113` bit, and it changes what code will be executed when using these packages.
Specifying a GPU version is essential for many machine learning projects (essential, that is, if I do not want my code to run orders of magnitude slower).
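As a sanity check, the flavour actually installed can be read off the version string. The helpers below are a toy of my own devising (not part of pip, pytorch or the `packaging` library), just to illustrate how the PEP 440 `+local` segment works:

```python
def local_segment(version: str) -> str:
    """Return the PEP 440 local-version label after '+', or '' if none."""
    _public, _sep, local = version.partition("+")
    return local

def is_cuda_build(version: str) -> bool:
    """Heuristic: pytorch CUDA wheels label their local segment 'cuNNN'."""
    return local_segment(version).startswith("cu")
```

So `local_segment("1.10.0+cu113")` yields `"cu113"`, while a wheel with no local segment yields the empty string.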
The details of how this can be controlled with regard to the python packaging ecosystem are somewhat contentious and complicated, and thus not supported by any of the new wave options like `poetry` or `pipenv`. Brian Wilson argues:

During my dive into the open-source abyss that is ML packages and `+localVersions` I discovered lots of people have strong opinions about what it should not be and like to tell other people they're wrong. Other people with opinions about what it could be are too afraid of voicing them lest there be some unintended consequence. PSF has asserted what they believe to be the intended state in PEP-440 (no local versions published) but the solution (PEP-459) is not an ML Model friendly solution because the installation providers (pip, pipenv, poetry) don’t have enough standardized hooks into the underlying hardware (cpu vs gpu vs cuda lib stack) to even understand which version to pull, let alone the Herculean effort it would take to get even just pytorch to update their package metadata.

There is no evidence that this logjam will resolve any time soon.
It turns out that packaging projects with GPU code is hard.
Since I do neural network stuff and thus use GPU/CPU versions of packages, this means that I can effectively ignore most of the python environment alternatives on this page.
The two that work are conda and pip, which support a minimum viable local-version package system *de facto*, and if they are less smooth or pleasant than the new systems, at least I am not alone.

Least-nerdview guide: Vicki Boykis, Alice in Python projectland.

Simplest readable guide is python-packaging.

PyPI Quick and Dirty includes some good tips, such as using twine to make it automaticker.

Official docs are no longer awful but are *slightly* stale, and especially perfunctory for compilation. There is a community effort to document the issues of compiled packages in pypackaging-native (tl;dr: it is hard).

Kenneth Reitz shows rather than tells with a heavily documented setup.py.

Try Zed Shaw’s signature aggressively cynical and reasonably practical explanation of project structure, with bonus explication of how you should expect much time-wasting yak shaving like this if you want to do software.

- Or copy pyskel.
- Or generate a project structure with cookiecutter.

Updated: What the heck is `pyproject.toml`?

`pip`

The default python package installer. It is best spelled as

`python -m pip install package_name`

To snapshot dependencies:

`python -m pip freeze > requirements.txt`

To restore dependencies:

`python -m pip install -r requirements.txt`
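The resulting `requirements.txt` is just a flat list of pinned specifiers, one per line, along these lines (package names and versions purely illustrative):

```
certifi==2023.7.22
numpy==1.24.4
requests==2.31.0
```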

`venv` works with it.
It is a good default choice, with a widely supported and adequate, if not awesome, workflow.

Versions look a little like this:

```
SomeProject
SomeProject == 1.3
SomeProject >= 1.2, < 2.0
SomeProject[foo, bar]
SomeProject ~= 1.4.2
SomeProject == 5.4 ; python_version < '3.8'
SomeProject ; sys_platform == 'win32'
requests [security] >= 2.8.1, == 2.8.* ; python_version < "2.7"
```

The `~=` is a handy lazy first stop; it permits point releases, but not minor releases, so e.g. `~=1.3.0` will also satisfy itself with version 1.3.9.
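The compatible-release rule can be made concrete with a toy re-implementation. This is purely illustrative (real tools use the `packaging` library, which also handles pre-releases, epochs and normalization); per PEP 440, `~= X.Y.Z` means `>= X.Y.Z` together with `== X.Y.*`:

```python
def compatible_release(candidate: str, spec: str) -> bool:
    """Toy check of the PEP 440 '~=' rule, for plain numeric versions only."""
    cand = [int(part) for part in candidate.split(".")]
    want = [int(part) for part in spec.split(".")]
    # Must stay within the same release series: every component except the
    # last one specified has to match exactly (the '== X.Y.*' half).
    if cand[: len(want) - 1] != want[:-1]:
        return False
    # And must be at least the specified version (the '>= X.Y.Z' half).
    return cand >= want
```

So `compatible_release("1.3.9", "1.3.0")` is true, but `compatible_release("1.4.0", "1.3.0")` is not.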

`pipx`

Pro tip: pipx:

pipx is made specifically for application installation, as it adds isolation yet still makes the apps available in your shell: pipx creates an isolated environment for each application and its associated packages.

That is, pipx is an application that installs global applications for you. (There is a bootstrapping problem: How to install pipx itself.)

pip has a heavy cache overhead.
If disk space is at a premium, I invoke it as `pip --no-cache-dir`.

A parallel system to pip, designed to do all the work of installing software, especially python software with hefty compiled dependencies.

There are two parts here with two separate licenses

- the anaconda python distribution
- the conda python package manager.

I am slightly confused about how these two relate (can I install a non-anaconda python distribution through the conda package manager?). The distinction is important, since licensing anaconda can be expensive. See, e.g.

- Anaconda is not free for commercial use (anymore) so what are the alternatives?
- Conda/Anaconda no longer free to use?
- See also mamba below, which aims to reduce licensing risk by reimplementing the more licensing-vulnerable parts of the anaconda ecosystem

Some things that are (or were?) painful to install by pip are painless via conda. Contrariwise, some things that are painful to install by conda are easy by pip.

I recommend working out which pain points are worse in this complicated decision by trial and error. Sometimes it would be worth the administrative burden of understanding conda’s current licensing and future licensing risks, but if it does not bring substantial value, choose pip.

This is an updated recommendation; previously I preferred conda — pip used to be much worse, and anaconda’s licensing used to be less restrictive. Now I think anaconda cannot be relied upon IP-wise.

Download e.g. Linux x64 Miniconda from the download page.

```
bash Miniconda3-latest-Linux-x86_64.sh
# login/logout here
# or do something like `exec bash -` if you are fancy
# Less aggressive conda
conda config --set auto_activate_base false
# conda for fish users
conda init fish
```

Alternatively, try miniforge, a conda-forge distribution, or fastchan, fast.ai’s conda mini-distribution.

```
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh
```

It is very much worth installing one of these minimalist dists rather than the default anaconda distro, since the anaconda default is gigantic but nonetheless lacks what I need, so it simply wastes space. Some of these might additionally have less onerous licensing than the mainline? I am not sure.

If I want to install something with tricky dependencies like ViTables, I do this:

```
conda install pytables=3.2
conda install pyqt=4
```

Aside: I use fish shell, so I need to do some extra setup. Specifically, I add the line

`source (conda info --root)/etc/fish/conf.d/conda.fish`

into `~/.config/fish/config.fish`.
These days this is automated by

`conda init fish`

For jupyter compatibility one needs

`conda install nb_conda_kernels`

The main selling point of conda is that specifying dependencies for*ad hoc* python scripts or packages is easy.

Conda has a slightly different dependency management and packaging workflow than the pip ecosystem.
See, e.g., Tim Hopper’s explanation of this `environment.yml` malarkey, or the creators’ rationale and manual.

One exports the current conda environment config, by convention, into `environment.yml`.

`conda env export > environment.yml`

`conda env create --file environment.yml`
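For reference, a hand-trimmed, shareable `environment.yml` might look something like this (names and versions purely illustrative; the nested `pip:` list is for PyPI-only packages):

```
name: myproject
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pip
  - pip:
      - some-pypi-only-package
```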

Which to use out of `conda env create` and `conda create`?
If it involves `.yaml` environment configs, then `conda env create`.
The capability differences between these two are a quagmire of confusing behaviour, opaque errors, bad documentation and sadness.

One point of friction that I rapidly encountered is that the automatically-created environments are not terribly generic; I might specify from the command line a package that I know will install sanely on any platform (`matplotlib`, say), but the version as stored in the environment file is specific to the platform where I installed it (macOS, linux, windows…) and to the architecture (x64, ARM…).
For GPU software there are even more incompatibilities, because there are more choices of architecture.
So to share environments with collaborators on different platforms, I need to… *be* them, I guess? Buy them new laptops that match my laptop?
idk, this seems weird; maybe I’m missing something.

**NB** Conda will fill up my hard disk if not regularly disciplined via conda clean.

`conda clean --all --yes`

If I have limited space in my home dir, I might need to move the cache by configuring `pkgs_dirs` in `~/.condarc`:

`conda config --add pkgs_dirs /SOME/OTHER/PATH/.conda`

Possibly also required?

`chmod a-rwx ~/.conda`

I might also want to not have the gigantic MKL library installed, not being a fan. It comes baked in by default for most anaconda installs. I can usually disable it by request:

`conda create -n pynomkl python nomkl`

Clearly the packagers do not test this configuration so often, because it fails sometimes even for packages which notionally do not need MKL. Worth attempting, however. Between the various versions and installed copies, MKL alone was using about 10GB total on my mac when I last checked. I also try to reduce the number of copies of MKL by starting from miniconda as my base anaconda distribution, cautiously adding things as I need them.

A local environment folder is more isolated; I prefer keeping packages in a local folder to keeping all environments somewhere global, where I need to remember what I named them all.

```
conda config --set env_prompt '({name})'
conda env create --prefix ./env/myenv --file environment_linux.yml
conda activate ./env/myenv
```

Gotcha: in fish shell the first line needs to be

`conda config --set env_prompt '\({name}\)'`

I am not sure why. AFAIK, fish command substitution does not happen inside strings. Either way, this will add the line

`env_prompt: ({name})`

to `.condarc`.

Mamba is a fully compatible drop-in replacement for conda. It was started in 2019 by Wolf Vollprecht.

The introductory blog post is an enlightening read, which also explains conda better than conda explains itself. The mamba 1.0 release announcement is also very good. See mamba-org/mamba for more. The fact that the authors of this system can articulate their ideas is a major selling point IMO.

It explicitly targets package installation for less mainstream configurations, such as R and vscode development environments. In fact, it is not even python-specific.

Provide a convenient way to install developer tools in VSCode workspaces from conda-forge with micromamba. Get NodeJS, Go, Rust, Python or JupyterLab installed by running a single command.

It also inherits some of the debilities of conda, e.g. that dependencies are platform- and architecture-specific.

Robocorp tools claim to make conda install more generic.

RCC is a command-line tool that allows you to create, manage, and distribute Python-based self-contained automation packages - or robots 🤖 as we call them.

Together with the robot.yaml configuration file, `rcc` provides the foundation to build and share automation with ease. In short, the RCC toolchain helps you to get rid of the phrase “works on my machine” so that you can actually build and run your robots more freely.

`venv`

venv is now a built-in python virtual environment system in python 3. It doesn’t support python 2, but fixes various problems: e.g. it supports framework python on macOS, which is important for GUIs, and it is covered by the python docs in the python virtual environment introduction.

```
# Create venv
python3 -m venv ./venv --prompt some_arbitrary_name
# or if we want to use system packages:
python3 -m venv ./venv --prompt some_arbitrary_name --system-site-packages
# Use venv from fish OR
source ./venv/bin/activate.fish
# Use venv from bash
source ./venv/bin/activate
```

`pyenv`

pyenv is the core tool of an ecosystem which eases and automates switching between python versions. It manages python itself, and thus implicitly can be used as a manager for all the other managers. The new new hipness, at least on platforms other than windows, where it does not work.

BUT WHO MANAGES THE VIRTUALENV MANAGER MANAGER? Also, what is going on in this ecosystem of bits? Logan Jones explains:

- **pyenv** manages multiple versions of Python itself.
- **virtualenv/venv** manages virtual environments for a specific Python version.
- **pyenv-virtualenv** manages virtual environments across varying versions of Python.

Anyway, pyenv compiles a custom version of python and as such is extremely isolated from everything else. An introduction with emphasis on my area: Intro to Pyenv for Machine Learning.

```
#initial pyenv install
pyenv init
# install a specific python version
pyenv install 3.8.13
# ensure we can find that version
pyenv rehash
# switch to that version
pyenv shell 3.8.13
```

Of course, because this is adjacent to the python packaging ecosystem, it immediately becomes complicated and confusing when you try to interact with the rest of the ecosystem, e.g.,

pyenv-virtualenvwrapper is different from `pyenv-virtualenv`, which provides extended commands like `pyenv virtualenv 3.4.1 project_name` to directly help out with managing virtualenvs. `pyenv-virtualenvwrapper` helps in interacting with `virtualenvwrapper`, but `pyenv-virtualenv` provides more convenient commands, where virtualenvs are first-class pyenv versions that can be (de)activated. That’s to say, `pyenv` and `virtualenvwrapper` are still separated while `pyenv-virtualenv` is a nice combination.

Huh. I am already too bored to think. However, I did work out a command which installed a pyenv tensorflow with an isolated virtualenv:

```
brew install pyenv pyenv-virtualenv
pyenv install 3.8.6
pyenv virtualenv 3.8.6 tf2.4
pyenv activate tf2.4
pip install --upgrade pip wheel
pip install 'tensorflow-probability>=0.12' 'tensorflow<2.5' jupyter
```

For fish shell you need to add some special lines to `config.fish`:

```
set -x PYENV_ROOT $HOME/.pyenv
set -x PATH $PYENV_ROOT/bin $PATH
## fish <3.1
# status --is-interactive; and . (pyenv init -|psub)
# status --is-interactive; and . (pyenv virtualenv-init -|psub)
## fish >=3.1
status --is-interactive; and pyenv init - | source
status --is-interactive; and pyenv virtualenv-init - | source
```

No! Wait! The new new new hipness is `poetry`.
All the other previous hipnesses were not the real eternal ultimate hipness that transcends time.
I know we said this every previous time, but *this* time it’s real and our love will last forever ONO.

**⛔️⛔️UPDATE⛔️⛔️**:
OK, turns out this love was not actually quite as eternal as it seemed.
Lovely elegant design does not make up for the fact that the project is logjammed and broken in various ongoing ways; see Issue #4595: Governance—or, “what do we do with all these pull requests?”.
It might be usable if your needs are modest, or if you are prepared to jump into the project discord, which seems to be where the poetry hobbyists organise; but since I want to use this project merely incidentally, as a tool to develop something *else*, a hobbyist level of engagement is not something I can participate in.
Poetry is not ready for prime time.

Note also that poetry is having difficulty staying current with *local versions*, as made famous by CUDA-supporting packages.
There is an example of the kind of antics that make it work below.

Poetry is a tool for dependency management and packaging in Python. It allows you to declare the libraries your project depends on and it will manage (install/update) them for you.

From the introduction:

Packaging systems and dependency management in Python are rather convoluted and hard to understand for newcomers. Even for seasoned developers it might be cumbersome at times to create all files needed in a Python project: `setup.py`, `requirements.txt`, `setup.cfg`, `MANIFEST.in` and the newly added `Pipfile`.

So I wanted a tool that would limit everything to a single configuration file to do: dependency management, packaging and publishing.

It takes inspiration in tools that exist in other languages, like `composer` (PHP) or `cargo` (Rust).

And, finally, I started `poetry` to bring another exhaustive dependency resolver to the Python community apart from Conda’s.

**What about Pipenv?**

In short: I do not like the CLI it provides, or some of the decisions made, and I think we can make a better and more intuitive one.

**Editorial side-note**: Low-key dissing on similarly-dysfunctional, competing projects is an important part of python packaging.

Lazy install is via this terrifying command line (do not run if you do not know what this does):

`curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -`

Poetry could be regarded as a similar thing to `pipenv`, in that it (per default, but not necessarily) manages the dependencies in a local `venv`.
It has a much more full-service approach than systems built on `pip`.
For example, it has its own dependency resolver, which makes use of modern dependency metadata, but will also work with previous dependency specifications by brute force if needed.
It separates specified dependencies from the ones that it contingently resolves in practice, which means that the dependencies seem to transport much better than conda, which generally requires you to hand-maintain a special dependency file full of just the stuff you actually wanted.
In practice the many small conveniences and thoughtful workflow are helpful.
For example, it will set up the current package for development *per default*, so that imports work as similarly as possible across this local environment and when it is distributed to users.

```
poetry config virtualenvs.create true
poetry config virtualenvs.in-project true # local venvs are easier for my brain.
```

Sometimes the cache gets corrupted, giving errors about hashes; deleting it helps:

```
rm -r ~/Library/Caches/pypoetry/cache
rm -r ~/Library/Caches/pypoetry/artifacts
```

However, poetry does not support installing build variants/profiles, which means I cannot install GPU software, so it is useless to me.

As mentioned above, the `poetry` system does not support “local versions” well and thus in practice is onerous to use for machine learning applications.
There are workarounds: Instructions for installing PyTorch shows a representative installation specification for pytorch.

```
[tool.poetry.dependencies]
python = "^3.10"
numpy = "^1.23.2"
torch = { version = "1.12.1", source="torch"}
torchaudio = { version = "0.12.1", source="torch"}
torchvision = { version = "0.13.1", source="torch"}
[[tool.poetry.source]]
name = "torch"
url = "https://download.pytorch.org/whl/cu116"
secondary = true
```

Note that this produces various errors and reportedly downloads gigabytes of supporting files unnecessarily, but apparently works eventually.

`pipenv`

**⛔️⛔️UPDATE⛔️⛔️**:
Note that the `pipenv` system does not support “local versions” and thus in practice cannot be used for machine learning applications.
This project is dead to me.
(Bear in mind that my opinions will become increasingly outdated depending on when you read this.)

`venv` has a higher-level, er, …wrapper(?) interface(?) called pipenv.

Pipenv is a production-ready tool that aims to bring the best of all packaging worlds to the Python world. It harnesses Pipfile, pip, and virtualenv into one single command.

I switched to pipenv from poetry because it looked like it might be less chaos than poetry. I think it is, although the race is close.

HOWEVER, it is still pretty awful.
TBH, I would just use plain pip and `requirements.txt`, which, while primitive and broken, is at least broken and primitive in a well-understood way.

At time of writing the pipenv website was 3 weeks into an outage, because dependency management is a quagmire of sadness and comically broken management with a terrible Bus factor. However, the backup docs site is semi-functional, albeit too curt to be useful and AFAICT outdated. The documentation site inside github is readable.

The dependency resolver is, as the poetry devs point out, broken in its own special ways. The procedure to install modern ML frameworks, for example, is gruelling.

An introduction showing pipenv and venv used together.

For my configuration, the important settings are:

`export WORKON_HOME=~/.venvs`

To get the venv inside the project (required for sanity in my HPC) I need the following:

`export PIPENV_VENV_IN_PROJECT=1`

Pipenv will automatically load dotenv files, which is a nice touch.

Does python’s ongoing slapstick shambles of a failed consensus on dependency management fill you with distrust? Do you have the vague feeling that perhaps you should use something else to manage python, since python cannot manage itself? See generic dependency managers for an overview.

Supercomputing dep manager spack has Python-specific support.

PyPI has hundreds of thousands of packages that are not yet in Spack, and `pip` may be a perfectly valid alternative to using Spack. The main advantage of Spack over `pip` is its ability to compile non-Python dependencies. It can also build cythonized versions of a package or link to an optimized BLAS/LAPACK library like MKL, resulting in calculations that run orders of magnitudes faster. Spack does not offer a significant advantage over other python-management systems for installing and using tools like flake8 and sphinx. But if you need packages with non-Python dependencies like numpy and scipy, Spack will be very valuable to you.

Anaconda is another great alternative to Spack, and comes with its own `conda` package manager. Like Spack, Anaconda is capable of compiling non-Python dependencies. Anaconda contains many Python packages that are not yet in Spack, and Spack contains many Python packages that are not yet in Anaconda. The main advantage of Spack over Anaconda is its ability to choose a specific compiler and BLAS/LAPACK or MPI library. Spack also has better platform support for supercomputers, and can build optimized binaries for your specific microarchitecture.

Look at all these other package/build systems, which I only just now noticed! How interoperable are they?

Hatch is a modern, extensible Python project manager.

Features:

- Standardized build system with reproducible builds by default
- Robust environment management with support for custom scripts
- Easy publishing to PyPI or other indexes
- Version management
- Configurable project generation with sane defaults
- Responsive CLI, ~2-3x faster than equivalent tools

PDM is a modern Python package and dependency manager supporting the latest PEP standards. But it is more than a package manager. It boosts your development workflow in various aspects. The most significant benefit is it installs and manages packages in a similar way to `npm` that doesn’t need to create a virtualenv at all!

Feature highlights:

*Make the easy things easy and the hard things possible* is an old motto from the Perl community. Flit is entirely focused on the *easy things* part of that, and leaves the hard things up to other tools. Specifically, the easy things are pure Python packages with no build steps (neither compiling C code, nor bundling Javascript, etc.). The vast majority of packages on PyPI are like this: plain Python code, with maybe some static data files like icons included.

It’s easy to underestimate the challenges involved in distributing and installing code, because it seems like you just need to copy some files into the right place. There’s a whole lot of metadata and tooling that has to work together around that fundamental step. But with the right tooling, a developer who wants to release their code doesn’t need to know about most of that.

What, specifically, does Flit make easy?

- `flit init` helps you set up the information Flit needs about your package.
- Subpackages are automatically included: you only need to specify the top-level package.
- Data files within a package directory are automatically included. Missing data files has been a common packaging mistake with other tools.
- The version number is taken from your package’s `__version__` attribute, so that always matches the version that tools like pip see.
- `flit publish` uploads a package to PyPI, so you don’t need a separate tool to do this.

Setuptools, the most common tool for Python packaging, now has shortcuts for many of the same things. But it has to stay compatible with projects published many years ago, which limits what it can do by default.

Here are two famous options.

If we want to claim that Bayesian statistics is complete as a theory of science we cite Jaynes and Bretthorst (2003). It is not so simple as that, for various reasons. Bayesian statistics is not sufficient.

If you find yourself saying “\(P(A\mid B) = P(A)P(B\mid A)/P(B)\), *all the rest is commentary*”,
then… a naïve reading of this apparent endorsement of Bayesian statistics both underestimates how *very much* commentary there is, and excludes the important stuff that is not usefully a commentary on the measure-theoretic treatment of belief.

Don’t get me wrong, I use Bayesian statistical methods *all the time*, and these tools are incredibly useful to me.
Bayesian statistics is great for getting a handle on reasonable beliefs and even informally it is a great crutch to prevent us from being excessively certain.
But.

I am a humble practitioner, so my instinctive first objection is that *inference is mathematically challenging*, in that Bayesian reasoning frequently requires difficult manipulations of integral calculus, even for apparently basic problems.
Is the model simple but the likelihood non-exponential?
Is the likelihood exponential, but the prior non-conjugate? I know no-one who can do this in their head, except with tedious slowness.
Even if the mathematics is not too awful, I often find that basic stuff is*computationally prohibitive*, in that even simple Bayesian logistic regression can grind my laptop to a halt.
Further, there are many difficulties in practice.
e.g. If I start from sufficiently opinionated priors, I can fail to find the truth in any sensible budget of data.

But maybe that is only for doing “real” science and I should not worry so much about these details in casual use of Bayesian inference, where we are simply maintaining betting odds for various possibilities in our heads. Mayyyybe.
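Even the betting-odds reading benefits from a concrete check. Below is a minimal sketch (the function and the numbers are my own, purely illustrative) of the one genuinely easy case, a conjugate Beta-Bernoulli update, where the posterior is available in closed form:

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Conjugate update: a Beta(alpha, beta) prior on a coin's
    heads-probability with a Bernoulli likelihood yields a Beta posterior."""
    heads = sum(observations)
    tails = len(observations) - heads
    return alpha + heads, beta + tails

# A mildly opinionated prior: we lean towards the coin being fair.
alpha, beta = 5.0, 5.0
data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]  # 8 heads in 10 flips
alpha_post, beta_post = beta_bernoulli_update(alpha, beta, data)

posterior_mean = alpha_post / (alpha_post + beta_post)  # 13/20 = 0.65
mle = sum(data) / len(data)                             # 0.8
print(posterior_mean, mle)  # the prior drags the estimate towards 0.5
```

Even in this trivial case the posterior mean (0.65) sits well away from the maximum-likelihood estimate (0.8); this is exactly the kind of arithmetic that is easy to get wrong when done mentally.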

Even then, I at least simply lack time to be Bayesian for many questions of interest, unless I delude myself into thinking I am smarter than I am.
I might *claim* to be doing all manner of Bayesian arithmetic in my head when I update my opinions on real data, and I can offer you some odds.
However, I know that when I get out my notebook and attempt to cross-check my supposed “posterior updates” with pencil and paper, I am highly likely to have gotten my intuitive mental arithmetic wrong, even in apparently trivial cases.
Maybe some people are better at this than me, but I hold that even if you are substantially smarter than I am, you cannot assume much extra complexity in any given estimate before you need a notepad and a computer, or a computing cluster, to actually do the inference that we seem to want to claim we do.
This is not a thing that we can do mid-conversation.
It is a thing we can attempt offline with some quiet time, or maybe that we can get fast at for certain classes of commonly-observed problem.
It is not a way that we can afford to handle *many* novel hypotheses each day, what with the need to shower and eat etc. *Which* hypotheses do we treat rigorously?
That is itself a difficult Bayesian optimisation, with a forbiddingly complex mathematical and computational cost attached.

More stringent theoreticians would raise profound theoretical objections to universalising claims to Bayes-as-science. A philosopher of science might mention that Bayesian statistics is not sufficient to generate hypotheses, only to compare them. A probabilist would mention that inference can be badly behaved if the model class does not contain a true model (and almost no model is true), and probably other objections besides. A decision theorist might demand that you actually provide an action rule to plug into your inference procedure, because the likelihood principle does not apply in open-model settings. I am not an expert in those theoretical domains, but I mention them here because they were name-checked during my training at least, and they are points easy to miss if you come to this field from the outside.

This page is for links and further notes on this theme of claims to using Bayesian statistics as a principle for daily life or the project of science.

Upon writing this piece I notice that I had already found some grumpy rationalists complaining about other rationalists’ Bayes cred, so… maybe there is sufficient snark here and I do not need to add any more.

Deborah Mayo, in “You May Believe You Are a Bayesian But You Are Probably Wrong”, draws together some highlights of Stephen Senn (Senn 2011):

It is hard to see what exactly a Bayesian statistician is doing when interacting with a client. There is an initial period in which the subjective beliefs of the client are established. These prior probabilities are taken to be valuable enough to be incorporated in subsequent calculation. However, in subsequent steps the client is not trusted to reason. The reasoning is carried out by the statistician. As an exercise in mathematics it is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’ and indeed sometimes this point is made by Bayesians when discussing what their theory implies. …

Richard Ngo, Against strong bayesianism:

I want to lay out some intuitions about why bayesianism is not very useful as a conceptual framework for thinking either about AGI or human reasoning. This is not a critique of bayesian statistical methods; it’s instead aimed at the philosophical position that bayesianism defines an ideal of rationality which should inform our perspectives on less capable agents, also known as ”strong bayesianism”. As described here:

The Bayesian machinery is frequently used in statistics and machine learning, and some people in these fields believe it is very frequently the right tool for the job. I’ll call this position “weak Bayesianism.” There is a more extreme and more philosophical position, which I’ll call “strong Bayesianism,” that says that the Bayesian machinery is the *single correct way* to do not only statistics, but science and inductive inference in general — that it’s the “aspirin in willow bark” that makes science, and perhaps all speculative thought, work insofar as it *does* work.

It is good. Go read it. Or, if you prefer visual learning…

Sometimes. This is fraught in the M-open setting, and far from guaranteed in the nonparametric setting. See (e.g. Stuart 2010) for the kind of extra work that needs doing there.

Not usually, no. Equivalently, is the true model one of my hypotheses? No. See M-open. This has troubling implications for the quality of our inference if we are not careful, and in particular means we cannot just update with the likelihood and hope our inference is valid for all possible decision rules.

Some inference is so expensive that we cannot perform it exactly in a convenient time frame even for one model, let alone a large, or even infinite hypothesis space. Approximations and optimal estimates need to be considered.

TBC

Even in vanilla inference, this is a whole research field in itself. A starting prior is a delicate thing.

Should we go for an uninformative prior?
What an uninformative prior even*is* is a whole question.
Short version — an uninformative prior is a pain in the arse that will break your inference.

More generally, there are pathological cases where apparently-sensible priors might not be proper for weird reasons. See Larry Wasserman’s summary of a piece by Stone (Stone 1976).

Also, because Wasserman did a lot of complaining about this, see Freedman’s Neglected Theorem:

it is easy to prove that for essentially any pair of Bayesians, each thinks the other is crazy.

TBC

nostalgebraist grumps about strong Bayes as a methodology for science. Key point: you do not have all the possible models, and you would not have the computational resources to assign them posterior likelihoods even if you did; assuming that you can do Bayesian learning over them is thus a broken model for learning about the world.

TBC — connection to Lakatosian-style falsificationism.

No, it is extremely useful. It is simply not magic. It is hard, has weird failure modes, and intimidating mental and computational requirements. Being sloppy about this will cause me to update my priors towards you making mistakes about your posteriors.

Pascal’s wager, Pascal’s mugging etc. Duncan (2007); Neiva (2022).

Duncan, Craig. 2007. “The Persecutor’s Wager.” *The Philosophical Review* 116 (1): 1–50.

Gelman, Andrew, and Cosma Rohilla Shalizi. 2013. “Philosophy and the Practice of Bayesian Statistics.” *British Journal of Mathematical and Statistical Psychology* 66 (1): 8–38.

Gelman, Andrew, and Yuling Yao. 2021. “Holes in Bayesian Statistics.” *Journal of Physics G: Nuclear and Particle Physics* 48 (1): 014002.

Jaynes, Edwin Thompson, and G Larry Bretthorst. 2003. *Probability Theory: The Logic of Science*. Cambridge, UK; New York, NY: Cambridge University Press.

Kleijn, B. J. K., and A. W. van der Vaart. 2006. “Misspecification in Infinite-Dimensional Bayesian Statistics.” *The Annals of Statistics* 34 (2): 837–77.

Neiva, André. 2022. “Pascal’s Wager and Decision-Making with Imprecise Probabilities.” *Philosophia*, October.

Nickl, Richard. 2014. “Discussion of: ‘Frequentist Coverage of Adaptive Nonparametric Bayesian Credible Sets’.” *arXiv:1410.7600 [Math, Stat]*, October.

Rousseau, Judith. 2016. “On the Frequentist Properties of Bayesian Nonparametric Methods.” *Annual Review of Statistics and Its Application* 3 (1): 211–31.

Senn, Stephen. 2011. “You May Believe You Are a Bayesian But You Are Probably Wrong,” 19.

Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” *Electronic Journal of Statistics* 3: 1039–74.

Stone, Mervyn. 1976. “Strong Inconsistency from Uniform Priors.” *Journal of the American Statistical Association* 71 (353): 114–16.

Stuart, Andrew M. 2010. “Inverse Problems: A Bayesian Perspective.” *Acta Numerica* 19: 451–559.

Szabó, Botond, Aad van der Vaart, and Harry van Zanten. 2013. “Frequentist Coverage of Adaptive Nonparametric Bayesian Credible Sets.” *arXiv:1310.4489 [Math, Stat]*, October.

Stochastic optimization uses noisy (possibly approximate) first-order gradient information to find the argument that minimises

\[ x^*=\operatorname{arg min}_{x} f(x) \]

for some objective function \(f:\mathbb{R}^n\to\mathbb{R}\).

That this works with little fuss in very high dimensions is a major pillar ofdeep learning.
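As a concrete sketch (the quadratic objective, noise scale, and constant step size are my own illustrative choices, not from any reference), here is plain SGD on a one-dimensional problem where we only ever observe the gradient through noise:

```python
import random

def sgd(noisy_grad, x0, n_steps=2000, lr=0.05):
    """Minimise f by repeatedly stepping against a noisy gradient estimate."""
    x = x0
    for _ in range(n_steps):
        x -= lr * noisy_grad(x)
    return x

# Toy objective f(x) = (x - 3)^2; its exact gradient is 2(x - 3),
# but we only see it corrupted by zero-mean Gaussian noise.
random.seed(0)
x_star = sgd(lambda x: 2.0 * (x - 3.0) + random.gauss(0.0, 1.0), x0=0.0)
print(x_star)  # hovers near the minimiser x = 3
```

With a constant step size the iterate never settles exactly at the minimum; it hovers in a noise ball whose radius scales with the step size.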

The original version, stated in terms of root finding, is (Herbert Robbins and Monro 1951); the analysis was later generalised in (H. Robbins and Siegmund 1971), using martingale arguments to analyze convergence. There is some historical context in (Lai 2003). That article was written before the current craze for SGD in deep learning; after 2013 or so there is so much information on the method that the challenge becomes sifting the AI hype from the useful.
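The root-finding formulation is easy to sketch (the toy function and gain constant here are my own choices): to find a root of g(x) = E[G(x)] from noisy evaluations G(x), step with decaying gains a/n, which are not summable but are square-summable:

```python
import random

def robbins_monro(noisy_g, x0, n_steps=5000, a=1.0):
    """Find a root of g(x) = E[noisy_g(x)] using the decaying step sizes
    a/n: their sum diverges (so we can travel anywhere) while the sum of
    their squares converges (so the noise averages out)."""
    x = x0
    for n in range(1, n_steps + 1):
        x -= (a / n) * noisy_g(x)
    return x

# g(x) = x - 2 has its root at x = 2; we only see noisy evaluations.
random.seed(1)
root = robbins_monro(lambda x: (x - 2.0) + random.gauss(0.0, 1.0), x0=0.0)
print(root)  # converges towards the root x = 2
```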

I recommend Francis Bach’s Sum of geometric series trick as an introduction to proving advanced things about SGD using elementary tools.

Francesco Orabona on how to prove SGD converges:

to balance the universe of first-order methods, I decided to show how to easily prove the convergence of the iterates in SGD, even in unbounded domains.

Gradient flows are a continuous-limit SDE model of stochastic gradient descent. See gradient flows.

🏗

Zeyuan Allen-Zhu: Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1),

Zhiyuan Li and Sanjeev Arora argue:

You may remember our previous blog post showing that it is possible to do state-of-the-art deep learning with a learning rate that increases exponentially during training. It was meant to be a dramatic illustration that what we learned in optimization classes and books isn’t always a good fit for modern deep learning, specifically, normalized nets, which is our term for nets that use any one of the popular normalization schemes, e.g. BatchNorm (BN), GroupNorm (GN), WeightNorm (WN). Today’s post (based upon our paper with Kaifeng Lyu at NeurIPS20) identifies other surprising incompatibilities between normalized nets and traditional analyses.

See SGMCMC.

…

Yellowfin, an automatic SGD momentum tuner.

Mini-batch and stochastic methods for minimising loss when you have a lot of data, or a lot of parameters, and using it all at once is silly, or when you want to iteratively improve your solution as data comes in, and you have access to a gradient for your loss, ideally automatically calculated. It’s not clear at all that it should work, other than by collating all your data and optimising offline, except that much of modern machine learning shows that it does.

Sometimes this apparently stupid trick might even be fast for small-dimensional cases, so you may as well try.

Technically, “online” optimisation in bandit/RL problems might imply that you have to “minimise regret online”, which has a slightly different meaning: e.g. it involves seeing each training example only as it arrives along some notional arrow of time, yet wishing to make the “best” decision at the next time, and possibly choosing your next experiment in order to trade off exploration versus exploitation etc.

In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.
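A sketch of that workflow (the regression problem, batch size, and learning rate are arbitrary choices of mine): shuffle and revisit the data as often as you like, but compute each gradient from only a small batch at a time:

```python
import random

def minibatch_sgd(data, w0=0.0, batch_size=8, epochs=50, lr=0.1):
    """Fit y ~ w * x by least squares, estimating each gradient
    from a small batch rather than the full dataset."""
    w = w0
    for _ in range(epochs):
        random.shuffle(data)  # we may revisit the data in any order
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of the mean squared error on this batch only
            g = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= lr * g
    return w

random.seed(2)
true_w = 1.5
data = []
for _ in range(200):
    x = random.random()
    data.append((x, true_w * x + random.gauss(0.0, 0.1)))

w_hat = minibatch_sgd(data)
print(w_hat)  # close to the true slope 1.5
```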

Some of the same technology pops up in each of these notions of online optimisation, but I am mostly thinking about SGD here.

There are many more permutations and variations used in practice.

Ahn, Sungjin, Anoop Korattikara, and Max Welling. 2012. “Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring.” In *Proceedings of the 29th International Coference on International Conference on Machine Learning*, 1771–78. ICML’12. Madison, WI, USA: Omnipress.

Alexos, Antonios, Alex J. Boyd, and Stephan Mandt. 2022. “Structured Stochastic Gradient MCMC.” In *Proceedings of the 39th International Conference on Machine Learning*, 414–34. PMLR.

Arya, Gaurav, Moritz Schauer, Frank Schäfer, and Christopher Vincent Rackauckas. 2022. “Automatic Differentiation of Programs with Discrete Randomness.” In.

Bach, Francis R., and Eric Moulines. 2013. “Non-Strongly-Convex Smooth Stochastic Approximation with Convergence Rate O(1/n).” In *arXiv:1306.2119 [Cs, Math, Stat]*, 773–81.

Bach, Francis, and Eric Moulines. 2011. “Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.” In *Advances in Neural Information Processing Systems (NIPS)*, –. Spain.

Benaïm, Michel. 1999. “Dynamics of Stochastic Approximation Algorithms.” In *Séminaire de Probabilités de Strasbourg*, 33:1–68. Lecture Notes in Math. Berlin: Springer.

Bensoussan, Alain, Yiqun Li, Dinh Phan Cao Nguyen, Minh-Binh Tran, Sheung Chi Phillip Yam, and Xiang Zhou. 2020. “Machine Learning and Control Theory.” *arXiv:2006.05604 [Cs, Math, Stat]*, June.

Botev, Zdravko I., and Chris J. Lloyd. 2015. “Importance Accelerated Robbins-Monro Recursion with Applications to Parametric Confidence Limits.” *Electronic Journal of Statistics* 9 (2): 2058–75.

Bottou, Léon. 1991. “Stochastic Gradient Learning in Neural Networks.” In *Proceedings of Neuro-Nîmes 91*. Nimes, France: EC2.

———. 1998. “Online Algorithms and Stochastic Approximations.” In *Online Learning and Neural Networks*, edited by David Saad, 17:142. Cambridge, UK: Cambridge University Press.

———. 2010. “Large-Scale Machine Learning with Stochastic Gradient Descent.” In *Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010)*, 177–86. Paris, France: Springer.

Bottou, Léon, and Olivier Bousquet. 2008. “The Tradeoffs of Large Scale Learning.” In *Advances in Neural Information Processing Systems*, edited by J.C. Platt, D. Koller, Y. Singer, and S. Roweis, 20:161–68. NIPS Foundation (http://books.nips.cc).

Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2016. “Optimization Methods for Large-Scale Machine Learning.” *arXiv:1606.04838 [Cs, Math, Stat]*, June.

Bottou, Léon, and Yann LeCun. 2004. “Large Scale Online Learning.” In *Advances in Neural Information Processing Systems 16*, edited by Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf. Cambridge, MA: MIT Press.

Bubeck, Sébastien. 2015. *Convex Optimization: Algorithms and Complexity*. Vol. 8. Foundations and Trends in Machine Learning. Now Publishers.

Cevher, Volkan, Stephen Becker, and Mark Schmidt. 2014. “Convex Optimization for Big Data.” *IEEE Signal Processing Magazine* 31 (5): 32–43.

Chen, Tianqi, Emily Fox, and Carlos Guestrin. 2014. “Stochastic Gradient Hamiltonian Monte Carlo.” In *Proceedings of the 31st International Conference on Machine Learning*, 1683–91. Beijing, China: PMLR.

Chen, Xiaojun. 2012. “Smoothing Methods for Nonsmooth, Nonconvex Minimization.” *Mathematical Programming* 134 (1): 71–99.

Chen, Zaiwei, Shancong Mou, and Siva Theja Maguluri. 2021. “Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization.” arXiv.

Di Giovanni, Francesco, James Rowbottom, Benjamin P. Chamberlain, Thomas Markovich, and Michael M. Bronstein. 2022. “Graph Neural Networks as Gradient Flows.” arXiv.

Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” *arXiv:2012.00152 [Cs, Stat]*, November.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” *Journal of Machine Learning Research* 12 (Jul): 2121–59.

Friedlander, Michael P., and Mark Schmidt. 2012. “Hybrid Deterministic-Stochastic Methods for Data Fitting.” *SIAM Journal on Scientific Computing* 34 (3): A1380–1405.

Ghadimi, Saeed, and Guanghui Lan. 2013a. “Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming.” *SIAM Journal on Optimization* 23 (4): 2341–68.

———. 2013b. “Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming.” *arXiv:1310.3787 [Math]*, October.

Goh, Gabriel. 2017. “Why Momentum Really Works.” *Distill* 2 (4): e6.

Hazan, Elad, Kfir Levy, and Shai Shalev-Shwartz. 2015. “Beyond Convexity: Stochastic Quasi-Convex Optimization.” In *Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1594–1602. Curran Associates, Inc.

Heyde, C. C. 1974. “On Martingale Limit Theory and Strong Convergence Results for Stochastic Approximation Procedures.” *Stochastic Processes and Their Applications* 2 (4): 359–70.

Hu, Chonghai, Weike Pan, and James T. Kwok. 2009. “Accelerated Gradient Methods for Stochastic Optimization and Online Learning.” In *Advances in Neural Information Processing Systems*, 781–89. Curran Associates, Inc.

Jakovetic, D., J.M. Freitas Xavier, and J.M.F. Moura. 2014. “Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks.” *IEEE Transactions on Signal Processing* 62 (4): 868–82.

Kidambi, Rahul, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. 2023. “On the Insufficiency of Existing Momentum Schemes for Stochastic Optimization.” In.

Kingma, Diederik, and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” *Proceeding of ICLR*.

Lai, Tze Leung. 2003. “Stochastic Approximation.” *The Annals of Statistics* 31 (2): 391–406.

Lee, Jason D., Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2017. “First-Order Methods Almost Always Avoid Saddle Points.” *arXiv:1710.07406 [Cs, Math, Stat]*, October.

Lee, Jason D., Max Simchowitz, Michael I. Jordan, and Benjamin Recht. 2016. “Gradient Descent Converges to Minimizers.” *arXiv:1602.04915 [Cs, Math, Stat]*, March.

Liu, Qiang, and Dilin Wang. 2019. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In *Advances In Neural Information Processing Systems*.

Ljung, Lennart, Georg Pflug, and Harro Walk. 1992. *Stochastic Approximation and Optimization of Random Systems*. Basel: Birkhäuser.

Ma, Siyuan, and Mikhail Belkin. 2017. “Diving into the Shallows: A Computational Perspective on Large-Scale Shallow Learning.” *arXiv:1703.10622 [Cs, Stat]*, March.

Maclaurin, Dougal, David Duvenaud, and Ryan P. Adams. 2015. “Early Stopping as Nonparametric Variational Inference.” In *Proceedings of the 19th International Conference on Artificial Intelligence and Statistics*, 1070–77. arXiv.

Mairal, Julien. 2013. “Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization.” In *Advances in Neural Information Processing Systems*, 2283–91.

Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” *JMLR*, April.

McMahan, H. Brendan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, et al. 2013. “Ad Click Prediction: A View from the Trenches.” In *Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 1222–30. KDD ’13. New York, NY, USA: ACM.

Mitliagkas, Ioannis, Ce Zhang, Stefan Hadjis, and Christopher Ré. 2016. “Asynchrony Begets Momentum, with an Application to Deep Learning.” *arXiv:1605.09774 [Cs, Math, Stat]*, May.

Neu, Gergely, Gintare Karolina Dziugaite, Mahdi Haghifam, and Daniel M. Roy. 2021. “Information-Theoretic Generalization Bounds for Stochastic Gradient Descent.” *arXiv:2102.00931 [Cs, Stat]*, August.

Nguyen, Lam M., Jie Liu, Katya Scheinberg, and Martin Takáč. 2017. “Stochastic Recursive Gradient Algorithm for Nonconvex Optimization.” *arXiv:1705.07261 [Cs, Math, Stat]*, May.

Patel, Vivak. 2017. “On SGD’s Failure in Practice: Characterizing and Overcoming Stalling.” *arXiv:1702.00317 [Cs, Math, Stat]*, February.

Polyak, B. T., and A. B. Juditsky. 1992. “Acceleration of Stochastic Approximation by Averaging.” *SIAM Journal on Control and Optimization* 30 (4): 838–55.

Reddi, Sashank J., Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. 2016. “Stochastic Variance Reduction for Nonconvex Optimization.” In *PMLR*, 1603:314–23.

Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” *The Annals of Mathematical Statistics* 22 (3): 400–407.

Robbins, H., and D. Siegmund. 1971. “A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications.” In *Optimizing Methods in Statistics*, edited by Jagdish S. Rustagi, 233–57. Academic Press.

Ruder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” *arXiv:1609.04747 [Cs]*, September.

Sagun, Levent, V. Ugur Guney, Gerard Ben Arous, and Yann LeCun. 2014. “Explorations on High Dimensional Landscapes.” *arXiv:1412.6615 [Cs, Stat]*, December.

Salimans, Tim, and Diederik P Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks.” In *Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 901–1. Curran Associates, Inc.

Shalev-Shwartz, Shai, and Ambuj Tewari. 2011. “Stochastic Methods for L1-Regularized Loss Minimization.” *Journal of Machine Learning Research* 12 (July): 1865–92.

Şimşekli, Umut, Ozan Sener, George Deligiannidis, and Murat A. Erdogdu. 2020. “Hausdorff Dimension, Stochastic Differential Equations, and Generalization in Neural Networks.” *CoRR* abs/2006.09313.

Smith, Samuel L., Benoit Dherin, David Barrett, and Soham De. 2020. “On the Origin of Implicit Regularization in Stochastic Gradient Descent.” In.

Spall, J. C. 2000. “Adaptive Stochastic Approximation by the Simultaneous Perturbation Method.” *IEEE Transactions on Automatic Control* 45 (10): 1839–53.

Vishwanathan, S.V. N., Nicol N. Schraudolph, Mark W. Schmidt, and Kevin P. Murphy. 2006. “Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods.” In *Proceedings of the 23rd International Conference on Machine Learning*.

Welling, Max, and Yee Whye Teh. 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” In *Proceedings of the 28th International Conference on International Conference on Machine Learning*, 681–88. ICML’11. Madison, WI, USA: Omnipress.

Wright, Stephen J., and Benjamin Recht. 2021. *Optimization for Data Analysis*. New York: Cambridge University Press.

Xu, Wei. 2011. “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent.” *arXiv:1107.2490 [Cs]*, July.

Zhang, Xiao, Lingxiao Wang, and Quanquan Gu. 2017. “Stochastic Variance-Reduced Gradient Descent for Low-Rank Matrix Recovery from Linear Measurements.” *arXiv:1701.00481 [Stat]*, January.

Zinkevich, Martin, Markus Weimer, Lihong Li, and Alex J. Smola. 2010. “Parallelized Stochastic Gradient Descent.” In *Advances in Neural Information Processing Systems 23*, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, 2595–2603. Curran Associates, Inc.

\[\renewcommand{\var}{\operatorname{Var}} \renewcommand{\corr}{\operatorname{Corr}} \renewcommand{\dd}{\mathrm{d}} \renewcommand{\bb}[1]{\mathbb{#1}} \renewcommand{\vv}[1]{\boldsymbol{#1}} \renewcommand{\rv}[1]{\mathsf{#1}} \renewcommand{\vrv}[1]{\vv{\rv{#1}}} \renewcommand{\disteq}{\stackrel{d}{=}} \renewcommand{\dif}{\backslash} \renewcommand{\gvn}{\mid} \renewcommand{\Ex}{\mathbb{E}} \renewcommand{\Pr}{\mathbb{P}}\]

Message-passing inference turns out not always to be implementable exactly, but it gives us some basic inspiration. A concrete, implementable version of use to me is Gaussian Belief Propagation. See also graph computations for a more general sense of message passing, and graph NNs for the same idea but with more neural dust sprinkled on it.

The grandparent of Belief Propagation is Pearl (1982) for DAGs, later generalised to various other graphical models. This definition subsumes such diverse methods as the Viterbi and Baum-Welch algorithms and state filters.

Bishop (2006)’s notation is increasingly standard. We can set this up a few ways, but a nice general one is factor graphs. Below is a factor graph; if it makes no sense, maybe dash over to that page and read about ’em. We assume that the factor graphs are trees, meaning that all the variables are connected and there are no loops.

If we are dealing with continuous RVs, this factor graph corresponds to a density function which factorises \[ p(\vv{x})=f_a(x_1)f_b(x_1,x_2)f_d(x_2)f_c(x_2,x_3)f_g(x_3,x_4,x_5)f_k(x_5). \] For reasons of tradition we often assume the variables are discrete at this point and work with sums. I find densities more intuitive.

For the moment we use the mnemonic that \(p\)s are densities and thus normalised, whereas \(f\)s are factors, which are not necessarily normalised (but are non-negative and generate finite Lebesgue measures over their arguments, so they can be normalised).
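
For concreteness, here is a discrete stand-in (my own sketch, not from any of the cited libraries): represent each factor of the running example as a numpy array, form the joint by multiplying them, and observe that even unnormalised factors yield a proper distribution after dividing by the total mass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalised factors from the running example, with K states per variable.
K = 3
f_a = rng.random(K)            # f_a(x1)
f_b = rng.random((K, K))       # f_b(x1, x2)
f_d = rng.random(K)            # f_d(x2)
f_c = rng.random((K, K))       # f_c(x2, x3)
f_g = rng.random((K, K, K))    # f_g(x3, x4, x5)
f_k = rng.random(K)            # f_k(x5)

# The joint is the product of factors; einsum letters i..m index x1..x5.
joint = np.einsum('i,ij,j,jk,klm,m->ijklm', f_a, f_b, f_d, f_c, f_g, f_k)

# The factors are not normalised, but their product has finite total mass,
# so dividing by it gives a proper distribution.
Z = joint.sum()
p = joint / Z
assert np.isclose(p.sum(), 1.0)
```

The sums here play the role of the Lebesgue integrals in the continuous case.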

OK, what can we do with this kind of factorisation? I follow Davison and Ortiz (2019)’s explanation, which is the shortest one I can find that is still comprehensible.

Suppose we want to find the marginal density at some node \[ p(x)=\int p(\vv{x})\dd(\vv{x}\dif x). \] Read \(\dd (\vv{x}\dif x)\) as “an integral over all axes except \(x\)”, e.g. \(\dd (\vv{x}\dif x_1)=\dd x_2 \dd x_3 \dd x_4\dots\)
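
In the discrete analogue, that integral is just a sum over every axis except the one we keep. A tiny numpy illustration (shapes are arbitrary, my own example):

```python
import numpy as np

rng = np.random.default_rng(0)

# A normalised joint distribution over 5 discrete variables.
p = rng.random((2, 3, 2, 3, 2))
p /= p.sum()

# "Integrate over all axes except x_2": sum out every other axis.
p_x2 = p.sum(axis=(0, 2, 3, 4))
assert p_x2.shape == (3,) and np.isclose(p_x2.sum(), 1.0)
```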

We can write the density at \(x\) in terms of a product over the factors connected to it: \[p(\vv{x})=\prod_{s\in n(x)}F_s(x,\vv{x}_s).\] Here \(n(x)\) is the set of factor nodes that are neighbours of \(x\), \(F_{s}\) is the product of all factors in the group associated with \(f_{s}\), and \(\vv{x}_{s}\) is the vector of all variables in the subtree connected to \(x\) via \(f_{s}\). Combining these, \[\begin{aligned} p(x) &=\int \left[\prod_{s \in n(x)} F_{s}\left(x, \vv{x}_{s}\right)\right] \dd(\vv{x}\dif x)\\ &=\prod_{s \in n(x)}\left[\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s)\right]. \end{aligned}\] That second step, reordering the integral and the product, is pretty much our whole justification. How can we express it in terms of operations on a factor graph?

A clue: these inner integrals \(\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s)\) are marginal densities over the single variable \(x\), and the marginalisation has removed the dependence on the other arguments. Looking “through” each integral we see only one argument, so those other arguments are, from where we sit, irrelevant to calculating this marginal. Next we represent this decomposition of integrals and products as messages between nodes, each message marginalising out dependencies and thus removing variables from consideration at our current position.

In doing so we assume that these marginal integrals have convenient analytic solutions, which is great for proving interesting theorems. If that assumption looks dangerous in practice, that is because it should: these integrals cannot, in general, be solved in closed form, and later on we make all kinds of approximations to make this go.^{1}

Anyway, let us ignore tedious calculus problems for now and assume that works. Describing a recipe which actually does this message passing looks confusing because it involves some notation challenges, but it is not awful if we dedicate some time to staring and swearing at it.

There are two types of message needed: factor to variable messages, which will be multiplied to give the marginal distribution of that variable, and variable to factor messages, which are harder to explain punchily but eventually make sense if we stare long enough.

Ortiz et al. explain it thus:

Belief Update:The variable node beliefs are updated by taking a product of the incoming messages from all adjacent factors, each of which represents that factor’s belief on the receiving node’s variables.

Factor-to-variable message:To send a message to an adjacent variable node, a factor aggregates messages from all other adjacent variable nodes and marginalizes over all the other nodes’ variables to produce a message that expresses the factor’s belief over the receiving node’s variables.

Variable-to-factor message:A variable-to-factor message tells the factor what the belief of the variable would be if the receiving factor node did not exist. This is computed by taking the product of the messages the variable node has received from all other factor nodes.

The factor-to-variable message looks like this: \[ \mu_{f_{s} \rightarrow x}(x)=\int F_{s}\left(x, \vv{x}_{s}\right)\dd(\vv{x}_s). \] We think of this term \(\mu_{f_{s} \rightarrow x}(x)\) as a message from factor \(f_{s}\) to variable \(x\). The message has the form of a function of \(x\) only, and corresponds to the marginal probability over \(x\) obtained by considering all factors in one branch of the tree. Thus \[ p(x) = \prod_{s\in n(x)}\mu_{f_{s} \rightarrow x}(x). \] That is the easy bit. Next is the slightly weirder variable-to-factor message: \[\begin{aligned} \mu_{x_{m} \rightarrow f_{s}}\left(x_{m}\right) &=\int \prod_{l \in n\left(x_{m}\right) \backslash f_{s}} F_{l}\left(x_{m}, \vv{x}_{m_{l}}\right) \prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \dd \vv{x}_{m_{l}}\\ &=\prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \int F_{l}\left(x_{m}, \vv{x}_{m_{l}}\right) \dd \vv{x}_{m_{l}}\\ &=\prod_{l \in n\left(x_{m}\right) \backslash f_{s}} \mu_{f_{l} \rightarrow x_{m}}\left(x_{m}\right). \end{aligned}\]
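
To make the two message types concrete, here is a minimal sum-product pass on a discrete chain \(x_1 - f_b - x_2 - f_c - x_3\) with unary factors \(f_a, f_d\) (my own simplification of the running example; sums stand in for integrals). It recovers the same marginal as brute-force summation of the joint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete chain x1 - f_b - x2 - f_c - x3, with unary factors on x1 and x2.
K = 4  # states per variable
f_a = rng.random(K)          # f_a(x1)
f_b = rng.random((K, K))     # f_b(x1, x2)
f_d = rng.random(K)          # f_d(x2)
f_c = rng.random((K, K))     # f_c(x2, x3)

# Brute-force joint, then marginal of x2.
joint = np.einsum('i,ij,j,jk->ijk', f_a, f_b, f_d, f_c)
p2_brute = joint.sum(axis=(0, 2))
p2_brute /= p2_brute.sum()

# Sum-product messages toward x2.
mu_x1_to_fb = f_a                   # x1's only other neighbour is f_a
mu_fb_to_x2 = f_b.T @ mu_x1_to_fb   # sum x1 out of f_b
mu_x3_to_fc = np.ones(K)            # leaf variable sends the empty product: 1s
mu_fc_to_x2 = f_c @ mu_x3_to_fc     # sum x3 out of f_c

# Belief at x2: product of all incoming factor-to-variable messages.
p2_msg = mu_fb_to_x2 * f_d * mu_fc_to_x2
p2_msg /= p2_msg.sum()

assert np.allclose(p2_brute, p2_msg)
```

Note the leaf variable’s message is a vector of ones: the empty product, exactly as the variable-to-factor formula above prescribes.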

TBC.

We can invent a lot of extra machinery here to do belief propagation in ever more sophisticated ways. A classic example is the junction tree, a particular (non-unique) alternative decomposition of the graphical model into a kind of clustered graph. AFAICT most of these extended belief-y algorithms are interesting but complicated, hard to implement in general, and have been useless to me thus far for practical purposes. Exceptions:

- Quantifying independence with these tools leads us to causal inference, which has indeed been useful for me.
- Gaussian Belief Propagation turns out to be what I am writing a paper on right now.

For purely discrete RVs it is OK, but that is a setting I almost never care about, except for generating neat graphs for causal inference thought experiments.

Having expressed all those negative sentiments about practical applications of exact belief propagation, it is nonetheless pedagogically useful to learn at least this much, because it provides a heuristic motivation and structure for variational message passing, which does approximately work, approximately generally.

For the dedicated there are many treatises on this topic in the literature, going deep into the most tedious recesses of the subject; they make virtuous and improving class exercises, but offer diminishing returns in understanding. I would prefer texts which build up this machinery towards modern application (Bishop 2006), or towards profound reinterpretation (Koller and Friedman 2009) if I wanted to put this into a wider perspective.

Pedagogically useful, although probably not industrial-grade: David Barber’s discrete graphical model code (Julia) can do queries over graphical models.

Similar domain, more hype: deepmind/PGMax: Loopy belief propagation for factor graphs on discrete variables in JAX (Zhou et al. 2023).

danbar/fglib: factor graph library

The factor graph library (fglib) is a Python package to simulate message passing on factor graphs. It supports the

- sum-product algorithm (belief propagation)
- max-product algorithm
- max-sum algorithm
- mean-field algorithm - in development

with discrete and Gaussian random variables.

Pure numpy

Krashkov’s summary notebooks

- Belief-Propagation/1-SummaryPGMandBP.ipynb at master · krashkov/Belief-Propagation · GitHub
- Belief-Propagation/2-ImplementationFactor.ipynb at master · krashkov/Belief-Propagation · GitHub
- Belief-Propagation/3-ImplementationPGM.ipynb at master · krashkov/Belief-Propagation · GitHub
- Belief-Propagation/4-ImplementationBP.ipynb at master · krashkov/Belief-Propagation · GitHub

Joe Ortiz’s GBP implementations are elegant. Here are two python ones:

pgmpy is a pure python implementation for Bayesian Networks with a focus on modularity and extensibility. Implementations of various algorithms for Structure Learning, Parameter Estimation, Approximate (Sampling Based) and Exact inference, and Causal Inference are available.

See the Belief Propagation notebook for GBP.

Bishop, Christopher M. 2006.*Pattern Recognition and Machine Learning*. Information Science and Statistics. New York: Springer.

Buntine, W. L. 1994.“Operations for Learning with Graphical Models.”*Journal of Artificial Intelligence Research* 2 (1): 159–225.

Davison, Andrew J., and Joseph Ortiz. 2019.“FutureMapping 2: Gaussian Belief Propagation for Spatial AI.”*arXiv:1910.14139 [Cs]*, October.

Erdogdu, Murat A., Yash Deshpande, and Andrea Montanari. 2017.“Inference in Graphical Models via Semidefinite Programming Hierarchies.”*arXiv:1709.06525 [Cs, Stat]*, September.

Kirkley, Alec, George T. Cantwell, and M. E. J. Newman. 2021.“Belief Propagation for Networks with Loops.”*Science Advances* 7 (17): eabf1211.

Koller, Daphne, and Nir Friedman. 2009.*Probabilistic Graphical Models : Principles and Techniques*. Cambridge, MA: MIT Press.

Liu, Qiang, and Alexander Ihler. 2011.“Bounding the Partition Function Using Hölder’s Inequality.” In*Proceedings of the 28th International Conference on International Conference on Machine Learning*, 849–56. ICML’11. Madison, WI, USA: Omnipress.

Noorshams, Nima, and Martin J. Wainwright. 2013.“Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees.”*The Journal of Machine Learning Research* 14 (1): 2799–2835.

Ortiz, Joseph, Talfan Evans, and Andrew J. Davison. 2021.“A Visual Introduction to Gaussian Belief Propagation.”*arXiv:2107.02308 [Cs]*, July.

Pearl, Judea. 1982.“Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach.” In*Proceedings of the Second AAAI Conference on Artificial Intelligence*, 133–36. AAAI’82. Pittsburgh, Pennsylvania: AAAI Press.

———. 1986.“Fusion, Propagation, and Structuring in Belief Networks.”*Artificial Intelligence* 29 (3): 241–88.

Zhou, Guangyao, Antoine Dedieu, Nishanth Kumar, Wolfgang Lehrach, Miguel Lázaro-Gredilla, Shrinu Kushagra, and Dileep George. 2023.“PGMax: Factor Graphs for Discrete Probabilistic Graphical Models and Loopy Belief Propagation in JAX.” arXiv.

Question: does anyone ever use a large numerical solver for this? What breaks in that case, apart from the messages growing embarrassingly large?↩︎

William Press, Canonical Correlation Clarified by Singular Value Decomposition
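
Press’s observation can be sketched in a few lines: after centring, the canonical correlations between two data blocks are the singular values of \(U_X^\top U_Y\), where \(U_X, U_Y\) come from the thin SVDs of each block. A toy check on synthetic data of my own devising:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two noisy views of a shared 2-dimensional latent signal.
n = 2000
z = rng.normal(size=(n, 2))
X = z @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(n, 5))
Y = z @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(n, 4))

def cca_svd(X, Y):
    # Centre, then orthonormalise each block via its own thin SVD.
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Ux = np.linalg.svd(X, full_matrices=False)[0]
    Uy = np.linalg.svd(Y, full_matrices=False)[0]
    # Canonical correlations are the singular values of Ux.T @ Uy.
    return np.linalg.svd(Ux.T @ Uy, compute_uv=False)

rho = cca_svd(X, Y)
assert np.all(rho <= 1 + 1e-9) and rho[0] > rho[-1]
```

With two shared latent dimensions, the two leading canonical correlations come out large and the rest small, as expected.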

Most statistical tests are canonical correlation analysis, apparently (Knapp 1978).

**tl;dr** classic statistical tests are linear models in which the goal is to decide whether a coefficient should be regarded as non-zero or not.
Jonas Kristoffer Lindeløv explains this perspective: Common statistical tests are linear models.
FWIW I found that perspective to be a real 💡 moment.
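
Lindeløv’s point is easy to verify for one case: the classic pooled-variance two-sample t statistic coincides exactly with the t statistic on the group-dummy coefficient of a linear model. A numpy-only sketch on synthetic data of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, 40)   # group A
b = rng.normal(0.5, 1.0, 40)   # group B, shifted mean

# Classic pooled two-sample t statistic.
na, nb = len(a), len(b)
sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
t_classic = (b.mean() - a.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))

# The same test as a linear model y = b0 + b1 * group.
y = np.concatenate([a, b])
Xd = np.column_stack([np.ones(na + nb), np.r_[np.zeros(na), np.ones(nb)]])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
rss = np.sum((y - Xd @ beta) ** 2)
sigma2 = rss / (na + nb - 2)
se_b1 = np.sqrt(sigma2 * np.linalg.inv(Xd.T @ Xd)[1, 1])
t_lm = beta[1] / se_b1

assert np.isclose(t_classic, t_lm)
```

The agreement is exact, not approximate: the two formulas are algebraically the same quantity.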

Allen-Zhu, Zeyuan, and Yuanzhi Li. 2017.“Doubly Accelerated Methods for Faster CCA and Generalized Eigendecomposition.” In*PMLR*, 98–106.

Bach, Francis R, and Michael I Jordan. 2002.“Kernel Independent Component Analysis.”*Journal of Machine Learning Research* 3 (July): 48.

Borello, Melinda. 2013.“Standardization and Singular Value Decomposition in Canonical Correlation Analysis.”

Cherry, Steve. 1996.“Singular Value Decomposition Analysis and Canonical Correlation Analysis.”*Journal of Climate* 9 (9): 2003–9.

Cichocki, A., N. Lee, I. V. Oseledets, A.-H. Phan, Q. Zhao, and D. Mandic. 2016.“Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1.”*arXiv:1609.00893 [Cs]*, September.

Ewerbring, L. Magnus, and Franklin T. Luk. 1989.“Canonical Correlations and Generalized SVD: Applications and New Algorithms.”*Journal of Computational and Applied Mathematics*, Special Issue on Parallel Algorithms for Numerical Linear Algebra, 27 (1): 37–52.

Horváth, Lajos, and Piotr Kokoszka. 2012.*Inference for functional data with applications*. Vol. 200. Springer series in statistics. New York: Springer.

Knapp, Thomas R. 1978.“Canonical Correlation Analysis: A General Parametric Significance-Testing System.”*Psychological Bulletin* 85 (2): 410.

Lopez-Paz, David, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. 2014.“Randomized Nonlinear Component Analysis.”*arXiv:1402.0119 [Cs, Stat]*, February.

Ramsay, Jim O., and B.W Silverman. 2005.*Functional Data Analysis*. Springer Series in Statistics. New York: Springer-Verlag.

Witten, Daniela M., Robert Tibshirani, and Trevor Hastie. 2009.“A Penalized Matrix Decomposition, with Applications to Sparse Principal Components and Canonical Correlation Analysis.”*Biostatistics*, January, kxp008.

Witten, Daniela M, and Robert J. Tibshirani. 2009.“Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data.”*Statistical Applications in Genetics and Molecular Biology* 8 (1): 1–27.

Yang, Yanrong, and Guangming Pan. 2015.“Independence Test for High Dimensional Data Based on Regularized Canonical Correlation Coefficients.”*The Annals of Statistics* 43 (2).

Applying a causal graph structure in the challenging environment of a no-holds-barred nonparametric machine learning algorithm such as a neural net or its ilk. I am interested in this because it seems necessary, and kind of obvious, for handling things like dataset shift, yet it is often ignored. What is that about?

I do not know at the moment. This is a link salad for now.

Léon Bottou, From Causal Graphs to Causal Invariance:

For many problems, it’s difficult to even attempt drawing a causal graph. While structural causal models provide a complete framework for causal inference, it is often hard to encode known physical laws (such as Newton’s gravitation, or the ideal gas law) as causal graphs. In familiar machine learning territory, how does one model the causal relationships between individual pixels and a target prediction? This is one of the motivating questions behind the paper Invariant Risk Minimization (IRM). In place of structured graphs, the authors elevate invariance to the defining feature of causality.

He commends the Cloudera Fast Forward tutorial Causality for Machine Learning, which is a nice bit of applied work.

There is a fun body of work by what is in my mind the Central European causality-ML think tank. There is some high connectivity between various interesting people: Bernhard Schölkopf, Jonas Peters, Joris Mooij, Stephan Bongers and Dominik Janzing etc. I would love to understand everything that is going on with their outputs, particularly as regards causality in feedback and control systems. Perhaps I should start with the book (Peters, Janzing, and Schölkopf 2017) (free PDF), or the chatty casual introduction (Schölkopf 2022).

For a good explanation of what they are about by example, see Bernhard Schölkopf: Causality and Exoplanets.

I am particularly curious about their work in causality in continuous fields, e.g. Bongers et al. (2020); Bongers and Mooij (2018); Bongers et al. (2016); Rubenstein et al. (2018).

Künzel et al. (2019) (HT Mike McKenna) looks interesting: it is a generic intervention estimator for ML methods (AFAICT this extends the double regression/instrumental variables approach).

… We describe a number of metaalgorithms that can take advantage of any supervised learning or regression method in machine learning and statistics to estimate theconditional average treatment effect (CATE) function. Metaalgorithms build on base algorithms—such as random forests (RFs), Bayesian additive regression trees (BARTs), or neural networks—to estimate the CATE, a function that the base algorithms are not designed to estimate directly. We introduce a metaalgorithm, the X-learner, that is provably efficient when the number of units in one treatment group is much larger than in the other and can exploit structural properties of the CATE function. For example, if the CATE function is linear and the response functions in treatment and control are Lipschitz-continuous, the X-learner can still achieve the parametric rate under regularity conditions. We then introduce versions of the X-learner that use RF and BART as base learners. In extensive simulation studies, the X-learner performs favorably, although none of the metalearners is uniformly the best. In two persuasion field experiments from political science, we demonstrate how our X-learner can be used to target treatment regimes and to shed light on underlying mechanisms.
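
As a rough illustration of the X-learner recipe described in that abstract, here is my own toy sketch with OLS base learners and a known propensity score (not the authors’ implementation, which uses RF and BART base learners):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: linear outcome, heterogeneous effect tau(x) = 1 + x,
# with the treated group much larger than the control group.
n = 4000
x = rng.normal(size=n)
t = rng.random(n) < 0.8
y = 2 * x + t * (1 + x) + 0.1 * rng.normal(size=n)

def fit_lin(x, y):
    # Base learner: ordinary least squares on [1, x].
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda xq: b[0] + b[1] * xq

# Stage 1: outcome models per arm.
mu0 = fit_lin(x[~t], y[~t])
mu1 = fit_lin(x[t], y[t])

# Stage 2: imputed individual effects, then CATE models per arm.
tau1 = fit_lin(x[t], y[t] - mu0(x[t]))
tau0 = fit_lin(x[~t], mu1(x[~t]) - y[~t])

# Stage 3: blend the two CATE models with the propensity score (known: 0.8).
g = 0.8
cate = lambda xq: g * tau0(xq) + (1 - g) * tau1(xq)

assert abs(cate(0.0) - 1.0) < 0.1 and abs(cate(1.0) - 2.0) < 0.15
```

The blend weights the CATE model fitted on the *smaller* arm’s imputed effects more heavily, which is the trick that makes the X-learner efficient under unbalanced treatment groups.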

See also Mishler and Kennedy (2021). Maybe related: Shalit, Johansson, and Sontag (2017); Shi, Blei, and Veitch (2019).

Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in such systems present a number of major challenges for causal discovery techniques and it is largely unknown which methods perform best for which challenge.

The CauseMe platform provides ground truth benchmark datasets featuring different real data challenges to assess and compare the performance of causal discovery methods. The available benchmark datasets are either generated from synthetic models mimicking real challenges, or are real world data sets where the causal structure is known with high confidence. The datasets vary in dimensionality, complexity and sophistication.

Nisha Muktewar and Chris Wallace, Causality for Machine Learning, is the book Bottou recommends on this theme.

For coders, Ben Dickson writes on Why machine learning struggles with causality.

Cheng Soon Ong recommends Finn Lattimore to me as an important perspective.

biomedia-mira/deepscm: Repository for Deep Structural Causal Models for Tractable Counterfactual Inference (Pawlowski, Coelho de Castro, and Glocker 2020).

Arjovsky, Martin, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. 2020.“Invariant Risk Minimization.” arXiv.

Athey, Susan, and Stefan Wager. 2019.“Estimating Treatment Effects with Causal Forests: An Application.”*arXiv:1902.07409 [Stat]*, February.

Bareinboim, Elias, Juan D. Correa, Duligur Ibeling, and Thomas Icard. 2022.“On Pearl’s Hierarchy and the Foundations of Causal Inference.” In*Probabilistic and Causal Inference: The Works of Judea Pearl*, 1st ed., 36:507–56. New York, NY, USA: Association for Computing Machinery.

Besserve, Michel, Arash Mehrjou, Rémy Sun, and Bernhard Schölkopf. 2019.“Counterfactuals Uncover the Modular Structure of Deep Generative Models.” In*arXiv:1812.03253 [Cs, Stat]*.

Bishop, J. Mark. 2021.“Artificial Intelligence Is Stupid and Causal Reasoning Will Not Fix It.”*Frontiers in Psychology* 11.

Bongers, Stephan, Patrick Forré, Jonas Peters, Bernhard Schölkopf, and Joris M. Mooij. 2020.“Foundations of Structural Causal Models with Cycles and Latent Variables.”*arXiv:1611.06221 [Cs, Stat]*, October.

Bongers, Stephan, and Joris M. Mooij. 2018.“From Random Differential Equations to Structural Causal Models: The Stochastic Case.”*arXiv:1803.08784 [Cs, Stat]*, March.

Bongers, Stephan, Jonas Peters, Bernhard Schölkopf, and Joris M. Mooij. 2016.“Structural Causal Models: Cycles, Marginalizations, Exogenous Reparametrizations and Reductions.”*arXiv:1611.06221 [Cs, Stat]*, November.

Christiansen, Rune, Niklas Pfister, Martin Emil Jakobsen, Nicola Gnecco, and Jonas Peters. 2020.“A Causal Framework for Distribution Generalization,” June.

Fernández-Loría, Carlos, and Foster Provost. 2021.“Causal Decision Making and Causal Effect Estimation Are Not the Same… and Why It Matters.”*arXiv:2104.04103 [Cs, Stat]*, September.

Friedrich, Sarah, Gerd Antes, Sigrid Behr, Harald Binder, Werner Brannath, Florian Dumpert, Katja Ickstadt, et al. 2020.“Is There a Role for Statistics in Artificial Intelligence?”*arXiv:2009.09070 [Cs]*, September.

Gendron, Gaël, Michael Witbrock, and Gillian Dobbie. 2023.“A Survey of Methods, Challenges and Perspectives in Causality.” arXiv.

Goyal, Anirudh, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, and Bernhard Schölkopf. 2020.“Recurrent Independent Mechanisms.”*arXiv:1909.10893 [Cs, Stat]*, November.

Hartford, Jason, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. 2017.“Deep IV: A Flexible Approach for Counterfactual Prediction.” In*Proceedings of the 34th International Conference on Machine Learning*, 1414–23. PMLR.

Huang, Yu, Zuntao Fu, and Christian L. E. Franzke. 2020.“Detecting Causality from Time Series in a Machine Learning Framework.”*Chaos: An Interdisciplinary Journal of Nonlinear Science* 30 (6): 063116.

Johnson, Matthew J, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. 2016.“Composing Graphical Models with Neural Networks for Structured Representations and Fast Inference.” In*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2946–54. Curran Associates, Inc.

Jordan, Michael I., Yixin Wang, and Angela Zhou. 2022.“Empirical Gateaux Derivatives for Causal Inference.” arXiv.

Kaddour, Jean, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. 2022.“Causal Machine Learning: A Survey and Open Problems.” arXiv.

Karimi, Amir-Hossein, Gilles Barthe, Bernhard Schölkopf, and Isabel Valera. 2021.“A Survey of Algorithmic Recourse: Definitions, Formulations, Solutions, and Prospects.” arXiv.

Karimi, Amir-Hossein, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, and Been Kim. 2022.“On the Relationship Between Explanation and Prediction: A Causal View.” arXiv.

Kirk, Robert, Amy Zhang, Edward Grefenstette, and Tim Rocktäschel. 2023.“A Survey of Zero-Shot Generalisation in Deep Reinforcement Learning.”*Journal of Artificial Intelligence Research* 76 (January): 201–64.

Kocaoglu, Murat, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. 2017.“CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training.”*arXiv:1709.02023 [Cs, Math, Stat]*, September.

Kosoy, Eliza, David M. Chan, Adrian Liu, Jasmine Collins, Bryanna Kaufmann, Sandy Han Huang, Jessica B. Hamrick, John Canny, Nan Rosemary Ke, and Alison Gopnik. 2022.“Towards Understanding How Machines Can Learn Causal Overhypotheses.” arXiv.

Künzel, Sören R., Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. 2019.“Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.”*Proceedings of the National Academy of Sciences* 116 (10): 4156–65.

Lattimore, Finnian Rachel. 2017.“Learning How to Act: Making Good Decisions with Machine Learning.”

Leeb, Felix, Guilia Lanzillotta, Yashas Annadani, Michel Besserve, Stefan Bauer, and Bernhard Schölkopf. 2021.“Structure by Architecture: Disentangled Representations Without Regularization.”*arXiv:2006.07796 [Cs, Stat]*, July.

Li, Lu, Yongjiu Dai, Wei Shangguan, Zhongwang Wei, Nan Wei, and Qingliang Li. 2022.“Causality-Structured Deep Learning for Soil Moisture Predictions.”*Journal of Hydrometeorology* 23 (8): 1315–31.

Locatello, Francesco, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019.“Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations.”*arXiv:1811.12359 [Cs, Stat]*, June.

Locatello, Francesco, Ben Poole, Gunnar Raetsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. 2020.“Weakly-Supervised Disentanglement Without Compromises.” In*Proceedings of the 37th International Conference on Machine Learning*, 119:6348–59. PMLR.

Louizos, Christos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017.“Causal Effect Inference with Deep Latent-Variable Models.” In*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 6446–56. Curran Associates, Inc.

Lu, Chaochao, Yuhuai Wu, Jośe Miguel Hernández-Lobato, and Bernhard Schölkopf. 2021.“Nonlinear Invariant Risk Minimization: A Causal Approach.”*arXiv:2102.12353 [Cs, Stat]*, June.

Mehta, Raghav, Vítor Albiero, Li Chen, Ivan Evtimov, Tamar Glaser, Zhiheng Li, and Tal Hassner. 2022.“You Only Need a Good Embeddings Extractor to Fix Spurious Correlations.” arXiv.

Melnychuk, Valentyn, Dennis Frauen, and Stefan Feuerriegel. 2022.“Causal Transformer for Estimating Counterfactual Outcomes.” arXiv.

Mishler, Alan, and Edward Kennedy. 2021.“FADE: FAir Double Ensemble Learning for Observable and Counterfactual Outcomes.”*arXiv:2109.00173 [Cs, Stat]*, August.

Mooij, Joris M., Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. 2016.“Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks.”*Journal of Machine Learning Research* 17 (32): 1–102.

Ng, Ignavier, Zhuangyan Fang, Shengyu Zhu, Zhitang Chen, and Jun Wang. 2020.“Masked Gradient-Based Causal Structure Learning.”*arXiv:1910.08527 [Cs, Stat]*, February.

Ng, Ignavier, Shengyu Zhu, Zhitang Chen, and Zhuangyan Fang. 2019.“A Graph Autoencoder Approach to Causal Structure Learning.” In*Advances In Neural Information Processing Systems*.

Ortega, Pedro A., Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, et al. 2021.“Shaking the Foundations: Delusions in Sequence Models for Interaction and Control.”*arXiv:2110.10819 [Cs]*, October.

Pawlowski, Nick, Daniel Coelho de Castro, and Ben Glocker. 2020.“Deep Structural Causal Models for Tractable Counterfactual Inference.” In*Advances in Neural Information Processing Systems*, 33:857–69. Curran Associates, Inc.

Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf. 2017.*Elements of Causal Inference: Foundations and Learning Algorithms*. Adaptive Computation and Machine Learning Series. Cambridge, Massachuestts: The MIT Press.

Poulos, Jason, and Shuxi Zeng. 2021.“RNN-Based Counterfactual Prediction, with an Application to Homestead Policy and Public Schooling.”*Journal of the Royal Statistical Society Series C: Applied Statistics* 70 (4): 1124–39.

Rakesh, Vineeth, Ruocheng Guo, Raha Moraffah, Nitin Agarwal, and Huan Liu. 2018.“Linked Causal Variational Autoencoder for Inferring Paired Spillover Effects.” In*Proceedings of the 27th ACM International Conference on Information and Knowledge Management*, 1679–82. CIKM ’18. New York, NY, USA: Association for Computing Machinery.

Richardson, Thomas S., and James M. Robins. 2013.“Single World Intervention Graphs (SWIGs): A Unification of the Counterfactual and Graphical Approaches to Causality.” Citeseer.

Roscher, Ribana, Bastian Bohn, Marco F. Duarte, and Jochen Garcke. 2020.“Explainable Machine Learning for Scientific Insights and Discoveries.”*IEEE Access* 8: 42200–42216.

Rotnitzky, Andrea, and Ezequiel Smucler. 2020.“Efficient Adjustment Sets for Population Average Causal Treatment Effect Estimation in Graphical Models.”*Journal of Machine Learning Research* 21 (188): 1–86.

Rubenstein, Paul K., Stephan Bongers, Bernhard Schölkopf, and Joris M. Mooij. 2018.“From Deterministic ODEs to Dynamic Structural Causal Models.” In*Uncertainty in Artificial Intelligence*.

Runge, Jakob, Sebastian Bathiany, Erik Bollt, Gustau Camps-Valls, Dim Coumou, Ethan Deyle, Clark Glymour, et al. 2019.“Inferring Causation from Time Series in Earth System Sciences.”*Nature Communications* 10 (1): 2553.

Schölkopf, Bernhard. 2022.“Causality for Machine Learning.” In*Probabilistic and Causal Inference: The Works of Judea Pearl*, 1st ed., 36:765–804. New York, NY, USA: Association for Computing Machinery.


Constructing a backward (P)DE which effectively gives us the gradients of the forward (P)DE. A trick in automatic differentiation which happens to be useful in differentiating likelihoods (or other functions) of time-evolving systems. e.g. (Errico 1997; Kidger, Chen, and Lyons 2021; Kidger et al. 2020; Li et al. 2020; Rackauckas et al. 2018; Stapor, Fröhlich, and Hasenauer 2018; Cao et al. 2003).
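To make the trick concrete, here is a minimal sketch (my own, not from any of the cited papers) of the discrete adjoint for an explicit-Euler solve of \(\dot x=-\theta x\) with terminal loss \(L=x_N^2\): a forward sweep stores the trajectory, a backward sweep propagates the adjoint variable, and the accumulated gradient matches a finite-difference check.

```python
import numpy as np

def loss_and_grad(theta, x0=1.0, h=0.01, N=100):
    # Forward pass: explicit Euler for dx/dt = -theta * x, storing the trajectory
    xs = [x0]
    for _ in range(N):
        xs.append(xs[-1] * (1.0 - theta * h))
    L = xs[-1] ** 2  # terminal loss

    # Backward (adjoint) pass: lam carries dL/dx_n back through the steps
    lam = 2.0 * xs[-1]
    grad = 0.0
    for n in range(N - 1, -1, -1):
        # x_{n+1} = (1 - theta*h) * x_n, so d x_{n+1} / d theta = -h * x_n
        grad += lam * (-h * xs[n])
        lam *= 1.0 - theta * h
    return L, grad

L, g = loss_and_grad(0.5)
# Sanity check against central finite differences
eps = 1e-6
fd = (loss_and_grad(0.5 + eps)[0] - loss_and_grad(0.5 - eps)[0]) / (2 * eps)
assert abs(g - fd) < 1e-5
```

The same two-sweep structure is what the continuous adjoint methods in the references below do at the level of the (P)DE itself, before discretisation.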

Cao, Y., S. Li, L. Petzold, and R. Serban. 2003.“Adjoint Sensitivity Analysis for Differential-Algebraic Equations: The Adjoint DAE System and Its Numerical Solution.”*SIAM Journal on Scientific Computing* 24 (3): 1076–89.

Carpenter, Bob, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. 2015.“The Stan Math Library: Reverse-Mode Automatic Differentiation in C++.”*arXiv Preprint arXiv:1509.07164*.

Errico, Ronald M. 1997.“What Is an Adjoint Model?”*Bulletin of the American Meteorological Society* 78 (11): 2577–92.

Gahungu, Paterne, Christopher W. Lanyon, Mauricio A. Álvarez, Engineer Bainomugisha, Michael Thomas Smith, and Richard David Wilkinson. 2022.“Adjoint-Aided Inference of Gaussian Process Driven Differential Equations.” In.

Giles, Mike B. 2008.“Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation.” In*Advances in Automatic Differentiation*, edited by Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke, 64:35–44. Berlin, Heidelberg: Springer Berlin Heidelberg.

Innes, Michael. 2018.“Don’t Unroll Adjoint: Differentiating SSA-Form Programs.”*arXiv:1810.07951 [Cs]*, October.

Ionescu, Catalin, Orestis Vantzos, and Cristian Sminchisescu. 2016.“Training Deep Networks with Structured Layers by Matrix Backpropagation.” arXiv.

Johnson, Steven G. 2012.“Notes on Adjoint Methods for 18.335,” 6.

Kavvadias, I. S., E. M. Papoutsis-Kiachagias, and K. C. Giannakoglou. 2015.“On the Proper Treatment of Grid Sensitivities in Continuous Adjoint Methods for Shape Optimization.”*Journal of Computational Physics* 301 (November): 1–18.

Kidger, Patrick, Ricky T. Q. Chen, and Terry J. Lyons. 2021.“‘Hey, That’s Not an ODE’: Faster ODE Adjoints via Seminorms.” In*Proceedings of the 38th International Conference on Machine Learning*, 5443–52. PMLR.

Kidger, Patrick, James Morrill, James Foster, and Terry Lyons. 2020.“Neural Controlled Differential Equations for Irregular Time Series.”*arXiv:2005.08926 [Cs, Stat]*, November.

Li, Xuechen, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. 2020.“Scalable Gradients for Stochastic Differential Equations.” In*International Conference on Artificial Intelligence and Statistics*, 3870–82. PMLR.

Margossian, Charles C., Aki Vehtari, Daniel Simpson, and Raj Agrawal. 2020.“Hamiltonian Monte Carlo Using an Adjoint-Differentiated Laplace Approximation: Bayesian Inference for Latent Gaussian Models and Beyond.”*arXiv:2004.12550 [Stat]*, October.

Mitusch, Sebastian K., Simon W. Funke, and Jørgen S. Dokken. 2019.“Dolfin-Adjoint 2018.1: Automated Adjoints for FEniCS and Firedrake.”*Journal of Open Source Software* 4 (38): 1292.

Papoutsis-Kiachagias, E. M., and K. C. Giannakoglou. 2016.“Continuous Adjoint Methods for Turbulent Flows, Applied to Shape and Topology Optimization: Industrial Applications.”*Archives of Computational Methods in Engineering* 23 (2): 255–99.

Papoutsis-Kiachagias, E. M., N. Magoulas, J. Mueller, C. Othmer, and K. C. Giannakoglou. 2015.“Noise Reduction in Car Aerodynamics Using a Surrogate Objective Function and the Continuous Adjoint Method with Wall Functions.”*Computers & Fluids* 122 (November): 223–32.

Papoutsis-Kiachagias, Evangelos. 2013.“Adjoint Methods for Turbulent Flows, Applied to Shape or Topology Optimization and Robust Design.”

Rackauckas, Christopher, Yingbo Ma, Vaibhav Dixit, Xingjian Guo, Mike Innes, Jarrett Revels, Joakim Nyberg, and Vijay Ivaturi. 2018.“A Comparison of Automatic Differentiation and Continuous Sensitivity Analysis for Derivatives of Differential Equation Solutions.”*arXiv:1812.01892 [Cs]*, December.

Stapor, Paul, Fabian Fröhlich, and Jan Hasenauer. 2018.“Optimization and Uncertainty Analysis of ODE Models Using 2nd Order Adjoint Sensitivity Analysis.”*bioRxiv*, February, 272005.

Suppose we are keen to devise yet another method that will do clever things to augment PDE solvers with ML somehow.
To that end it would be nice to have a PDE solver that was not a completely black box, but which we could interrogate for useful gradients.
Obviously all PDE solvers *use* gradient information, but only some of them expose it to us users as a first-class feature.
e.g. MODFLOW will give me a solution field but not the gradients that were used to calculate that field.
In ML toolkits accessing this information is often easy; many of them supply an API to access adjoints.

OTOH, there is a lot of sophisticated work done by PDE solvers that is hard for ML toolkits to recreate. That is why PDE solvers are a thing.
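As a toy illustration of that first-class-gradient convenience (entirely my own sketch, not any particular solver’s API): in jax, differentiating an energy-like diagnostic of a finite-difference heat-equation solve with respect to the diffusivity is one extra line.

```python
import jax
import jax.numpy as jnp

def heat_step(u, kappa, dx=0.1, dt=0.001):
    # One explicit finite-difference step of u_t = kappa * u_xx, periodic BCs
    lap = (jnp.roll(u, 1) - 2.0 * u + jnp.roll(u, -1)) / dx**2
    return u + dt * kappa * lap

def solve(kappa, u0, steps=50):
    u = u0
    for _ in range(steps):
        u = heat_step(u, kappa)
    return u

u0 = jnp.sin(jnp.linspace(0.0, 2.0 * jnp.pi, 64, endpoint=False))
# Gradients are first-class: sensitivity of an energy-like diagnostic
# of the solution field with respect to the diffusivity kappa.
energy = lambda kappa: jnp.sum(solve(kappa, u0) ** 2)
g = jax.grad(energy)(0.5)  # negative: more diffusion, less energy
```

A classic production solver would give us the final field `u` but rarely this `g`.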

Classic PDE solvers which combine gradient-based (with respect to inputs) inference and ML outputs do exist. Differentiable solvers for PDEs combine high fidelity to the supposed laws of physics governing the system under investigation with the potential to do some kind of inference through them. Obviously they involve various approximations to the “ground truth” in practice; reality must be discretized and simplified to fit in the machine. However, the kinds of simplifications that these solvers make are by convention unthreatening; we are generally not too concerned about the discretisation implied by a finite-element mesh, or the cells in a finite-difference scheme, or the waves in a spectral method, which we can often prove are “not too far” from what we would find in a physical system which truly followed the laws we hope it does. On the other hand, we might be suspicious that the laws of physics we do know truly give a complete characterisation of the system (pro tip: they do not), so these solvers might give us undue confidence that their idealized fidelity makes them accurate for the real world, rather than for mathematical idealisations of it. Further, these solvers often buy fidelity at the price of speed. Empirically, we discover that we can get a nearly-as-good result from an ML method far faster than from a classic PDE solver.

There is a role for both kinds of approaches; in fact there is a burgeoning industry in stitching them together, playing off the relative strengths of each.

TBD: discuss how sufficiently flexible PDE solvers allow us to solve the adjoint equation and “manually” differentiate

Good. For now, some alternatives.

PhiFlow: A differentiable PDE solving framework for machine learning (Holl et al. 2020).

I use this a lot, so it has its own notebook.

Unremarkable name, looks handy though. Implements both projection methods and spectral methods, and different variations of Crank–Nicolson for fluid-dynamical models. Seems to imply periodic boundary conditions?

OrchardLANL/DPFEHM.jl: DPFEHM: A Differentiable Subsurface Flow Simulator

DPFEHM is a Julia module that includes differentiable numerical models with a focus on fluid flow and transport in the Earth’s subsurface. Currently it supports the groundwater flow equations (single phase flow), Richards equation (air/water), the advection-dispersion equation, and the 2d wave equation.

Does not seem to support CUDA well but is nifty. Used in e.g. Pachalieva et al. (2022). Inverse solver example. NN example.

Trixi.jl is a numerical simulation framework for hyperbolic conservation laws written in Julia. A key objective for the framework is to be useful to both scientists and students. Therefore, next to having an extensible design with a fast implementation, Trixi is focused on being easy to use for new or inexperienced users, including the installation and postprocessing procedures. Its features include:

- 1D, 2D, and 3D simulations on line/quad/hex/simplex meshes
- Cartesian and curvilinear meshes
- Conforming and non-conforming meshes
- Structured and unstructured meshes
- Hierarchical quadtree/octree grid with adaptive mesh refinement
- Forests of quadtrees/octrees with p4est via P4est.jl
- High-order accuracy in space and time
- Discontinuous Galerkin methods
- Kinetic energy-preserving and entropy-stable methods based on flux differencing
- Entropy-stable shock capturing
- Positivity-preserving limiting
- Finite difference summation by parts (SBP) methods
- Compatible with the SciML ecosystem for ordinary differential equations
- Explicit low-storage Runge-Kutta time integration
- Strong stability preserving methods
- CFL-based and error-based time step control
- Native support for differentiable programming
- Forward mode automatic differentiation via ForwardDiff.jl
- Periodic and weakly-enforced boundary conditions

TorchPhysics is a Python library of (mesh-free) deep learning methods to solve differential equations. You can use TorchPhysics e.g. to

- solve ordinary and partial differential equations
- train a neural network to approximate solutions for different parameters
- solve inverse problems and interpolate external data

Much NN automation in this library, but the non-NN part is bare-bones FEM stuff.

Using Bayesian probabilistic numerics? TBD. See tornadox (Krämer et al. 2021).

ADCME is suitable for conducting inverse modeling in scientific computing; specifically, ADCME targets physics informed machine learning, which leverages machine learning techniques to solve challenging scientific computing problems. The purpose of the package is to:

- provide differentiable programming framework for scientific computing based on TensorFlow automatic differentiation (AD) backend;
- adapt syntax to facilitate implementing scientific computing, particularly for numerical PDE discretization schemes;
- supply missing functionalities in the backend (TensorFlow) that are important for engineering, such as sparse linear algebra, constrained optimization, etc.
Applications include

- physics informed machine learning (a.k.a., scientific machine learning, physics informed learning, etc.)
- coupled hydrological and full waveform inversion
- constitutive modeling in solid mechanics
- learning hidden geophysical dynamics
- parameter estimation in stochastic processes
The package inherits the scalability and efficiency from the well-optimized backend TensorFlow. Meanwhile, it provides access to incorporate existing C/C++ codes via the custom operators. For example, some functionalities for sparse matrices are implemented in this way and serve as extendable “plugins” for ADCME.

OpenFOAM (for “Open-source Field Operation And Manipulation”) is a C++ toolbox for the development of customized numerical solvers, and pre-/post-processing utilities for the solution of continuum mechanics problems, most prominently including computational fluid dynamics (CFD).

The adjoint optimisation takes some digging to discover.
Keyword:`adjointOptimisationFoam`

(Papoutsis-Kiachagias et al. 2021).

JuliaFEM is an umbrella organisation supporting Julia-backed FEM solvers. The documentation is tricksy, but check out the examples; supported solvers are listed here. I assume these are all differentiable, since that is a selling point of the SciML.jl ecosystem they spring from, but I have not checked. The emphasis seems to be upon cluster-distributed solutions at scale.

Also seems to be a friendly PDE solver, lacking in GPU support. However, it does have an interface to pytorch, barkm/torch-fenics, on the CPU to provide differentiability with respect to parameters.

dolfin-adjoint(Mitusch, Funke, and Dokken 2019):

The dolfin-adjoint project automatically

derives the discrete adjoint and tangent linear models from a forward model written in the Python interface to FEniCS and Firedrake. These adjoint and tangent linear models are key ingredients in many important algorithms, such as data assimilation, optimal control, sensitivity analysis, design optimisation, and error estimation. Such models have made an enormous impact in fields such as meteorology and oceanography, but their use in other scientific fields has been hampered by the great practical difficulty of their derivation and implementation. In his recent book, Naumann (2011) states that

[T]he automatic generation of optimal (in terms of robustness and efficiency) adjoint versions of large-scale simulation code is one of the great open challenges in the field of High-Performance Scientific Computing.

The dolfin-adjoint project aims to solve this problem for the case where the model is implemented in the Python interface to FEniCS/Firedrake.

This provides the AD backend to barkm/torch-fenics, which integrates with pytorch.

HELYX is a unified, off-the-shelf CFD software product compatible with most Linux and Windows platforms, including high-performance computing systems. In addition to the base software components delivered for installation (HELYX-GUI and HELYX-Core), the package also incorporates an extensive set of ancillary services to facilitate the deployment and usage of the software in any working environment.

HELYX features an advanced hex-dominant automatic mesh algorithm with polyhedra support which can run in parallel to generate large computational grids. The solver technology is based on the standard finite-volume approach, covering a wide range of physical models: single- and multi-phase turbulent flows (RANS, URANS, DES, LES), thermal flows with natural/forced convection, thermal/solar radiation, incompressible and compressible flow solutions, etc. In addition to these, we have developed a Generalised Internal Boundaries (GIB) method to support complex boundary motions inside the finite-volume mesh. The standard capabilities of HELYX can also be expanded with the HELYX-ADD-ONS extension modules to cover more specialised applications.

HELYX-Adjoint is a continuous adjoint CFD solver for topology and shape optimisation developed by ENGYS based on the extensive theoretical work of Dr. Carsten Othmer of Volkswagen AG, Corporate Research. The technology has been extensively proven and validated through productive use in real-life design applications, including: vehicle external aerodynamics, in-cylinder flows, HVAC ducts, turbomachinery components, battery cooling channels, among others.

mantaflow - an extensible framework for fluid simulation:

mantaflow is an open-source framework targeted at fluid simulation research in Computer Graphics and Machine Learning. Its parallelized C++ solver core, python scene definition interface and plugin system allow for quickly prototyping and testing new algorithms. A wide range of Navier-Stokes solver variants are included. It’s very versatile, and allows coupling and import/export with deep learning frameworks (e.g., tensorflow via numpy) or standalone compilation as matlab plugin. Mantaflow also serves as the simulation engine in Blender.

Feature list:

The framework can be used with or without GUI on Linux, MacOS and Windows. Here is an incomplete list of features implemented so far:

- Eulerian simulation using MAC Grids, PCG pressure solver and MacCormack advection
- Flexible particle systems
- FLIP simulations for liquids
- Surface mesh tracking
- Free surface simulations with levelsets, fast marching
- Wavelet and surface turbulence
- K-epsilon turbulence modeling and synthesis
- Maya and Blender export for rendering

Mantaflow’s particular selling point is producing stunning 3d animations as an output. It is not widely used for inference in practice; people, including the authors, seem to prefer Phiflow for that end.

DeepXDE is the reference solver implementation for PINNs and DeepONet (Lu et al. 2021).

Use DeepXDE if you need a deep learning library that

- solves forward and inverse partial differential equations (PDEs) via physics-informed neural network (PINN),
- solves forward and inverse integro-differential equations (IDEs) via PINN,
- solves forward and inverse fractional partial differential equations (fPDEs) via fractional PINN (fPINN),
- approximates functions from multi-fidelity data via multi-fidelity NN (MFNN),
- approximates nonlinear operators via deep operator network (DeepONet),
- approximates functions from a dataset with/without constraints.

You might need to moderate your expectations a little.
I did, after that bold description.
This is an impressive library, but as covered above, some of the types of problems that it can solve are more limited than one might hope upon reading the description.
Think of it as a neural network library that handles*certain* PDE calculations and you will not go too far astray.

TenFEM offers a small selection of differentiable FEM solvers for TensorFlow.

“Sparse simulator” Taichi (Hu et al. 2019) is presumably also able to solve PDEs? 🤷🏼‍♂️ If so, that would be nifty, because it is also differentiable. I suspect it is more of a graph network approach.

Holl, Philipp, Vladlen Koltun, Kiwon Um, and Nils Thuerey. 2020.“Phiflow: A Differentiable PDE Solving Framework for Deep Learning via Physical Simulations.” In*NeurIPS Workshop*.

Hu, Yuanming, Tzu-Mao Li, Luke Anderson, Jonathan Ragan-Kelley, and Frédo Durand. 2019.“Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures.”*ACM Transactions on Graphics* 38 (6): 1–16.

Kochkov, Dmitrii, Jamie A. Smith, Ayya Alieva, Qing Wang, Michael P. Brenner, and Stephan Hoyer. 2021.“Machine Learning–Accelerated Computational Fluid Dynamics.”*Proceedings of the National Academy of Sciences* 118 (21).

Krämer, Nicholas, Nathanael Bosch, Jonathan Schmidt, and Philipp Hennig. 2021.“Probabilistic ODE Solutions in Millions of Dimensions.” arXiv.

Lu, Lu, Xuhui Meng, Zhiping Mao, and George Em Karniadakis. 2021.“DeepXDE: A Deep Learning Library for Solving Differential Equations.”*SIAM Review* 63 (1): 208–28.

Mitusch, Sebastian K., Simon W. Funke, and Jørgen S. Dokken. 2019.“Dolfin-Adjoint 2018.1: Automated Adjoints for FEniCS and Firedrake.”*Journal of Open Source Software* 4 (38): 1292.

Naumann, Uwe. 2011.*The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation*. Society for Industrial and Applied Mathematics.

Pachalieva, Aleksandra, Daniel O’Malley, Dylan Robert Harp, and Hari Viswanathan. 2022.“Physics-Informed Machine Learning with Differentiable Programming for Heterogeneous Underground Reservoir Pressure Management.” arXiv.

Papoutsis-Kiachagias, Evangelos, Konstantinos Gkaragkounis, Andreas-Stefanos Margetis, Themis Skamagkis, Varvara Asouti, and Kyriakos Giannakoglou. 2021.“adjointOptimisationFoam: An Openfoam-Based Framework for Adjoint-Assisted Optimisation.” In*14th International Conference on Evolutionary and Deterministic Methods for Design, Optimization and Control*, 191–206. Athens, Greece: Institute of Structural Analysis and Antiseismic Research National Technical University of Athens.

IPython was the first mass-market interactive python upgrade. The python-specific part of jupyter, which can also run without jupyter. Long story. But think of it as a REPL, a CLI-style execution environment, that is a little friendlier than naked python and has colourisation and autocomplete and such. And also is complex in confusing ways.

Here are some notes for its care and feeding.

To configure IPython we need a config profile:

`ipython profile create`

IPython config is per-default located in

`(ipython locate profile default)/ipython_config.py`

This is all built upon `ipython`, so you invoke the debugger ipython-style, specifically:

```
from IPython.core.debugger import Tracer; Tracer()() # < 5.1
from IPython.core.debugger import set_trace; set_trace() # >= v5.1
```

See alsogeneric python debugging.

Check out the ipython rich display protocol, which allows us to render objects as arbitrary graphics. This extends the `__str__()` and `__repr__()` methods from ordinary python.

Rich display is especially useful in thejupyter frontends, which permit graphics. Some examples:

- nbviewer examples of how to use that are helpful.
- lovely-numpy displays many array types (e.g. numpy, pytorch etc) gracefully.
- The IPython display protocol was what I used to create `latex_fragment`, which can display arbitrary latex inline.

How to display my own things nicely? The display API docs explain that you should implement methods on your objects such as, e.g., `_repr_svg_`. This is how the `latex_fragment` library works, for example:

```
import matplotlib.pyplot as plt
from IPython.core.pylabtools import print_figure


class PlottableData:
    def __init__(self, data):
        self.data = data
        self._png_data = None  # cache for the rendered figure

    def _repr_latex_(self):
        return r"$x_n$"  # placeholder title for the plot

    def _figure_data(self, format):
        fig, ax = plt.subplots()
        ax.plot(self.data, 'o')
        ax.set_title(self._repr_latex_())
        data = print_figure(fig, format)
        # We MUST close the figure, otherwise
        # IPython’s display machinery
        # will pick it up and send it as output,
        # resulting in double display
        plt.close(fig)
        return data

    # Here we define the special repr method
    # that provides the IPython display protocol.
    # Note that we cache the figure data once computed.
    def _repr_png_(self):
        if self._png_data is None:
            self._png_data = self._figure_data('png')
        return self._png_data
```
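For a smaller self-contained sketch (hypothetical class, my own): any object that implements `_repr_html_` is rendered as HTML in jupyter front ends, with `__repr__` as the plain-text fallback.

```python
class Fraction:
    """Hypothetical toy object implementing the IPython display protocol."""
    def __init__(self, num, den):
        self.num, self.den = num, den

    def __repr__(self):
        # Plain-text fallback, used outside rich front ends
        return f"Fraction({self.num}, {self.den})"

    def _repr_html_(self):
        # Jupyter front ends call this and render the returned HTML
        return f"<sup>{self.num}</sup>&frasl;<sub>{self.den}</sub>"

f = Fraction(1, 2)  # displays as a typeset fraction in jupyter
```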

IPython’s history obsession points out that big memory allocations can hang around in jupyter (well, ipython) for quite a while.

So the output from line 12 can be obtained as `_12`, `Out[12]` or `_oh[12]`. If you accidentally overwrite the Out variable you can recover it by typing `Out=_oh` at the prompt. This system obviously can potentially put heavy memory demands on your system, since it prevents Python’s garbage collector from removing any previously computed results. You can control how many results are kept in memory with the configuration option `InteractiveShell.cache_size`. If you set it to 0, output caching is disabled. You can also use the `%reset` and `%xdel` magics to clear large items from memory.

Workarounds:

- In jupyter, execute `%config ZMQInteractiveShell.cache_size = 0`, although this does not work in all jupyter front ends.
- Edit the config file and add `c.ZMQInteractiveShell.cache_size = 0`.
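Concretely, the config-file workaround looks something like this (my sketch; setting the option on the `InteractiveShell` base class covers both terminal IPython and jupyter kernels):

```python
# In (ipython locate profile default)/ipython_config.py
c = get_config()  # noqa: injected by IPython when it loads this file
c.InteractiveShell.cache_size = 0  # disable output caching entirely
```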

The ecosystem that supports tab-completion is fragile and lackadaisical. Most recently for me, autocomplete was broken because the sensitive dependencies of `jedi` are managed by cowboys. The fix for that particular version was

`conda install jedi==0.17.2`

Interpretations and tricks for matrix square roots. TBC

Perturbations of low-rank matrices have non-low-rank roots, but there exist efficient algorithms for finding them, at least. Fasi, Higham, and Liu (2023):

We consider the problem of computing the square root of a perturbation of the scaled identity matrix,\(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), where\(\mathrm{U}\) and\(\mathrm{V}\) are\(n \times k\) matrices with\(k \leq n\). This problem arises in various applications, including computer vision and optimization methods for machine learning. We derive a new formula for the\(p\) th root of\(\mathrm{A}\) that involves a weighted sum of powers of the\(p\) th root of the\(k \times k\) matrix\(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\). This formula is particularly attractive for the square root, since the sum has just one term when\(p=2\). We also derive a new class of Newton iterations for computing the square root that exploit the low-rank structure.

Their method works for low-rank-plus-scaled-identity matrices with no negative eigenvalues.

Theorem: Let\(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with\(k \leq n\) and assume that\(\mathrm{V}^{H} \mathrm{U}\) is nonsingular. Let\(f\) be defined on the spectrum of\(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\), and if\(k=n\) let\(f\) be defined at\(\alpha\). Then\[\quad f(\mathrm{A})=f(\alpha) \mathrm{I}_n+\mathrm{U}\left(\mathrm{V}^{H} \mathrm{U}\right)^{-1}\left(f\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)-f(\alpha) \mathrm{I}_k\right) \mathrm{V}^{H}.\] The theorem says two things: that\(f(\mathrm{A})\), like\(\mathrm{A}\), is a perturbation of rank at most\(k\) of the identity matrix and that\(f(\mathrm{A})\) can be computed by evaluating\(f\) and the inverse at two\(k \times k\) matrices.
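A quick numerical sanity check of the theorem (my own; using \(f(\mathrm{X})=\mathrm{X}^2\), since squaring is easy to verify exactly and then \(f(\alpha)=\alpha^2\)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, alpha = 40, 4, 1.5
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
A = alpha * np.eye(n) + U @ V.T

f = lambda M: M @ M              # a matrix function we can check exactly
S = alpha * np.eye(k) + V.T @ U  # the small k x k matrix from the theorem

# f(A) = f(alpha) I_n + U (V^T U)^{-1} (f(S) - f(alpha) I_k) V^T
lhs = f(A)
rhs = (alpha**2 * np.eye(n)
       + U @ np.linalg.inv(V.T @ U) @ (f(S) - alpha**2 * np.eye(k)) @ V.T)
assert np.allclose(lhs, rhs)
```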

I would call this a generalized Woodbury formula, and I think it is pretty cool; which tells you something about my current obsession profile. Anyway, they use it to discover the following:

Let\(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with\(k \leq n\) have full rank and let the matrix\(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on\(\mathbb{R}^{-}\). Then for any integer\(p \geq 1\),\[ \mathrm{A}^{1 / p}=\alpha^{1 / p} \mathrm{I}_n+\mathrm{U}\left(\sum_{i=0}^{p-1} \alpha^{i / p} \cdot\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{(p-i-1) / p}\right)^{-1} \mathrm{V}^{H} \]

and in particular

Let\(\mathrm{U}, \mathrm{V} \in \mathbb{C}^{n \times k}\) with\(k \leq n\) have full rank and let the matrix\(\mathrm{A}=\alpha \mathrm{I}_n+\mathrm{U} \mathrm{V}^{H}\) have no eigenvalues on\(\mathbb{R}^{-}\). Then\[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{U}\left(\left(\alpha \mathrm{I}_k+\mathrm{V}^{H} \mathrm{U}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{V}^{H}. \]

They also derive an explicit iteration for calculating it, namely the Denman–Beavers iteration:

The (scaled) DB iteration is\[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}, \end{aligned} \] where the positive scaling parameter\(\mu_i \in \mathbb{R}\) can be used to accelerate the convergence of the method in its initial steps. The choice\(\mu_i=1\) yields the unscaled DB method, for which\(\mathrm{X}_i\) and\(\mathrm{Y}_i\) converge quadratically to\(\mathrm{A}^{1 / 2}\) and\(\mathrm{A}^{-1 / 2}\), respectively.

… for\(i \geq 0\) the iterates\(\mathrm{X}_i\) and\(\mathrm{Y}_i\) can be written in the form\[ \begin{aligned} & \mathrm{X}_i=\beta_i \mathrm{I}_n+\mathrm{U} \mathrm{B}_i \mathrm{V}^{H}, \quad \beta_i \in \mathbb{C}, \quad \mathrm{B}_i \in \mathbb{C}^{k \times k}, \\ & \mathrm{Y}_i=\gamma_i \mathrm{I}_n+\mathrm{U} \mathrm{C}_i \mathrm{V}^{H}, \quad \gamma_i \in \mathbb{C}, \quad \mathrm{C}_i \in \mathbb{C}^{k \times k} . \\ & \end{aligned}\]

This gets us a computational speedup, although of a rather complicated kind. For a start its constant factor is very favourable compared to the naive approach, but it also has a somewhat favourable scaling with\(n\), being less-than-cubic although more than quadratic depending on some optimisation convergence rates, which depends both on the problem and upon optimal selection of\(\beta_i,\gamma_i\), which they give a recipe for but it gets kinda complicated and engineering-y.

Anyway, let us suppose\(\mathrm{U}=\mathrm{V}=\mathrm{Z}\) and replay that. Then

\[ \mathrm{A}^{1 / 2}=\alpha^{1 / 2} \mathrm{I}_n+\mathrm{Z}\left(\left(\alpha \mathrm{I}_k+\mathrm{Z}^{H} \mathrm{Z}\right)^{1 / 2}+\alpha^{1 / 2} \mathrm{I}_k\right)^{-1} \mathrm{Z}^{H}. \]
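We can check that formula numerically (my own sketch; with \(\mathrm{U}=\mathrm{V}=\mathrm{Z}\) the matrix \(\mathrm{A}\) is SPD, so the principal square root of the small Hermitian \(k\times k\) matrix is obtained by eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, alpha = 50, 3, 2.0
Z = rng.standard_normal((n, k))
A = alpha * np.eye(n) + Z @ Z.T  # SPD, so no eigenvalues on R^-

# Principal square root of the small Hermitian matrix alpha*I_k + Z^T Z
S = alpha * np.eye(k) + Z.T @ Z
w, Q = np.linalg.eigh(S)
S_half = (Q * np.sqrt(w)) @ Q.T

# Low-rank formula for A^{1/2}: only k x k factorizations and inverses
core = np.linalg.inv(S_half + np.sqrt(alpha) * np.eye(k))
A_half = np.sqrt(alpha) * np.eye(n) + Z @ core @ Z.T

assert np.allclose(A_half @ A_half, A)
```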

The Denman–Beavers step becomes

\[ \begin{aligned} \mathrm{X}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{X}_i+\mu_i^{-1} \mathrm{Y}_i^{-1}\right), & \mathrm{X}_0=\mathrm{A}, \\ \mathrm{Y}_{i+1} & =\frac{1}{2}\left(\mu_i \mathrm{Y}_i+\mu_i^{-1} \mathrm{X}_i^{-1}\right), & \mathrm{Y}_0=\mathrm{I}, \end{aligned} \]

This looked useful but now I note that this is giving us a full-size square root, rather than a low-rank square root, which is not helpful to me.

Del Moral, Pierre, and Angele Niclas. 2018.“A Taylor Expansion of the Square Root Matrix Functional.” arXiv.

Dolcetti, Alberto, and Donato Pertici. 2020.“Real Square Roots of Matrices: Differential Properties in Semi-Simple, Symmetric and Orthogonal Cases.” arXiv.

Fasi, Massimiliano, Nicholas J. Higham, and Xiaobo Liu. 2023.“Computing the Square Root of a Low-Rank Perturbation of the Scaled Identity Matrix.”*SIAM Journal on Matrix Analysis and Applications* 44 (1): 156–74.

Kessy, Agnan, Alex Lewin, and Korbinian Strimmer. 2018.“Optimal Whitening and Decorrelation.”*The American Statistician* 72 (4): 309–14.

Minka, Thomas P. 2000.*Old and new matrix algebra useful for statistics*.

Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012.“The Matrix Cookbook.”

Pleiss, Geoff, Martin Jankowiak, David Eriksson, Anil Damle, and Jacob Gardner. 2020.“Fast Matrix Square Roots with Applications to Gaussian Processes and Bayesian Optimization.”*Advances in Neural Information Processing Systems* 33.

Song, Yue, Nicu Sebe, and Wei Wang. 2022.“Fast Differentiable Matrix Square Root.” In.

`jax` is a successor to classic python+numpy `autograd`. It includes various code optimisations: jit-compilation, differentiation and vectorization.

So, a numerical library with certain high-performance machine-learning affordances.
Note, it is not a deep learning framework *per se*, but rather the producer species at the lowest trophic level of a deep learning ecosystem.
For information on frameworks built upon it (or, I suppose, in this metaphor, the *predator species*) read on to later sections.

The official pitch:

JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

What’s new is that JAX uses XLA to compile and run your NumPy programs on GPUs and TPUs. Compilation happens under the hood by default, with library calls getting just-in-time compiled and executed. But JAX also lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API, `jit`. Compilation and automatic differentiation can be composed arbitrarily, so you can express sophisticated algorithms and get maximal performance without leaving Python. Dig a little deeper, and you’ll see that JAX is really an extensible system for composable function transformations. Both `grad` and `jit` are instances of such transformations. Another is `vmap` for automatic vectorization, with more to come. This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!

AFAICT the conda installation command is

`conda install -c conda-forge jaxlib`
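A tiny demonstration of that composability (function and variable names are my own):

```python
import jax
import jax.numpy as jnp

def predict(w, x):
    return jnp.tanh(w * x)

def loss(w, x, y):
    return (predict(w, x) - y) ** 2

# grad and jit compose: a compiled gradient-of-loss function
grad_loss = jax.jit(jax.grad(loss))

# vmap composes too: map over a batch of (x, y) pairs, sharing w
batched_grad = jax.vmap(grad_loss, in_axes=(None, 0, 0))

xs = jnp.linspace(-1.0, 1.0, 8)
ys = jnp.zeros(8)
g = batched_grad(0.5, xs, ys)  # one gradient per batch element
```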

- You don’t know jax is a popular intro.
- n2cholas/awesome-jax
- Shailesh Kumar, Wavelet Transforms in Python with Google JAX
- CR.Sparse — a JAX/XLA based library of accelerated models and algorithms for inverse problems in sparse representation and compressive sensing.

Jax has idioms that are not obvious. For me it was not clear how to use batch vectorizing and functional-style application of structures:

Sabrina J. Mielke, From PyTorch to JAX: towards neural net frameworks that purify stateful code:

Maybe you decided to look at libraries like

`flax`

,`trax`

, or`haiku`

and what you see at least in the ResNet examples looks not too dissimilar from any other framework: define some layers, run some trainers… but what is it that actually happens there? What’s the route from these tiny numpy functions to training big hierarchical neural nets?That’s the niche this post is trying to fill. We will:

- quickly recap a stateful LSTM-LM implementation in a tape-based gradient framework, specifically PyTorch,
- see how PyTorch-style coding relies on mutating state, learn about mutation-free *pure* functions and build (pure) zappy one-liners in JAX,
- step-by-step go from individual parameters to medium-size modules by registering them as pytree nodes,
- combat growing pains by building fancy scaffolding, and controlling context to extract initialized parameters and purify functions, and
- realize that we could get that easily in a framework like DeepMind’s `haiku` using its `transform` mechanism.
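The destination of that journey can be compressed into a few lines: parameters live in an ordinary pytree (here a dict), every function is pure, and a training step maps (params, data) to new params. A toy sketch of the style, not Mielke’s actual code:

```python
import jax
import jax.numpy as jnp

# Parameters live in a plain pytree (here, a dict); every function stays pure.
def init_params(key):
    wkey, bkey = jax.random.split(key)
    return {"w": jax.random.normal(wkey, (3,)), "b": jax.random.normal(bkey, ())}

def predict(params, x):
    return jnp.dot(params["w"], x) + params["b"]

def loss(params, x, y):
    return (predict(params, x) - y) ** 2

@jax.jit
def sgd_step(params, x, y, lr=0.1):
    # grads come back with the same pytree structure as params
    grads = jax.grad(loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
```

No hidden state anywhere: the "model" is just `predict` plus whichever params pytree you thread through it.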

One thing I see often in examples is

```
from jax.config import config
config.enable_omnistaging()
```

Do I need to care about it? **tl;dr**: omnistaging is good and necessary, and also switched on by default in recent JAX, so that line is simply being careful and is likely unneeded.

OK, elegant linear algebra is all well and good, but can I also have some standard neural network libraries with convnets and dropout layers and SGD and all that standard machinery? Yes! In fact I can have a huge menu of very similar libraries, and now all the computation time I saved by using JAX must be spent on working out which flavour of JAX libraries I actually want. That sounds snarky because it is; I’m not a huge fan of any of these frameworks. All of them have friction between the pure and beautiful functional style of JAX code and the object-oriented conveniences that deep learning people are used to. The exception is Stax, but that looks vaguely abandoned. Equinox claims to address this problem but I have not tested it yet.

Flax was, I think, the *de facto* standard deep learning library for JAX, and may be still.

Flax is a high-performance neural network library for JAX that is designed for flexibility: Try new forms of training by forking an example and by modifying the training loop, not by adding features to a framework.

Flax is being developed in close collaboration with the JAX team and comes with everything you need to start your research, including:

Neural network API (`flax.linen`): Dense, Conv, {Batch|Layer|Group} Norm, Attention, Pooling, {LSTM|GRU} Cell, Dropout

Utilities and patterns: replicated training, serialization and checkpointing, metrics, prefetching on device

Educational examples that work out of the box: MNIST, LSTM seq2seq, Graph Neural Networks, Sequence Tagging

Fast, tuned large-scale end-to-end examples: CIFAR10, ResNet on ImageNet, Transformer LM1b

I think the Google Brain team has moved on, but this now has momentum? It has quirks (e.g. why do some modules assume batching and others not? No hints), but it more or less can be cargo-culted and you can ignore the quirks except sometimes.

See also the WIP documentation notebooks.
Those answered some of my questions, but I still have questions left over due to various annoying rough edges and non-obvious gotchas.
For example, if you miss a parameter needed for a given model, the error is `FilteredStackTrace: AssertionError: Need PRNG for "params"`.

There are some good examples in the repository.

~~I have the vague feeling that this will be abandoned for a more polished interface soon.~~
Still seems actively developed.

Kidger and Garcia (2021):

JAX and PyTorch are two popular Python autodifferentiation frameworks. JAX is based around pure functions and functional programming. PyTorch has popularised the use of an object-oriented (OO) class-based syntax for defining parameterised functions, such as neural networks. That this seems like a fundamental difference means current libraries for building parameterised functions in JAX have either rejected the OO approach entirely (Stax) or have introduced OO-to-functional transformations, multiple new abstractions, and been limited in the extent to which they integrate with JAX (Flax, Haiku, Objax). Either way this OO/functional difference has been a source of tension. Here, we introduce `Equinox`, a small neural network library showing how a PyTorch-like class-based approach may be admitted without sacrificing JAX-like functional programming. We provide two main ideas. One: parameterised functions are themselves represented as `PyTrees`, which means that the parameterisation of a function is transparent to the JAX framework. Two: we filter a PyTree to isolate just those components that should be treated when transforming (`jit`, `grad` or `vmap`-ing) a higher-order function of a parameterised function – such as a loss function applied to a model. Overall Equinox resolves the above tension without introducing any new programmatic abstractions: only PyTrees and transformations, just as with regular JAX. Equinox is available at https://github.com/patrick-kidger/equinox

Equinox is part of Patrick Kidger’s recommendations:

- Neural networks: Equinox.
- Numerical differential equation solvers: Diffrax.
- Computer vision models: Eqxvision.
- SymPy↔︎JAX conversion; train symbolic expressions via gradient descent: sympy2jax.

Rob Salomone recommends stax, which ships with JAX. It has an alarming disclaimer:

You likely do not mean to import this module! Stax is intended as an example library only. There are a number of other much more fully-featured neural network libraries for JAX…

Documentation seems absent. Here are some examples of stax in action.

Unique value proposition: stax attempts to stay close to JAX’s functional style, unlike the more object-oriented contenders.

DeepMind-flavoured. Haiku Documentation:

Haiku is a simple neural network library for JAX that enables users to use familiar object-oriented programming models while allowing full access to JAX's pure function transformations. Haiku is designed to make the common things we do such as managing model parameters and other model state simpler and similar in spirit to the Sonnet library that has been widely used across DeepMind. It preserves Sonnet's module-based programming model for state management while retaining access to JAX's function transformations. Haiku can be expected to compose with other libraries and work well with the rest of JAX.

Looks unmaintained.

Trax is an end-to-end library for deep learning that focuses on clear code and speed. It is actively used and maintained in the Google Brain team. This notebook (run it in colab) shows how to use Trax and where you can find more information.

Trax includes basic models (like ResNet, LSTM, Transformer) and RL algorithms (like REINFORCE, A2C, PPO). It is also actively used for research and includes new models like the Reformer and new RL algorithms like AWR. Trax has bindings to a large number of deep learning datasets, including Tensor2Tensor and TensorFlow datasets.

You can use Trax either as a library from your own python scripts and notebooks or as a binary from the shell, which can be more convenient for training large models. It runs without any changes on CPUs, GPUs and TPUs.

Optax is a gradient processing and optimization library for JAX. It is designed to facilitate research by providing building blocks that can be recombined in custom ways in order to optimise parametric models such as, but not limited to, deep neural networks.

Our goals are to

Provide readable, well-tested, efficient implementations of core components,

Improve researcher productivity by making it possible to combine low level ingredients into custom optimiser (or other gradient processing components).

Accelerate adoption of new ideas by making it easy for anyone to contribute.

We favour focusing on small composable building blocks that can be effectively combined into custom solutions. Others may build upon these basic components more complicated abstractions. Whenever reasonable, implementations prioritise readability and structuring code to match standard equations, over code reuse.

Numpyro seems to be the dominant probabilistic programming system. It is a JAX port/implementation/something of the PyTorch classic, Pyro.

More fringe but possibly interesting, jax-md does molecular dynamics. ladax (“LADAX: Layers of distributions using FLAX/JAX”) does some kind of latent RV something.

The creators of Stheno seem to be Invenia, some of whose staff I am connected to in various indirect ways. It targets JAX as one of several backends via a generic backend library, wesselb/lab: A generic interface for linear algebra backends.

Placeholder; details TBD.

Trying to do inference with differential equations? Can’t use julia? JAX might do instead.

Diffrax is a JAX-based library providing numerical differential equation solvers.

Hardware accelerated, batchable and differentiable optimizers inJAX.

Hardware accelerated: our implementations run on GPU and TPU, in addition to CPU.

Batchable: multiple instances of the same optimization problem can be automatically vectorized using JAX's vmap.

Differentiable: optimization problem solutions can be differentiated with respect to their inputs either implicitly or via autodiff of unrolled algorithm iterations.

TF2JAX is an experimental library for converting TensorFlow functions/graphs to JAX functions.

JAX natively handles multi-GPU via pmap, but how to use it? The haiku example makes it clearer.
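A minimal `pmap` sketch: the leading axis of the input is split across local devices (on a plain single-CPU machine the device count is just one, so this degenerates gracefully):

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()            # e.g. 1 on a plain CPU box
xs = jnp.arange(n * 4.0).reshape(n, 4)  # leading axis = one slice per device

# each device computes the sum of squares of its own slice
out = jax.pmap(lambda x: jnp.sum(x ** 2))(xs)
```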

How do we get networks into and out of the Jax ecosystem?

Blondel, Mathieu, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, and Jean-Philippe Vert. 2021. “Efficient and Modular Implicit Differentiation.” *arXiv:2105.15183 [Cs, Math, Stat]*, October.

Hessel, Matteo, David Budden, Fabio Viola, Mihaela Rosca, Eren Sezener, and Tom Hennigan. 2020. “Optax: Composable Gradient Transformation and Optimisation, in JAX!”

Kidger, Patrick, and Cristian Garcia. 2021. “Equinox: Neural Networks in JAX via Callable PyTrees and Filtered Transformations.” *arXiv:2111.00254 [Cs]*, October.

Krämer, Nicholas, Nathanael Bosch, Jonathan Schmidt, and Philipp Hennig. 2021. “Probabilistic ODE Solutions in Millions of Dimensions.” arXiv.

This article was originally split off from autoML, although neither topic is a strict subset of the other.

The art of choosing the best hyperparameters for an ML model’s algorithms, of which there may be many.

Should one bother getting fancy about this? Ben Recht argues that often random search is competitive with highly tuned Bayesian methods in hyperparameter tuning. Kevin Jamieson argues you can be cleverer than that though. Let’s inhale some hype.

In practice this hyperparameter thing is integrated with the problems both of configuring ML and of tracking progress; see also those pages for practical implementation notes.

Loosely, we think of interpolating between observations of a loss surface and guessing where the optimal point is. See Bayesian optimisation. This is generic. Not as popular in practice as I might have assumed, because it turns out to be fairly greedy with data and does not exploit problem-specific ideas, such as early stopping, which saves time and is in any case a useful type of neural net regularisation.

This leads to difficulty. See multi-objective optimisation.

Just what you would think.

Now it comes in an adaptive flavour that leverages the SGD fitting method, e.g. Liam Li et al. (2020), called hyperband (Lisha Li et al. 2017) / ASHA.
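The trick underlying hyperband/ASHA is successive halving: evaluate many configurations cheaply, keep the best fraction, and give the survivors a bigger budget. A toy sketch (names and signature are my own, not any library’s API):

```python
def successive_halving(configs, evaluate, budget=1, eta=2, rungs=3):
    """Keep the best 1/eta of configs at each rung, with eta-times more budget.

    `evaluate(config, budget)` should return a loss (lower is better).
    """
    for _ in range(rungs):
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[:max(1, len(configs) // eta)]  # survivors only
        budget *= eta                                   # ...get more budget
    return configs[0]

# toy objective: pretend the best learning rate is 0.3
best = successive_halving(
    configs=[i / 10 for i in range(10)],
    evaluate=lambda c, b: (c - 0.3) ** 2,
)
```

Hyperband proper additionally hedges over how aggressive the halving schedule should be, but this is the core loop.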

A synoptic overview of the trendiest strategies can be found in Peter Cotton’s microprediction/humpday: Elo ratings for global black box derivative-free optimizers:

Behold! Fifty strategies assigned Elo ratings depending on dimension of the problem and number of function evaluations allowed.

Hello and welcome to HumpDay, a package that helps you choose a Python global optimizer package, and strategy therein, from Ax-Platform, bayesian-optimization, DLib, HyperOpt, NeverGrad, Optuna, Platypus, PyMoo, PySOT, Scipy classic and shgo, Skopt, nlopt, Py-Bobyqa, UltraOpt and maybe others by the time you read this. It also presents some of their functionality in a common calling syntax.

The introductory blog posts are enlightening:

- Comparing Python Global Optimization Packages
- HumpDay: A Package to Help You Choose a Python Global Optimizer

Most of the implementations use, explicitly or implicitly, a surrogate model for parameter tuning, but wrap it with some tools to control and launch experiments in parallel, early termination etc.

Arranged so that the top few are hyped and popular and after that are less renowned hipster options.

Not yet filed:

- Keras Tuner
- Tune: Scalable Hyperparameter Tuning — Ray v2.0.0.dev0
- Welcome To Neural Network Intelligence !!! — An open source AutoML toolkit for neural architecture search, model compression and hyper-parameter tuning (NNI v2.0)
- AutoGluon: AutoML Toolkit for Deep Learning — AutoGluon Documentation 0.0.14 documentation

determined includes hyperparameter tuning which is not in fact a surrogate surface, but an early-stopping pruning of crappy models in a random search, i.e. fancy random search.

Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Core features:

- Launch a multi-node distributed hyperparameter sweep in less than 10 lines of code.
- Supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.
- Automatically manages checkpoints and logging to TensorBoard.
- Choose among state of the art algorithms such as Population Based Training (PBT), BayesOptSearch, HyperBand/ASHA (Liam Li et al. 2020).

optuna (Akiba et al. 2019) supports fancy neural net training; similar to hyperopt AFAICT, except that it supports Covariance Matrix Adaptation, whatever that is? (See Hansen (2016).)

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to our define-by-run API, the code written with Optuna enjoys high modularity, and the user of Optuna can dynamically construct the search spaces for the hyperparameters.

`hyperopt` (J. Bergstra, Yamins, and Cox 2013) is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.

Currently two algorithms are implemented in hyperopt:

- Random Search
- Tree of Parzen Estimators (TPE)

Hyperopt has been designed to accommodate Bayesian optimization algorithms based on Gaussian processes and regression trees, but these are not currently implemented.

All algorithms can be run either serially, or in parallel by communicating via MongoDB or Apache Spark.

auto-sklearn has recently been upgraded. Details TBD(Feurer et al. 2020).

`skopt` (aka `scikit-optimize`)

[…] is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.

Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper (Snoek, Larochelle, and Adams 2012).

The code consists of several parts. It is designed to be modular to allow swapping out various ‘driver’ and ‘chooser’ modules. The ‘chooser’ modules are implementations of acquisition functions such as expected improvement, UCB or random. The drivers determine how experiments are distributed and run on the system. As the code is designed to run experiments in parallel (spawning a new experiment as soon a result comes in), this requires some engineering.

`Spearmint2` is similar, but more recently updated and fancier; however it has a restrictive license prohibiting wide redistribution without the payment of fees. You may or may not wish to trust the implied level of development and support of 4 Harvard Professors, depending on your application.

Both of the Spearmint options (especially the latter) have opinionated choices of technology stack in order to do their optimizations, which means they can do more work for you, but require more setup, than a simple little thing like `skopt`. Depending on your computing environment this might be an overall plus or a minus.

`SMAC` (AGPLv3) (sequential model-based algorithm configuration) is a versatile tool for optimizing algorithm parameters (or the parameters of some other process we can run automatically, or a function we can evaluate, such as a simulation).

SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.

We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.

Python interface through pysmac.

Won the land-grab for the name `automl` but is now unmaintained.

A quick overview of buzzwords, this project automates:

- Analytics (pass in data, and auto_ml will tell you the relationship of each variable to what it is you’re trying to predict).
- Feature Engineering (particularly around dates, and soon, NLP).
- Robust Scaling (turning all values into their scaled versions between the range of 0 and 1, in a way that is robust to outliers, and works with sparse matrices).
- Feature Selection (picking only the features that actually prove useful).
- Data formatting (turning a list of dictionaries into a sparse matrix, one-hot encoding categorical variables, taking the natural log of y for regression problems).
- Model Selection (which model works best for your problem).
- Hyperparameter Optimization (what hyperparameters work best for that model).
- Ensembling Subpredictors (automatically training up models to predict smaller problems within the meta problem).
- Ensembling Weak Estimators (automatically training up weak models on the larger problem itself, to inform the meta-estimator’s decision).

Abdel-Gawad, Ahmed, and Simon Ratner. 2007. “Adaptive Optimization of Hyperparameters in L2-Regularised Logistic Regression.”

Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” In *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*.

Bengio, Yoshua. 2000. “Gradient-Based Optimization of Hyperparameters.” *Neural Computation* 12 (8): 1889–1900.

Bergstra, James S., Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. “Algorithms for Hyper-Parameter Optimization.” In *Advances in Neural Information Processing Systems*, 2546–54. Curran Associates, Inc.

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” *Journal of Machine Learning Research* 13: 281–305.

Bergstra, J, D Yamins, and D D Cox. 2013. “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures.” In *ICML*, 9.

Domke, Justin. 2012. “Generic Methods for Optimization-Based Modeling.” In *International Conference on Artificial Intelligence and Statistics*, 318–26.

Eggensperger, Katharina, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger H. Hoos, and Kevin Leyton-Brown. n.d. “Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters.”

Eigenmann, R., and J. A. Nossek. 1999. “Gradient Based Adaptive Regularization.” In *Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468)*, 87–94.

Feurer, Matthias, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. “Auto-Sklearn 2.0: The Next Generation.” *arXiv:2007.04074 [Cs, Stat]*, July.

Feurer, Matthias, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. “Efficient and Robust Automated Machine Learning.” In *Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2962–70. Curran Associates, Inc.

Foo, Chuan-sheng, Chuong B. Do, and Andrew Y. Ng. 2008. “Efficient Multiple Hyperparameter Learning for Log-Linear Models.” In *Advances in Neural Information Processing Systems 20*, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 377–84. Curran Associates, Inc.

Franceschi, Luca, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017a. “On Hyperparameter Optimization in Learning Systems.” In.

———. 2017b. “Forward and Reverse Gradient-Based Hyperparameter Optimization.” In *International Conference on Machine Learning*, 1165–73. PMLR.

Fu, Jie, Hongyin Luo, Jiashi Feng, Kian Hsiang Low, and Tat-Seng Chua. 2016. “DrMAD: Distilling Reverse-Mode Automatic Differentiation for Optimizing Hyperparameters of Deep Neural Networks.” In *Proceedings of IJCAI, 2016*.

Gelbart, Michael A., Jasper Snoek, and Ryan P. Adams. 2014. “Bayesian Optimization with Unknown Constraints.” In *Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence*, 250–59. UAI’14. Arlington, Virginia, United States: AUAI Press.

Grünewälder, Steffen, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. 2010. “Regret Bounds for Gaussian Process Bandit Problems.” In, 9:273–80.

Hansen, Nikolaus. 2016. “The CMA Evolution Strategy: A Tutorial.” *arXiv:1604.00772 [Cs, Stat]*, April.

Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown. 2011. “Sequential Model-Based Optimization for General Algorithm Configuration.” In *Learning and Intelligent Optimization*, 6683:507–23. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, Berlin, Heidelberg.

Hutter, Frank, Holger Hoos, and Kevin Leyton-Brown. 2013. “An Evaluation of Sequential Model-Based Optimization for Expensive Blackbox Functions.” In *Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation*, 1209–16. GECCO ’13 Companion. New York, NY, USA: ACM.

Jamieson, Kevin, and Ameet Talwalkar. 2015. “Non-Stochastic Best Arm Identification and Hyperparameter Optimization.” *arXiv:1502.07943 [Cs, Stat]*, February.

Kandasamy, Kirthevasan, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. 2018. “Parallelised Bayesian Optimisation via Thompson Sampling.” In *International Conference on Artificial Intelligence and Statistics*, 133–42. PMLR.

Li, Liam, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. 2020. “A System for Massively Parallel Hyperparameter Tuning.” *arXiv:1810.05934 [Cs, Stat]*, March.

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. “Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits.” *arXiv:1603.06560 [Cs, Stat]*, March.

———. 2017. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” *The Journal of Machine Learning Research* 18 (1): 6765–6816.

Liu, Hanxiao, Karen Simonyan, and Yiming Yang. 2019. “DARTS: Differentiable Architecture Search.” *arXiv:1806.09055 [Cs, Stat]*, April.

Lorraine, Jonathan, and David Duvenaud. 2018. “Stochastic Hyperparameter Optimization Through Hypernetworks.” *arXiv:1802.09419 [Cs]*, February.

Lorraine, Jonathan, Paul Vicol, and David Duvenaud. 2020. “Optimizing Millions of Hyperparameters by Implicit Differentiation.” In *International Conference on Artificial Intelligence and Statistics*, 1540–52. PMLR.

MacKay, David JC. 1999. “Comparison of Approximate Methods for Handling Hyperparameters.” *Neural Computation* 11 (5): 1035–68.

Maclaurin, Dougal, David Duvenaud, and Ryan Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In *Proceedings of the 32nd International Conference on Machine Learning*, 2113–22. PMLR.

Močkus, J. 1975. “On Bayesian Methods for Seeking the Extremum.” In *Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1–7, 1974*, edited by G. I. Marchuk, 400–404. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

O’Hagan, A. 1978. “Curve Fitting and Optimal Design for Prediction.” *Journal of the Royal Statistical Society: Series B (Methodological)* 40 (1): 1–24.

Platt, John C., and Alan H. Barr. 1987. “Constrained Differential Optimization.” In *Proceedings of the 1987 International Conference on Neural Information Processing Systems*, 612–21. NIPS’87. Cambridge, MA, USA: MIT Press.

Real, Esteban, Chen Liang, David R. So, and Quoc V. Le. 2020. “AutoML-Zero: Evolving Machine Learning Algorithms From Scratch,” March.

Salimans, Tim, Diederik Kingma, and Max Welling. 2015. “Markov Chain Monte Carlo and Variational Inference: Bridging the Gap.” In *Proceedings of the 32nd International Conference on Machine Learning (ICML-15)*, 1218–26. ICML’15. Lille, France: JMLR.org.

Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In *Advances in Neural Information Processing Systems*, 2951–59. Curran Associates, Inc.

Snoek, Jasper, Kevin Swersky, Rich Zemel, and Ryan Adams. 2014. “Input Warping for Bayesian Optimization of Non-Stationary Functions.” In *Proceedings of the 31st International Conference on Machine Learning (ICML-14)*, 1674–82.

Srinivas, Niranjan, Andreas Krause, Sham M. Kakade, and Matthias Seeger. 2012. “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design.” *IEEE Transactions on Information Theory* 58 (5): 3250–65.

Swersky, Kevin, Jasper Snoek, and Ryan P Adams. 2013. “Multi-Task Bayesian Optimization.” In *Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 2004–12. Curran Associates, Inc.

Thornton, Chris, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2013. “Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms.” In *Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 847–55. KDD ’13. New York, NY, USA: ACM.

Turner, Ryan, David Eriksson, Michael McCourt, Juha Kiili, Eero Laaksonen, Zhen Xu, and Isabelle Guyon. 2021. “Bayesian Optimization Is Superior to Random Search for Machine Learning Hyperparameter Tuning: Analysis of the Black-Box Optimization Challenge 2020.” *arXiv:2104.10201 [Cs, Stat]*, April.

Wang, Ruochen, Minhao Cheng, Xiangning Chen, Xiaocheng Tang, and Cho-Jui Hsieh. 2020. “Rethinking Architecture Selection in Differentiable NAS.” In.

Various questions about the economics of social changes wrought by ready access to LLMs, the latest generation of automation. This is the major “short”-to-“medium”-term effect, whatever those words mean. Longer-sighted persons might also care about whether AI will replace us with grey goo.

Thought had in conversation with Richard Scalzo about Smith (2022):

My mental model for disruptive technology is always in reference to snowmobiles (Pelto 1973),
and an aside attributed (perhaps) to Steve Jobs that the PC should be a *bicycle for the mind*.
I am interested in knowing what is more a *bicycle* for the mind (democratising, enabling even underdogs) and what is a *snowmobile* (cementing disparities, increasing returns to incumbents).

How do foundation models / large language models change the economics of knowledge production? Of art production?
To a first-order approximation (valid at 03/2023), LLMs provide a way of *massively compressing collective knowledge and synthesising the bits I need on demand*.
They are not*yet* directly generating novel knowledge (whatever that means).
But they do seem to be pretty good at being “nearly as smart as everyone on the internet combined”.
There is no sharp boundary between these ideas, clearly.

Deploying these models will test various hypotheses about how much of collective knowledge depends upon our participating in boring boilerplate grunt work, and what incentives are necessary to encourage us to produce and share our individual contributions to that collective intelligence.

Here is where I formulate those.

Think of Matt Might’s iconic illustrated guide to a Ph.D.

Here’s my question: Does the new map look something like this? If so is that a problem?

I’m shakier on this hypothesis. TBC

Economics of production at a microscopic, individual scale:

GPT and the Economics of Cognitively Costly Writing Tasks

To analyze the effect of GPT-4 on labor efficiency and the optimal mix of capital to labor for workers who are good at using GPT versus those who aren’t when it comes to performing cognitively costly tasks, we will consider the Goldin and Katz modified Cobb-Douglas production function
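For reference, the textbook Cobb-Douglas form being modified (my transcription of the standard equation, not necessarily the post’s exact variant) is:

```latex
Y = A \, K^{\alpha} L^{1-\alpha}, \qquad 0 < \alpha < 1,
```

where $Y$ is output, $A$ total factor productivity, $K$ capital, and $L$ labour; the argument, as I read it, is that facility with GPT effectively raises a worker’s $A$ and shifts the optimal capital-to-labour mix.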

Charlie Stross’s 2010 Spamularity stuck with me:

We are currently in the early days of an arms race, between the spammers and the authors of spam filters. The spammers are writing software to generate personalized, individualized wrappers for their advertising payloads that masquerade as legitimate communications. The spam cops are writing filters that automate the process of distinguishing a genuinely interesting human communication from the random effusions of a ’bot. And with each iteration, the spam gets more subtly targeted, and the spam filters get better at distinguishing human beings from software, in a bizarre parody of the imitation game popularized by Alan Turing (in which a human being tries to distinguish between another human being and a piece of conversational software via textual communication) — an early ad hoc attempt to invent a pragmatic test for artificial intelligence.

We have one faction that is attempting to write software that can generate messages that can pass a Turing test, and another faction that is attempting to write software that can administer an ad-hoc Turing test. Each faction has a strong incentive to beat the other. This is the classic pattern of an evolutionary predator/prey arms race: and so I deduce that if symbol-handling, linguistic artificial intelligence is possible at all, we are on course for a very odd destination indeed—the Spamularity, in which those curious lumps of communicating meat give rise to a meta-sphere of discourse dominated by parasitic viral payloads pretending to be meat…

Maggie Appleton’s commentary on the Dark Forest theory of the Internet by Yancey Strickler is another neat framing:

The dark forest theory of the web points to the increasingly life-like but life-less state of being online. Most open and publicly available spaces on the web are overrun with bots, advertisers, trolls, data scrapers, clickbait, keyword-stuffing “content creators,” and algorithmically manipulated junk.

It’s like a dark forest that seems eerily devoid of human life—all the living creatures are hidden beneath the ground or up in trees. If they reveal themselves, they risk being attacked by automated predators.

Humans who want to engage in informal, unoptimised, personal interactions have to hide in closed spaces like invite-only Slack channels, Discord groups, email newsletters, small-scale blogs, and digital gardens. Or make themselves illegible and algorithmically incoherent in public venues.

I feel like I’m going to lose this battle, but for the record, I despise the term “textpocalypse”.

Fake news, credibility, cryptographic verification of provenance, etc. TBC

- Ted Chiang, Will A.I. Become the New McKinsey? looks at LLMs through the lens of Piketty as increasing returns to capital vs returns to labour
- Leaked Google document: “We Have No Moat, And Neither Does OpenAI” asserts that large corporates are concerned that LLMs do not provide sufficient relative return to capital

George Hosu, in a short aside, highlights the incredible marketing advantage of AI:

People that failed to lift a finger to integrate better-than-doctors or work-with-doctors supervised medical models for half a century are stoked at a chatbot being as good as an average doctor and can’t wait to get it to triage patients

Google’s Bard was undone on day two by an inaccurate response in the demo video, where it suggested that the James Webb Space Telescope would take the first images of exoplanets. This sounds like something the JWST would do, but it’s not at all true. So one tweet from an astrophysicist sank Alphabet’s value by 9%. This says a lot about how a) LLMs are like being at the pub with friends: they can say things that sound plausible and true enough and no one really needs to check, because who cares? Except we do, because this is science, not a lads’ night out; and b) the insane speculative volatility of this AI bubble, where the hype is so razor thin it can be undermined by a tweet with 44 likes. I had a wonder if there’s any exploration of the ‘thickness’ of hype. Jack Stilgoe suggested looking at Borup et al., which is evergreen, but I feel like there’s something about the resilience of hype: crypto was/is pretty thin in the scheme of things, high levels of hype but frenetic, unstable and quick to collapse. AI has pretty consistent if pulsating hype gradually growing over the years, while something like nuclear fusion is super-thick (at least in the popular imagination), remaining through decades of not-quite-ready and grasping at the slightest indication of success. I don’t know; if there’s nothing specifically on this, maybe I should write it one day.

TBC

- What Will Transformers Transform? – Rodney Brooks
- Tom Stafford on ChatGPT as Ouija board
- Gradient Dissent, a list of reasons that large backpropagation-trained networks might be worrisome. There are some interesting points in there, and some hyperbole. Also: if it were true that there are externalities from backprop networks (i.e. that they are a kind of methodological *pollution* that produces private benefits but public costs), then what kind of mechanisms *should* be applied to disincentivize them?
- C&C Against Predictive Optimization

Andrus, McKane, Sarah Dean, Thomas Krendl Gilbert, Nathan Lambert, and Tom Zick. 2021.“AI Development for the Public Interest: From Abstraction Traps to Sociotechnical Risks.” arXiv.

Barke, Shraddha, Michael B. James, and Nadia Polikarpova. 2022.“Grounded Copilot: How Programmers Interact with Code-Generating Models.” arXiv.

Bowman, Samuel R. 2023.“Eight Things to Know about Large Language Models.” arXiv.

Hall, Lawrence. 1975.“The Snowmobile Revolution: Technology and Social Change in the Arctic. Pertti J. Pelto.”*Economic Development and Cultural Change* 23 (4): 769–71.

Métraux, Alfred. 1956.“A Steel Axe That Destroyed a Tribe, as an Anthropologist Sees It.”*The UNESCO Courier: A Window Open on the World* IX: 26–27.

Pelto, Pertti J. 1973.*The snowmobile revolution: technology and social change in the Arctic*. Waveland Press.

Shanahan, Murray. 2023.“Talking About Large Language Models.” arXiv.

Smith, Justin E. H. 2022.*The Internet Is Not What You Think It Is: A History, a Philosophy, a Warning*. Princeton: Princeton University Press.

Things that I think should be noted and filed in an orderly fashion, but which I lack time to address right now. Content will change incessantly.

I need to reclassify the bio computing links; that section has become confusing and there are too many nice ideas there not clearly distinguished.

All done

Experiment metascience grant

Via Louise Ord, virtuosic algorithmic line art in MetaPost

FAQ, Frequently Asked Questions about HandWiki Encyclopedia of science and computing

Matthew Feeney, Markets in fact-checking

Étienne Fortier-Dubois, The elements of scientific style

Saloni Dattani, Real peer review has never been tried

pyg-team/pytorch_geometric: Graph Neural Network Library for PyTorch

Eight Graphs That Explain Software Engineering Salaries in 2023

Typst: Compose papers faster reimagines LaTeX with modern tech and workflow. I think they are doomed, because while they solve the horrible, awful, nasty misfeatures of most of the LaTeX workflow, they do not support the part of LaTeX which people (rather than journals) need, which is to say, the mathematical markup.

LASER-UMASS/Themis: Themis™ is a software fairness tester.

Brittany Johnson-Matthews: Causal testing: understanding the root causes of defects

Tianyi Zhang: Interactive Debugging and Testing Support for Deep Learning

DeepSeer: Interactive RNN Explanation and Debugging via State Abstraction / momentum-lab-workspace/DeepSeer

Eugenio Culurciello, The fall of RNN / LSTM. We fell for Recurrent neural networks…

Eric Topol, When M.D. is a Machine Doctor.

Women in AI awards

How GNNs and Symmetries can help to solve PDEs - Max Welling - YouTube

From automatic differentiation to message passing (minka-acmll2019-slides.pdf) / From automatic differentiation to message passing - Microsoft Research

Bernhard Schölkopf: From statistical to causal learning - YouTube

Bernhard Schölkopf: Learning Causal Mechanisms (ICLR invited talk) - YouTube

XPRIZE Wildfire | XPRIZE Foundation

Grettonlecture4_introToRKHS.pdf

Flyte: An Open Source Orchestrator for ML/AI Workflows - The New Stack

Build production-grade data and ML workflows, hassle-free with Flyte

Cloud-Native Geospatial Foundation

The Cloud-Native Geospatial Foundation is a forthcoming initiative from Radiant Earth created to increase adoption of highly efficient approaches to working with geospatial data in public cloud environments.

fast.ai - Mojo may be the biggest programming language advance in decades

Nostr, a simple protocol for decentralizing social media that has a chance of working

Lilian Weng’s updated The Transformer Family Version 2.0

Sam Kriss, in All the nerds are dead, conflates geeks and nerds, but is funny anyway

The General Theory of Employment, Interest and Money by John Maynard Keynes

The reasonable(?) effectiveness of data analysis

Why is it that we can be thrown into the work of other people, in a field we have zero experience in, and have any expectation of making any useful impact at all? When stated objectively, it sounds utterly ridiculous. But in my experience, a data team can find something to make an improvement on, even if the impact can sometimes be small.

Tackling Collaboration Challenges in the Development of ML-Enabled Systems: “I highlight the findings of a study on which I teamed up with colleagues Nadia Nahar (who led this work as part of her PhD studies at Carnegie Mellon University), Christian Kästner (also from Carnegie Mellon University) and Shurui Zhou (of the University of Toronto). The study sought to identify collaboration challenges common to the development of ML-enabled systems. Through interviews conducted with numerous individuals engaged in the development of ML-enabled systems, we sought to answer our primary research question: What are the collaboration points and corresponding challenges between data scientists and engineers? We also examined the effect of various development environments on these projects. Based on this analysis, we developed preliminary recommendations for addressing the collaboration challenges reported by our interviewees.”

Probability Is Not A Substitute For Reasoning – Ben Landau-Taylor

Self-Healing Concrete: What Ancient Roman Concrete Can Teach Us

Differentiating the discrete: Automatic Differentiation meets Integer Optimization | μβ

Information Transfer Economics: Organization of information equilibrium concepts

Serge Zaitsev,World’s smallest office suite

Annie Lowrey,We Haven’t Been Measuring How the Economy Really Works

Alex Komoroske,On Schelling Points in Organizations

Alex Komoroske,Coordination Headwind - How Organizations Are Like Slime Molds

Alternative to the tedious openhub workflow:analyzemyrepo.com | about

TIL: Apophenia vs Pareidolia

Jason Collins, We don’t have a hundred biases, we have the wrong model

Schimmack on Psychological Science and Real World Racism

If there are already smarter people around, how can I find good ideas?

Are Model Explanations Useful in Practice? Rethinking How to Support Human-ML Interactions.

Colossal-AI is designed to be a unified system to provide an integrated set of training skills and utilities to the user. You can find the common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithm. We also provided different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading can be found in this tutorial documentation in detail as well.

dynamicslab/pykoopman: A package for computing data-driven approximations to the Koopman operator.

Neural Transducer Training: Reduced Memory Consumption with Sample-wise Computation

Building Resilient Organizations: Toward Joy and Durable Power in a Time of Crisis

Is Anything Worth Maximizing? How metrics shape markets, how we’re… | by Joe Edelman

factorization_machine Something something kernels, something regression something interaction effects.
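To unpack that stub slightly: a degree-2 factorization machine models pairwise interaction effects through low-rank latent factors, which is what makes it both kernel-adjacent and regression-adjacent. A minimal sketch of the scoring function (my own notation, not from any particular library), using the standard O(dk) identity for the interaction term:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine score.

    x: (d,) feature vector; w0: scalar bias; w: (d,) linear weights;
    V: (d, k) latent factors, so the weight on x_i * x_j is <V[i], V[j]>.
    """
    s = V.T @ x  # (k,) per-factor sums of v_{if} x_i
    # O(dk) identity: sum_{i<j} <V[i],V[j]> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_{if} x_i)^2 - sum_i v_{if}^2 x_i^2 ]
    pair = 0.5 * (s @ s - np.sum((V**2).T @ (x**2)))
    return w0 + w @ x + pair

rng = np.random.default_rng(0)
d, k = 5, 2
x = rng.normal(size=d)
w0, w, V = 0.1, rng.normal(size=d), rng.normal(size=(d, k))
print(fm_predict(x, w0, w, V))
```

The latent factors let the model estimate an interaction weight for feature pairs that never co-occur in training, which is the selling point over an explicit quadratic regression.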

Facebook, Google Give Police Data to Prosecute Abortion Seekers

Cleanlab: “We publish research, develop open source tools, and design interfaces to help you improve the quality of your datasets and diagnose various issues in them.” See their blog, e.g. ActiveLab: Active Learning with Data Re-Labeling

TL; DR—In-context learning is a mysterious emergent behavior in large language models (LMs) where the LM performs a task just by conditioning on input-output examples, without optimizing any parameters. In this post, we provide a Bayesian inference framework for understanding in-context learning as “locating” latent concepts the LM has acquired from pretraining data. This suggests that all components of the prompt (inputs, outputs, formatting, and the input-output mapping) can provide information for inferring the latent concept. We connect this framework to empirical evidence where in-context learning still works when provided training examples with random outputs. While output randomization cripples traditional supervised learning algorithms, it only removes one source of information for Bayesian inference (the input-output mapping).

Bayesian Neural Networks by Duvenaud’s team

Rohit, People always put their money in futures they predict

What have we seen so far? People didn’t use to have much disposable income to invest a century ago. When they did, or rather those who did, invested their savings mostly in land or (if they were rich enough) businesses, or commodities.

Where should I invest my money is a relatively old question, but until recently it wasn’t a very interesting question. This is because until recently the answers were understood, but not that actionable. The futures would get better, things would get built, and you could ride optimism as a thesis if you could find a way how. The avenues available were extremely limited, and the optionality you had was minimal.

Olúfẹ́mi O. Táíwò, Identity Politics and Elite Capture

What’s the difference between a tutorial and how-to guide? - Diátaxis

Instagram, TikTok, and the Three Trends

the company correctly intuited a significant gap between its users’ stated preference — no News Feed — and their revealed preference, which was that they liked News Feed quite a bit. The next fifteen years would prove the company right.

Kedro | A Python framework for creating data science code / Kedro Frequently asked questions. Kedro rationale by Joel Schwarzmann: The importance of layered thinking in data engineering

Darts:

Prof Steve Keen | Creating realistic economics for the post-crash world

Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. But using these LLMs in isolation is often not enough to create a truly powerful app - the real power comes when you are able to combine them with other sources of computation or knowledge.

This library is aimed at assisting in the development of those types of applications.

How did places like Bell Labs know how to ask the right questions?

Color Oracle simulates color blindness for accessibility of visualisations and plots etc

darrenjw/fp-ssc-course: An introduction to functional programming for scalable statistical computing

Do organizations have to get slower as they grow? (with Alex Komoroske)

Kolibri is an open-source educational platform specially designed to provide offline access to a wide range of quality, openly licensed educational resources in low-resource contexts like rural schools, refugee camps, orphanages, and also in non-formal school programs.

What even are GFlowNets?

Team Silverblue — About packaged apps for Fedora

The Adaptable Linux Platform Guide. Packaged apps for SUSE.

The Carr–Madan formula is really just a special case of a Taylor expansion. For completeness, let’s rederive the Taylor expansion with an integral remainder.
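A sketch of that rederivation, standard material written here from memory rather than quoted from the post:

```latex
% Taylor's theorem with integral remainder, expanding f about a:
f(x) = f(a) + f'(a)(x - a) + \int_a^x f''(u)\,(x - u)\,\mathrm{d}u .
% Take a = \kappa and split the remainder into put- and call-shaped
% pieces; for S \ge 0 this yields the Carr--Madan decomposition:
f(S) = f(\kappa) + f'(\kappa)(S - \kappa)
     + \int_0^{\kappa} f''(K)\,(K - S)^{+}\,\mathrm{d}K
     + \int_{\kappa}^{\infty} f''(K)\,(S - K)^{+}\,\mathrm{d}K .
```

Read financially: any twice-differentiable payoff \(f(S)\) decomposes into a bond, a forward, and a continuum of puts (strikes below \(\kappa\)) and calls (strikes above \(\kappa\)) weighted by \(f''(K)\).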

When explaining becomes a sin—by Tom Stafford. File under taboos and Tetlock and compassion/comprehension.

Karloo Pools and the hidden alternative swimming spots nearby—Walk My World

Cult Classic ’Fight Club’ Gets a Very Different Ending in China

A Turkish Farmer Tests Out VR Goggles on Cows To Get More Milk

How to buy a social network, with Tumblr CEO Matt Mullenweg—The Verge

Fake Feelings—ai emo. When post-hardcore emo band Silverstein… | by Dadabots—Medium

Why the super rich are inevitable

Meanwhile, the richer player will gain money. That’s because, from their perspective, every game they lose means they have an opportunity to win it back—and then some—in the next coin flip. Every game they win means, no matter what happens in the next coin flip, they’ll still be at a net-plus.

Repeat this process millions of times with millions of people, and you’re left with one very rich person.
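That dynamic is easy to simulate. A minimal sketch of a rich-get-richer exchange model, in the spirit of the standard “yard-sale” model (the parameter choices and the exact transfer rule are my assumptions, not taken from the article):

```python
import random

def yard_sale(wealth, rounds=100_000, frac=0.2, seed=0):
    """Repeatedly pick two agents at random, flip a fair coin, and move
    a fixed fraction of the *poorer* agent's wealth to the winner."""
    rng = random.Random(seed)
    w = list(wealth)
    for _ in range(rounds):
        i, j = rng.sample(range(len(w)), 2)
        stake = frac * min(w[i], w[j])
        if rng.random() < 0.5:
            w[i], w[j] = w[i] + stake, w[j] - stake
        else:
            w[i], w[j] = w[i] - stake, w[j] + stake
    return w

# 50 agents who all start equal; total wealth is conserved, but the
# distribution typically ends up wildly concentrated.
final = yard_sale([100.0] * 50)
```

Even though every individual flip is fair, staking a fraction of the poorer player's wealth makes poverty nearly absorbing, which is the article's point.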

Pluralistic: Tiktok’s enshittification (21 Jan 2023) – Pluralistic: Daily links from Cory Doctorow

Pluralistic: EU to Facebook, ’Drop Dead’ (07 Dec 2022) – Pluralistic: Daily links from Cory Doctorow

In Which Long-Time Netizen & Programmer-at-Arms Dave Winer Records a Podcast for Me, Personally

DRMacIver’s Notebook: Three key problems with Von-Neumann Morgenstern Utility Theory

The first part is about physical difficulties with measurement—you can only know the probabilities up to some finite precision. VNM theory handwaves this away by saying that the probabilities are perfectly known, but this doesn’t help you, because that just moves the problem to a computational one, and requires you to be able to solve the halting problem. E.g. choose between \(L_1 = pB + (1-p)W\) and \(L_2 = qB + (1-q)W\), where \(p = 0.0\ldots\) until machine \(M_1\) halts and 1 after, and \(q\) is the same but for machine \(M_2\).

The second demonstrates that what you get out of the VNM theorem is not a utility function. It is an algorithm that produces a sequence converging to a utility function, and you cannot recreate even the original decision procedure from that sequence without being able to take the limit (which requires running an infinite computation, again giving you the ability to solve the halting problem) near the boundary.

Supervised Training of Conditional Monge Maps—Apple Machine Learning Research

How To Be an Academic Hyper-Producer—Economics from the Top Down

A global analysis of matches and mismatches between human genetic and linguistic histories—PNAS

Desmos—Let’s learn together. graphing calculator online

The Cause of Depression Is Probably Not What You Think—Quanta Magazine

What Monks Can Teach Us About Paying Attention—The New Yorker

Actually, Japan has changed a lot—by Noah Smith. Japanese real estate is surprising.

One Useful Thing (And Also Some Other Things) | Ethan Mollick—Substack

The radical idea that people aren’t stupid, paired with How to achieve self-control without “self-control”

Colonialism did not cause the Indian famines—History Reclaimed

Erik van Zwet, Shrinkage Trilogy Explainer, on modelling the publication process

Mathematics of the impossible: Computational Complexity—Thoughts

Download the Atkinson Hyperlegible Font—Braille Institute. “What makes it different from traditional typography design is that it focuses on letterform distinction to increase character recognition, ultimately improving readability. We are making it free for anyone to use!”

Low-Rank Approximation Toolbox: Nyström Approximation—Ethan Epperly

-ise or -ize? Is -ize American? (1/3) – Jeremy Butterfield Editorial

Iron deficiencies are very bad and you should treat them—Aceso Under Glass

The Australian academic STEMM workplace post-COVID: a picture of disarray

torchgeo—torchgeo 0.3.1 documentation / microsoft/torchgeo: TorchGeo: datasets, samplers, transforms, and pre-trained models for geospatial data

Merve Emre, Has Academia Ruined Literary Criticism?

Matt Clancy, Age and the Nature of Innovation: “Are there some kinds of discoveries that are easier to make when young, and some that are easier to make when older?”

Tom Stafford, Microarguments and macrodecisions

Kevin Munger, Why I am (Still) a Conservative (For Now)

Kevin Munger, Facebook is Other People

Randy Au, in Data science has a tool obsession, talks about Gear Acquisition Syndrome for data scientists.

Clive Thompson, The Power of Indulging Your Weird, Offbeat Obsessions

omg.lol - A lovable web page and email address, just for you

Donate to a highly effective charity - Effective Altruism Australia. Focussed on poverty and health interventions.

What are the best charities to donate to in 2023? · Giving What We Can

karpathy/nanoGPT: The simplest, fastest repository for training/finetuning medium-sized GPTs.

What is the “forward-forward” algorithm, Geoffrey Hinton’s new AI technique?

Simon Willison, AI assisted learning: Learning Rust with ChatGPT, Copilot and Advent of Code

Fission: Build the future of web apps at the edge incubates several decentralized protocols

danah boyd, What if failure is the plan? I’ve been thinking a lot about failure…

Mastodon—and the pros and cons of moving beyond Big Tech gatekeepers

Michael Nielsen on science online

Great bloggers are rare, weird, and not team players – Kevin Drum

Swayable: RCTs for marketing campaigns via ingenious audience recruiting network

Zoomers Co-Working Community (co-working for accountability)

Normconf Lightning Talks / Normconf: The Normcore Tech Conference — a conference on the stuff that we actually need to do in ML, as opposed to the stuff we would like to pretend is what we do.

Jean Gallier and Jocelyn Quaintance, Algebra, Topology, Differential Calculus, and Optimization Theory for Computer Science and Machine Learning, 2188 pages as of 2022/10/30, and growing.

Terence Eden, You can have user accounts without needing to manage user accounts

Adam Mastroianni and EJ Ludwin-Peery, Things could be better

Adam Mastroianni, The great myths of political hatred

Big correlations and big interactions ([2105.13445] The piranha problem: Large effects swimming in a small pond)

How to keep cakes moist and cause the greatest tragedies of the 20th century

Distribution testing

GPflow/GeometricKernels: Geometric kernels on manifolds, meshes and graphs

George Ho, How to Improve Your Static Site's Typography (for code formatting)

Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model

Microsoft CSR’s Law Enforcement Request Report is disconcerting transparency

Marc ten Bosch, Let's remove Quaternions from every 3D Engine (An Interactive Introduction to Rotors from Geometric Algebra)

Michele Coscia, Meritocracy vs Topocracy

oxcsml/riemannian-score-sde: Score-based generative models for compact manifolds

Public-facing Censorship Is Safety Theater, Causing Reputational Damage

Ti John’s Publications

Starboard, a shareable in-browser notebook that runs Python (!)

Students Are Using AI to Write Their Papers, Because Of Course They Are

Treehugger Introduces a Modern Pyramid of Energy Conservation

Vast.ai “Rent Cloud GPU Servers for Deep Learning and AI”

Adam Mastroianni, Things could be better

Michael Burnam-Fink, What is Scientific about Data Science?

Christian Lawson-Perfect’s Interesting Esoterica is a collection of weird papers in maths.

Erik Hoel, Why do most popular science books suck?

Étienne Fortier-Dubois, The Vibes Are Off

George Ho, Understanding NUTS and HMC

Gordon Brander, Coevolution creates living complexity

Gordon Brander, Thinking together, on egregores, Dunbar numbers and information-processing thresholds in Holocene social evolution, all to motivate

Kate Mannell and Eden T. Smith, Alternative Social Media and the Complexities of a More Participatory Culture: A View From Scuttlebutt

Peter Woit, Symmetry and Physics

Rob J Hyndman, We need more open data in Australia

Vicki Boykis, How I learn machine learning

Oshan Jarow, Markets Underinvest In Vitality

Spirals of Delusion: How AI Distorts Decision-Making and Makes Dictators More Dangerous (not convinced tbh)

Erik Hoel, The gossip trap

The Developer Certificate of Origin is a great alternative to a CLA

I. Risk Management Foundations - Machine Learning for Financial Risk Management with Python [Book]

jkbren/einet: Uncertainty and causal emergence in complex networks

Darren Wilkinson’s Bayesian inference for a logistic regression model, parts 1, 2, 3, 4, 5

Book Review: Public Choice Theory And The Illusion Of Grand Strategy

Stephen Malina — Deriving the front-door criterion with the do-calculus

Census is a tool which links all the weird different data storage systems and CRM stuff

Michael Lewis podcast on illegible experts

Nemanja Rakicevic, NeurIPS Conference: Historical Data Analysis

Yanir Seroussi, The mission matters: Moving to climate tech as a data scientist

Keir Bradwell, #1: In-group Cheems

Samuel Moore, Why open science is primarily a labour issue

Adam Mastroianni, Against All Applications

Have The Effective Altruists And Rationalists Brainwashed Me?

Anthony Lee Zhang, The War for Eyeballs

Digital artists’ post-bubble hopes for NFTs don’t need a blockchain

Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation

I Do Not Think It Means What You Think It Means: Artificial Intelligence, Cognitive Work & Scale

ClearerThinking.org’s courses, e.g.

- Introduction to Decision Academy: The Science of Better Decisions
- Rhetorical Fallacies: Dodging Argument Traps
- Learning from Mistakes: A Systematic Approach
- Probabilistic Fallacies: Gauging the Strength of Evidence
- Explanation Freeze: Interpreting Uncertain Events
- Aspire: A Tool to Help You Improve Your Life
- The Sunk Cost Fallacy: Focusing on the Future

Reddit for AI-generated and manipulated content

PJ Vogt, Selling Drugs to Buy Crypto

Michele Coscia, Pearson Correlations for Networks

The DAIR Institute “The Distributed AI Research Institute is a space for independent, community-rooted AI research, free from Big Tech’s pervasive influence.”

Machine Learning Trick of the Day (1): Replica Trick — Shakir Mohamed

Machine Learning Trick of the Day (7): Density Ratio Trick — Shakir Mohamed

ApplyingML - Papers, Guides, and Interviews with ML practitioners

Ryan Broderick, We were the unpaid janitors of a bloated tech monopoly

fastdownload: the magic behind one of the famous 4 lines of code · fast.ai

Schneier, When AIs Start Hacking

Multimodal Neurons in Artificial Neural Networks / Distill version of Multimodal Neurons in Artificial Neural Networks

Francis Bach, Going beyond least-squares – II: Self-concordant analysis for logistic regression

On the Generalization Ability of Online Strongly Convex Programming Algorithms

Bookmarked but where will they ever go?

Dispel your justification-monkey with a “HWA!” - Malcolm Ocean

Roger’s Bacon, Living and Dying with a Mad God

washable & breathable flexiOH cast adapts to the patient’s skin

‘We can continue Pratchett’s efforts’: the gamers keeping Discworld alive

AO3’s 15-year journey from blog post to fanfiction powerhouse - The Verge

today I took a desk lamp whose halogen light had burned out, whose crappy transformer always made those bulbs sputter, and whose mildly art-deco appearance I’d always liked, and converted it to run an LED bulb off USB power. It took about an hour’s work to replace the light with an LED and the switch with a nice heavy clicky one, and now the whole thing runs off USB-C instead of wall voltage. It emits no appreciable heat and, if these calculations are to be believed, will run for decades for a few cents per year, assuming I leave it on all the time.

I hadn’t really appreciated how big a deal USB-PD voltage negotiation was until I found out that the little chips that handle that negotiation are about the size of the end of a pencil, and that if you include the USB-C port, you can replace basically any low-voltage transformer with something smaller than a quarter.

The magic search string, if you want to try this yourself, is “usb-pd trigger module”.
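The running-cost arithmetic is one line. A sketch, with the wattage and tariff as explicit assumptions (my numbers, not measurements of the actual lamp):

```python
def annual_cost(watts, price_per_kwh=0.30, hours=24 * 365):
    """Energy cost of running a device continuously for a year.

    watts and price_per_kwh are assumptions; adjust them for your
    bulb and your electricity tariff.
    """
    kwh = watts * hours / 1000  # continuous draw over a year, in kWh
    return kwh * price_per_kwh

# e.g. a small always-on USB LED drawing 0.5 W at $0.30/kWh:
print(f"${annual_cost(0.5):.2f} per year")  # prints "$1.31 per year"
```

Whether you land in “cents” or “a dollar or two” per year depends almost entirely on the wattage you assume; either way it rounds to negligible.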

vscode-paste-image/README.md at master · mushanshitiancai/vscode-paste-image

mhoye/awesome-falsehood: 😱 Falsehoods Programmers Believe in

Gary Brecher, The War Nerd: Taiwan — The Thucydides Trapper Who Cried Woof

Evidence of Fraud in an Influential Field Experiment About Dishonesty. Looks bad for Dan Ariely. Damn.

on programming humans (Amir’s work)

Communications' digital initiative and its first digital event

Playable Half Earth Socialism simulator

flatmax/vector-synth: Old 2002 era vector synth code based on XFig

Nick Chater, Would you Stand Up to an Oppressive Regime?

Lambda School’s Job Placement Rate May Be Far Worse Than Advertised

I would like to read the diaries of Usama ibn Munqidh

The latest target of China’s tech regulation blitz: algorithms

State Power and the Power Law, State Power and the Power Law 2

Yuling Yao, The likelihood principle in model check and model evaluation: “We are (only) interested in estimating an unknown parameter \(\theta\), and there are two data generating experiments both involving \(\theta\), with observable outcomes \(y_1\) and \(y_2\) and likelihoods \(p_1\left(y_1 \mid \theta\right)\) and \(p_2\left(y_2 \mid \theta\right)\). If the outcome-experiment pair satisfies \(p_1\left(y_1 \mid \theta\right) \propto p_2\left(y_2 \mid \theta\right)\) (viewed as a function of \(\theta\)), then these two experiments and two observations will provide the same amount of information about \(\theta\).”
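The classic concrete instance of that proportionality is a binomial experiment versus a negative-binomial one: 3 successes in 12 fixed trials, versus sampling until the 3rd success, which arrives on trial 12. The likelihoods differ only by a constant factor, so by the principle they carry the same information about \(\theta\). A quick numerical check (my own toy numbers):

```python
from math import comb

def binom_lik(theta, n=12, y=3):
    """Binomial experiment: fix n trials, observe y successes."""
    return comb(n, y) * theta**y * (1 - theta)**(n - y)

def negbinom_lik(theta, y=3, n=12):
    """Negative binomial experiment: sample until the y-th success,
    which happens to arrive on trial n (the last trial is a success)."""
    return comb(n - 1, y - 1) * theta**y * (1 - theta)**(n - y)

# As functions of theta the two likelihoods differ only by a constant
# factor, so they are proportional:
ratios = [binom_lik(t) / negbinom_lik(t) for t in (0.1, 0.3, 0.5, 0.7)]
print(ratios)  # all equal comb(12, 3) / comb(11, 2) = 220 / 55 = 4.0
```

Any inference that respects the likelihood principle (e.g. the Bayesian posterior) is identical under the two experiments; frequentist p-values, which depend on the stopping rule, are not.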

Liquid Information Flow Control, a confidential computing DSL

Jag Bhalla, Vaccine Greed: Capitalism Without Competition Isn’t Capitalism, It’s Exploitation

Kostas Kiriakakis, A Day At The Park

By analyzing medical text and extracting biomedical entities and relations from the entire history of published medical science, Xyla can facilitate better real-world evidence-based clinical decision support and help make clinical research—such as research into new treatments, including de novo drug design as well as the repurposing of existing drugs—smarter and faster. In so doing, Xyla is fulfilling its mission of organizing the world’s medical knowledge and making it more useful.

My2050 calculator - create your pathway for the UK to be net zero by 2050

Is Pandemic Stress to Blame for the Rise in Traffic Deaths? Nope; apparently it is decreased congestion making drivers drive faster on shit roads.

Marisa Abrajano has a thought-provoking list of research topics. I would like to read the work to see her methodology.

Do normal people need to know or care about “the metaverse”?

Apple acquires song-shifting startup AI Music, here’s what it could mean for users

Black Americans are pessimistic about their position in U.S. society

Smart technologies | Internet Policy Review

Speaking of ‘smart’ technologies we may avoid the mysticism of terms like ‘artificial intelligence’ (AI). To situate ‘smartness’ I nevertheless explore the origins of smart technologies in the research domains of AI and cybernetics. Based in postphenomenological philosophy of technology and embodied cognition rather than media studies and science and technology studies (STS), the article entails a relational and ecological understanding of the constitutive relationship between humans and technologies, requiring us to take seriously their affordances as well as the research domain of computer science. To this end I distinguish three levels of smartness, depending on the extent to which they can respond to their environment without human intervention: logic-based, grounded in machine learning or in multi-agent systems. I discuss these levels of smartness in terms of machine agency to distinguish the nature of their behaviour from both human agency and from technologies considered dumb. Finally, I discuss the political economy of smart technologies in light of the manipulation they enable when those targeted cannot foresee how they are being profiled.

Concurrent programming, with examples

Mention concurrency and you’re bound to get two kinds of unsolicited advice: first that it’s a nightmarish problem which will melt your brain, and second that there’s a magical programming language or niche paradigm which will make all your problems disappear.

We won’t run to either extreme here. Instead we’ll cover the production workhorses for concurrent software – threading and locking – and learn about them through a series of interesting programs. By the end of this article you’ll know the terminology and patterns used by POSIX threads (pthreads).
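The same workhorses exist in Python’s standard library. A minimal sketch of the thread-and-lock pattern (my example, not from the article, which uses C and pthreads):

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        # `counter += 1` is a read-modify-write; without the lock,
        # increments from different threads can interleave and be lost.
        with lock:
            counter += 1

threads = [threading.Thread(target=add, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000, deterministically, thanks to the lock
```

Delete the `with lock:` line and (on interpreters without per-op atomicity guarantees) the total can come out short; that lost-update bug is exactly what the article’s pthreads mutex examples guard against.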

A study of lights at night suggests dictators lie about economic growth

DIY Collective Embeds Abortion Pill Onto Business Cards, Distributes Them At Hacker Conference

Penny Wyatt, Developer Innovation and the Free Puppy

Elizabeth Van Nostrand, A Quick Look At 20% Time

Chalk is a non-terrible calculator for macos, incorporating useful things like matrices and bitwise ops

Scuttlebutt is a P2P-flavoured DIY social media with critical mass amongst a certain type of cryptopunk. To be precise, “Scuttlebutt” is shorthand for a complex ecology of pieces making up the “scuttleverse”, most of which, as consumers, we can ignore.

Influential developer André Staltz explains the value proposition. The flagship applications seem to be

- Manyverse. (open and running mobile app for iOS and Android; the iOS app is a little crashy.)
- planetary. (up and running)
- patchfox (browser extension)
- ssbc/patchwork: A decentralized messaging and sharing app built on top of Secure Scuttlebutt (SSB).

This started as the übergeek social network for survivalists. Run it from your bugout yacht after a climate apocalypse, while malevolent totalitarian states try to censor your messages and steal your stockpiled tinned food and/or vaccinate you with singulitarian COVID nanobots! Explicitly:

Scuttlebutt is decentralized in a similar way that Bitcoin or BitTorrent are. Unlike centralized systems like PayPal or Dropbox, there is no single website or server to connect when using decentralized services. Which in turn means there is no single company with control over the network.

However, Scuttlebutt differs from Bitcoin and BitTorrent because there are no “singleton components” in the network. When accessing the BitTorrent network, for instance, you need to connect to a Distributed Hash Table [for which] you need to connect to a bootstrapping server [and] still depend on the existence of ISPs and the internet backbone. …

Secure Scuttlebutt is also different to federated social networks like Mastodon, Diaspora, GNU social, OStatus. Those technologies are not peer-to-peer, because each component is either a server or a client, but not both. Federated social networks are slightly better than centralized services like Facebook because they provide some degree of choice where your data should be hosted. However, there is still trust and dependency on third-party servers and ISPs, which makes it possible for administrators of those to abuse their power, through content policies, privacy violations or censorship.

In Scuttlebutt, the “mesh” suffices. With simply two computers, a local router, and electricity, you can exchange messages between the computers with minimal effort and no technical skills. Each account in Scuttlebutt is a diary (or “log”) of what a person has publicly and digitally said. As those people move around between different WiFi / LAN networks, their log gets copy-pasted to different computers, and so digital information spreads.

What word of mouth is for humans, Scuttlebutt is for social news feeds. It is unstoppable and spreads fast.
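The append-only log idea in that quote can be sketched in a few lines. This is a hypothetical toy, not SSB’s actual wire format: real SSB entries are also signed with the author’s ed25519 key, whereas this sketch only hash-chains entries so that a replicated copy of the log can be checked for tampering.

```python
import hashlib
import json

# Toy SSB-style append-only feed: each entry records the hash of its
# predecessor, so a log copy-pasted between machines can be verified.

def append(log, content):
    prev = log[-1]["id"] if log else None
    entry = {"prev": prev, "seq": len(log), "content": content}
    # The entry id is a hash over the canonicalised body.
    entry["id"] = hashlib.sha256(
        json.dumps({k: entry[k] for k in ("prev", "seq", "content")},
                   sort_keys=True).encode()
    ).hexdigest()
    log.append(entry)
    return log

def verify(log):
    """Check every entry's hash and its link to the previous entry."""
    for i, entry in enumerate(log):
        body = {k: entry[k] for k in ("prev", "seq", "content")}
        ok = entry["id"] == hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        linked = entry["prev"] == (log[i - 1]["id"] if i else None)
        if not (ok and linked):
            return False
    return True
```

Any edit to an old entry breaks the chain from that point on, which is the property that lets peers gossip logs around without trusting the intermediaries.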

In practical terms: the main backend bit, which users can probably ignore, is the distributed data store,scuttlebot. On top of that you run user-facing apps likepatchwork (the proof-of-concept reference app), Manyverse or Planetary.There are many although not all are sustainable or viable.

~~A cypherlink to my experimental scuttlebutt profile:~~`@GccFBnmWOl2IB5l1rJjEZX9J4T8jLgmDQcAON5mzBOY=.ed25519`

although I have not used it often because the sync has been too slow for any real conversation.

SSB does not replace Twitter. Staltz:

Let's be frank, we're building decentralized social systems, and if you've been following this space recently, you probably know about Bluesky or Nostr. Bluesky in particular has been getting a lot of attention, and I think it's important to know what goals PPPPP is trying to achieve versus what goals Bluesky or Nostr are trying to achieve.

SSB is not a very good replacement for Twitter, and neither will PPPPP be. On the other hand, it seems like Bluesky and Nostr are obvious Twitter alternatives. So why are we building PPPPP? What is the point?

One thing is social media, a place where anyone can join any conversation (this is a feature as much as it's a bug!), a place where you talk at people and build reputation or viral content, and a dangerous place that puts people at risk of seeing offensive content or direct harassment.

Another thing are social networks, a place where only people who know each other participate in conversations (this is a feature as much as it's a bug!), a place where you talk to people and build relationships, and a relatively safe place defined by people you know and acquaintances.

I've been following how Bluesky is evolving, and while technically impressive, built by a brilliant team who is closely familiar with SSB, I am not excited by the prospect of decentralized social media. They are having to hire content reviewers, do a lot of centralized moderation, and both Nostr and Bluesky are easily flooded by a wall of unsolicited nudes. This is an inherent property of social media, be it centralized or decentralized.

Rooms are where clients can meet, handshake and exchange data. A room is a minimal viable internet presence for connecting disparate clients via an always-on node.

Explanation available in Announcing: SSB Rooms. The original software mentioned in that post is broken in various ways. Instead, use ssb-ngi-pointer/go-ssb-room: Room server implemented in Go.

Alternatively, one could use a VPN tunnel to connect to an always-on standard client on a device that was not on the internet e.g. on a home network.

- klarkc/ssb-bot-feed: Scuttlebutt bot that read RSS feeds and post updates automatically
- marine-master/ssb-bot-feed: Scuttlebutt bot that read RSS feeds and post updates automatically
- Docker image/sanity check/tutorial.
- PeachCloud
- fraction/oasis: Free, open-source, peer-to-peer social application that helps you follow friends and discover new ones on Secure Scuttlebutt (SSB).

**tl;dr**
I do a lot of data processing, and not so much running of websites and such.
This is not the typical target workflow for a database, at least not as imagined in enterprise database focus groups.
So, here are some convenient databases for my needs: working at a particular,
sub-Google, scale, where my datasets are a few gigabytes but never a few
terabytes, and capturing stuff like experiment data, data processing pipelines and that kind of thing.
Not covered: processing credit card transactions, running websites, etc.

Short list of things I have used for various purposes:

- Pre-sharded or non-concurrent-write datasets too big for RAM? Do not bother with a database *per se*. Instead stash the stuff in some files formatted as e.g. `hdf5`. Annoying schema definition needed, but once that’s done it’s fast for numerical data. Not a database *application* per se, but a structured data format that fills a similar niche.
- Pre-sharded or non-concurrent-write datasets: `sqlite`. Has great tooling. A pity that it’s not super-high performance for numerical data (but for structured, relational data it is pretty good).
- Concurrent but not incessant writes (i.e. I just want to manage my processes): maybe dogpile.cache or `joblib`, or perhaps use Berkeley DB, which has lock support.
- Concurrent frequent writes: redis.
- Honourable mention: if I want a low-fuss searchable index over some structured content from a python script, TinyDB!
- Honourable mention: pre-sharded or non-concurrent-write datasets shared between python/R/Go/Rust/JS/etc: `arrow`. This is used in all kinds of fancy tooling, but I have not needed any of it yet.
- Sometimes I want to quickly browse a database to see what is in it. For that I use DB UIs.
- I want to share my data: Data sharing.

Point of contact here: Data Lakes, big stores of (usually?) somewhat tabular data which are not yet normalized into data tables and are implicitly large. I don’t know much about those but see, e.g. Introducing Walden.

Maybe one could get a bit of perspective on the tools here from write-ups such as Luc Perkins’s Recent database technology that should be on your radar.

OK, longer notes begin here:

With a focus on slightly specialised data stores for use in my statistical jiggery-pokery. Which is to say: I care about analysing lots of data fast. This is probably inimical to running, e.g., your webapp from the same store, which has different requirements (massively concurrent writes, consistency guarantees, many small queries instead of few large ones). Don’t ask me about that.

I would prefer to avoid running a database *server* at all if I can, at least in the sense of a highly specialized multi-client server process.
Those are not optimised for a typical scientific workflow.
First stop is in-process non-concurrent-write data storage
e.g. HDF5 or sqlite.

However, if I want to mediate between lots of threads/processes/machines updating my data in parallel, a “real” database server might be justified.

OTOH if my data is big enough, perhaps I need a crazy giant distributed store of some kind? Requirements change vastly depending on scale.

Unless my data is enormous, or I need to write to it concurrently, this is what I want, because

- no special server process is required and
- migrating data is just copying a file

But how to encode the file? See data formats.

Want to handle floppy ill-defined documents of ill-specified possibly changing
metadata? Already resigned to the process of querying and processing this stuff
being depressingly slow and/or storage-greedy?
I must be looking for *document stores*!

If I am looking at document stores as my primary workhorse, as opposed to something I want to get data out of for other storage, then I have

- Not much data so performance is no problem, or
- a problem, or
- a big engineering team.

Let’s assume number 1, which is common for me.

`Mongodb` has a pleasant JS API but is notoriously not all that good at concurrent storage.
If my data is effectively single-writer I could just be doing this from the filesystem. Still, I can imagine scenarios where the dynamic indexing of post hoc metadata is nice, for example in the exploratory phase with a data subset?

Since it went closed-source, it is worth knowing there is an open-source alternative, FerretDB 1.0 GA.

`Couchdb` was the pin-up child of the current crop of non-SQL-based databases, but seems to be unfashionable rn?
A big ecosystem of different implementations of the core DB for different purposes, all of which promise to support replication into eventually-consistent clusters.

`kinto` “is a lightweight JSON storage service with synchronisation and sharing abilities. It is meant to be easy to use and easy to self-host. Supports fine permissions, easy host-proof encryption, automatic versioning for device sync.”

I can imagine distributed data analysis applications.

`lmdb` looks interesting if you want a simple store that just guarantees you can write to it without corrupting data, and without requiring a custom server process. Most efficient for small records (2K).

LMDB is a tiny database with some excellent properties:

- Ordered map interface (keys are always lexicographically sorted).
- Reader/writer transactions: readers don’t block writers, writers don’t block readers. Each environment supports one concurrent write transaction.
- Read transactions are extremely cheap.
- Environments may be opened by multiple processes on the same host, making it ideal for working around Python’s GIL.
- Multiple named databases may be created with transactions covering all named databases.
- Memory mapped, allowing for zero copy lookup and iteration. This is optionally exposed to Python using the `buffer()` interface.
- Maintenance requires no external process or background threads.
- No application-level caching is required: LMDB fully exploits the operating system’s buffer cache.


- **Unstructured data**: QMiner provides support for unstructured data, such as text and social networks, across the entire processing pipeline, from feature engineering and indexing to aggregation and machine learning.
- **Search**: QMiner provides out-of-the-box support for indexing, querying and aggregating structured, unstructured and geospatial data using a simple query language.
- **JavaScript API**: QMiner applications are implemented in JavaScript, making it easy to get started. Using the JavaScript API it is easy to compose complete data processing pipelines and integrate with other systems via RESTful web services.
- **C++ library**: QMiner is implemented in C++ and can be included as a library into custom C++ projects, thus providing them with stream processing and data analytics capabilities.

Berkeley DB is a venerable key-value store that is no longer fashionable. However, it is efficient for storing binary data, and supports multi-process concurrency via lock files, all without using a server process. As such it may be useful for low-fuss HPC data storage and processing. There are, e.g., python bindings.

Long lists of numbers? Spreadsheet-like tables?
Wish to do queries mostly of the sort supported by database engines,
such as grouping, sorting and range queries?
First stop is `sqlite`, if it fits in memory, in the sense of the part I am mostly using mostly fitting in memory.
Note that if I have tabular data but do not particularly wish to perform diverse RDBMS-style queries, then I should just use HDF5 or some other simple disk data store.

🏗 how to write safely to sqlite from multiple processes through write locks. Also see Mark Litwintschik’s Minimalist Guide to SQLite.
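Until then, a hedged sketch of the usual incantations: WAL journal mode lets readers coexist with a writer, and a busy timeout makes competing writers wait instead of erroring out. The file name and table here are invented for illustration.

```python
import os
import sqlite3
import tempfile

# Hypothetical experiment-results table in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "results.db")

conn = sqlite3.connect(path, timeout=10.0)   # wait up to 10 s on a locked db
conn.execute("PRAGMA journal_mode=WAL")      # readers no longer block the writer
conn.execute("CREATE TABLE IF NOT EXISTS runs (name TEXT, loss REAL)")
with conn:                                   # context manager = one transaction
    conn.execute("INSERT INTO runs VALUES (?, ?)", ("exp1", 0.25))
rows = conn.execute("SELECT * FROM runs").fetchall()
conn.close()
```

With several worker processes doing the same dance, each insert either commits atomically or retries within the timeout, which is usually enough for experiment logging.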

If not, or if I need to handle concurrent writing by multiple processes, we need one of the classic RDBMS servers, e.g. MySQL or Postgres. Scientific use cases are not usually like this; we are not usually concurrently generating lots of data.

TBD

MariaDB server is a community developed fork of MySQL server. Started by core members of the original MySQL team, MariaDB actively works with outside developers to deliver the most featureful, stable, and sanely licensed open SQL server in the industry.

Classic. What I tended to use because it has powerful embedded scripting and good support for spatial data.

Huawei’s postgres fork is openGauss.

Maybe we can make numerical work easier using `Blaze`?

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.

More generally, `records`, which wraps `tablib` and sqlalchemy, is good at this.

Julia Evans points out sqlite-utils, a tool that magically converts JSON to sqlite.
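A stdlib-only sketch of the core trick (sqlite-utils itself does far more: type inference, upserts, nested JSON). The table name and records below are invented:

```python
import json
import sqlite3

# Hypothetical JSON payload; columns are inferred from the first record.
records = json.loads('[{"name": "a", "score": 1}, {"name": "b", "score": 2}]')

conn = sqlite3.connect(":memory:")
cols = list(records[0])
conn.execute(f"CREATE TABLE t ({', '.join(cols)})")
conn.executemany(
    f"INSERT INTO t VALUES ({', '.join('?' for _ in cols)})",
    [tuple(r[c] for c in cols) for r in records],
)
total = conn.execute("SELECT sum(score) FROM t").fetchone()[0]
```

Once the rows are in, all the grouping/sorting/range machinery of SQL is available over what started life as a JSON blob.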

Also covered under data versioning.

Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository. Connect to Dolt just like any MySQL database to run queries or update the data using SQL commands. Use the command line interface to import CSV files, commit your changes, push them to a remote, or merge your teammate’s changes.

All the commands you know for Git work exactly the same for Dolt. Git versions files, Dolt versions tables. It’s like Git and MySQL had a baby.

We also built DoltHub, a place to share Dolt databases. We host public data for free. If you want to host your own version of DoltHub, we have DoltLab. If you want us to run a Dolt server for you, we have Hosted Dolt.

Ever since google, every CS graduate wants to write one of these. There are dozens of options; you probably need none of them.

I have used none of them and only mention them here to keep them straight in my head.

Hbase for Hadoop (original hip open source one, no longer hip)

Hypertable is Baidu’s open competitor to Google’s internal database.

[…] is a networking and distributed transaction layer built atop SQLite, the fastest, most reliable, and most widely distributed database in the world.

Bedrock is written for modern hardware with large SSD-backed RAID drives and generous RAM file caches, and thereby doesn’t mess with the zillion hacky tricks the other databases do to eke out high performance on largely obsolete hardware. This results in fewer esoteric knobs, and sane defaults that “just work”.

datalog seems to be a protocol/language designed for largish stores, with implementations such as datomic getting good press for being scalable. Read this tutorial and explain it to me.

Build flexible, distributed systems that can leverage the entire history of your critical data, not just the most current state. Build them on your existing infrastructure or jump straight to the cloud.

`orbitdb` is not necessarily giant (I mean, I don’t know how it scales) but it is convenient for offline/online syncing and definitely distributed; it uses `ipfs` for its backend.

`redis` and memcached are the default generic choices here. Redis is newer and more flexible. memcached is sometimes faster? Dunno. Perhaps see Why Redis beats Memcached for caching.

See python caches for the practicalities of doing this for one particular language.
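In-process, the pattern is get-or-compute; the stdlib’s `functools.lru_cache` is the zero-infrastructure version, and a redis client slots into the same pattern when the cache must be shared across processes. A toy sketch:

```python
from functools import lru_cache

calls = 0  # count how often the slow path actually runs

@lru_cache(maxsize=1024)
def expensive(n):
    """Stand-in for a slow query or computation; memoised by argument."""
    global calls
    calls += 1
    return n * n

expensive(12)  # computed: cache miss
expensive(12)  # served from the cache: no recomputation
```

Swapping the decorator for explicit `GET`/`SETEX` calls against a redis server keeps the same shape but survives process restarts and is shared between workers.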

Graph-tuple oriented processing.

GE is also a flexible computation engine powered by declarative message passing. GE is for you, if you are building a system that needs to perform fine-grained user-specified server-side computation.

From the perspective of graph computation, GE is not a graph system specifically optimized for a certain graph operation. Instead, with its built-in data and computation modeling capability, we can develop graph computation modules with ease. In other words, GE can easily morph into a system supporting a specific graph computation.

Nebula Graph is an open-source graph database capable of hosting super large scale graphs with dozens of billions of vertices (nodes) and trillions of edges, with milliseconds of latency.

There are a lot more of these. Everyone is inventing new graph stores at the moment.

immudb is a

lightweight, high-speed immutable database for systems and applications, written in Go. With immudb you can track changes in sensitive data in your transactional databases and then record those changes permanently in a tamperproof immudb database. This allows you to keep an indelible history of sensitive data, for example debit/credit card transactions. Traditional transaction logs are hard to scale and are mutable. So there is no way to know for sure if your data has been compromised.

As such, immudb provides unparalleled insights retroactively into changes to your sensitive data, even if your perimeter has been compromised. immudb guarantees immutability by using a Merkle tree structure internally. immudb gives you the same cryptographic verification of the integrity of data written with SHA-256 as a classic blockchain, without the cost and complexity associated with blockchains today.

TileDB is a DB built around multi-dimensional arrays that enables you to easily work with types that aren’t a great fit for existing RDBMS systems, such as dense and sparse arrays and dataframes. TileDB is specifically geared toward use cases like genomics and geospatial data.

## Noteworthy features

Self-learning database: NoisePage

NoisePage is a relational database management system (DBMS) designed from the ground up for autonomous deployment. It uses integrated machine learning components to control its configuration, optimization, and tuning. The system will support automated physical database design (e.g., indexes, materialized views, sharding), knob configuration tuning, SQL tuning, and hardware capacity/scaling. Our research focuses on building the system components that support such self-driving operation with little to no human guidance.

I believe that it does not work yet.

Logica is for engineers, data scientists and other specialists who want to use logic programming syntax when writing queries and pipelines to run on BigQuery.

Logica compiles to StandardSQL and gives you access to the power of BigQuery engine with the convenience of logic programming syntax. This is useful because BigQuery is magnitudes more powerful than state of the art native logic programming engines.

We encourage you to try Logica, especially if

- you already use logic programming and need more computational power, or
- you use SQL, but feel unsatisfied about its readability, or
- you want to learn logic programming and apply it to processing of Big Data.

In the future we plan to support more SQL dialects and engines.

Clickhouse for example is a *columnar* database that avoids
some of the problems
of row-oriented tabular databases. I guess you could try that?
And Amazon Athena turns arbitrary data into SQL-queryable data, apparently.
So the skills here are general.

Columnar in-process DB:

Avoiding corporate spying in the web. The browser mediates a large portion of my interaction with the internet, so I should make sure it is ship shape, and specifically, that it is not leaking my info everywhere.

Blacklight realtime privacy inspector. I Scanned the Websites I Visit with Blacklight, and It’s Horrifying. Now What?

Nearly all websites use tracking technologies to collect data about you. By law, they often need your permission, which is why many websites have "consent pop-ups". However, 90% of these pop-ups use so-called "dark patterns", which are designed to make it very difficult to say no, but very easy to say yes. Although using dark patterns is illegal, the laws are not enforced enough, so many websites get away with it.

Consent-O-Matic is a browser extension that recognizes CMP (Consent Management Provider) pop-ups that have become ubiquitous on the web and automatically fills them out based on your preferences, even if you meet a dark pattern design. Sometimes a website might not use standard categories, and in that case, Consent-O-Matic will always try to submit the most privacy-preserving settings.

Use a password manager. It is easy, free and saves time.

To take control of my identity online I use Privacy Possum, uBlock Origin, and ClearURLs in the Firefox browser, which is IMO the best browser. This is a good level of fussiness for an obsessive tinkerer like me. Sometimes I use the Brave browser instead of Firefox when some website quirk doesn’t work in Firefox.

I tried a lot of things before settling on these tools; some of the other options might be of interest.

Privacy Possum aims to be a more aggressive successor to Privacy Badger that (the creator argues) remedies certain shortcomings in Privacy Badger. The argument is something like “let us raise the cost of tracking people and consider ourselves successful if it is probably too expensive to bother”.

ClearURLs removes tracking crap from your URLs.

Privacy Badger is an open-source, non-profit, low-configuration blocker of tracking advertisers.

Startpage Privacy Protection Extension might be good, but I am nervous about it because I cannot find the source code, even though they say nice things.

scriptsafe offers aggressive, no-frills script blocking.

The browser plugs suite comprises various browser plugs that hinder fingerprinting of the unique features of your browser.

Fuzzify automates and monitors clicking on the “delete my ad data” button on Facebook.

adblock plus is a uBlock Origin alternative. Better business model but AFAICT a worse product.

torbrowser bundles all the ad-blocking conceivable, although it also makes browsing unpleasant and slow. There is some kind of lesson there.

Ghostery disables most of the social media spyware, although its process is a little opaque.

uBlock Origin is an adblocker and general tracking blocker with a complicated history, which we can mostly ignore.
(Except, pro-tip, ublock.org is nothing to do with uBlock Origin.)
It has a *semi pro* feel, being not quite as polished as its commercial cousins but also more configurable.
Some people prefer the somewhat smoother but also compromise-filled Adblock plus.

NB: It works best on Firefox. That essay is also an interesting insight into various superior Firefox features; they may be an endangered species.

The sweet spot for me is medium mode, which I find gives me the freedom to tweak glitches I see in *easy mode* without freaking out with choice paralysis like in *hard mode*.

There is a discontinued (?) alternative by the same author called umatrix, which I find offers way too many choices for a sane person.

uBlock Origin also comes with a handy element zapper mode, which I use to eliminate distractions.

HTTPS everywhere is vexing.
It is a mass of code that plasters over certain security holes caused by the continued existence of HTTP and Secure HTTP in parallel.
Which sounds fine — does *everything* need to be encrypted?
Well, no, IMO, but while swapping between secure and insecure modes is an option, it means that some things that do need to be encrypted are not.

Effectively, security-optional leads to writing your passwords on the lawn in big letters any time someone asks. But don’t take my word for it — see how this was used in the PoisonTap attack.

This is being gradually rendered irrelevant by some network technology called HSTS; hopefully we can forget it soon.

In the interim we can switch off insecure mode:

Firefox: Settings > Privacy & Security > Scroll to Bottom > Enable HTTPS-Only Mode

Chrome: Settings > Privacy and security > Security > Scroll to bottom > Toggle “Always use secure connections”

See internet search.

“Private Browsing mode revised and improved”. Firefox multi-user-containers are one low-friction option; they compartmentalise our different online activities from each other so that each website lives in its own solipsist universe. These have obvious privacy implications — keep all your sites isolated from one another! Why does Google need to know about your Facebook usage? They are also generally useful.

For example, if a site such as medium.com constantly nags you to become a member after you have read 2 articles in the same month, create a new browser container, and get two more free articles.

I could use a *Single-site browser* for spyware sites such as Facebook, because

- Otherwise Facebook would know even more about me than they do
- Facebook is a blackhole of timewaste that I don’t want to browse to by accident, so I should make it slightly easier to segregate that activity from other ones.

You can do this too, for social media or for whatever other website you wish.

nativefier/nativefier: Make any web page a desktop application is, I think, the most popular method currently? Cross platform.

Epichrome (macOS): An application (Epichrome.app) and Chrome extension (Epichrome Helper) to create and use Chrome-based SSBs on macOS. So, full Chrome, custom configuration. Here is a walk-through.

MacPin (macOS)

creates macOS & iOS apps for websites & webapps, configured with JavaScript.

The Browser UI is very minimal, just a toolbar (with site tabs) that disappears in Full-Screen mode.

MacPin apps are shown in macOS’s Dock, App Switcher, and Launchpad.

Custom URL schemes can also be registered to launch a MacPin App from any other app on your Mac.

So, minimal browserlets.

There are more manual methods.

- How To Turn Chrome or Firefox Into A Single-Site Browser.
- Making Firefox into a “Single-Site Browser”
- Create applications shortcuts in Google Chrome for Macs with a shell script

Generally, I find the browser container feature of firefox much easier.

Choosy: A smarter default browser for macOS lets us switch between these for different purposes easily.

Left-field solution idea: obfuscate your activity. Get your browser to do meaningless nonsense that obscures the patterns of your behaviour. I would be curious to know how effective that is, or even how one would discover how effective that is. I am not hopeful that this works, which is why it is at the top of the page, but it is an interesting idea.

Random noise extensions attempt to make your browsing data useless to trackers by making your browser mindlessly visit lots of nonsense sites, thus confusing the paper trail. noiszy does this for news consumption. trackmenot does this for search queries. AdNauseam is the latest one:

AdNauseam works to complete the cycle by automating ad clicks universally and blindly on behalf of its users. Built atop uBlock Origin, AdNauseam quietly clicks on every blocked ad, registering a visit on ad networks’ databases. As the collected data shows an omnivorous click-stream, user tracking, targeting and surveillance become futile. Read more about AdNauseam in this paper.

Some browsers claim to be privacy first.

- Firefox is safer than Chrome per default and easy to configure to be even more secure. This is what I use.
- DuckDuckGo Browser claims to be secure.
- Brave is a browser which claims to eliminate most tracking except for consensual-opt-in privacy-compatible tracking. I have many questions about that, but it is worth a try. It has a cryptocurrency bent and is more closed-source than the alternatives.

Predicting with confidence: the best machine learning idea you never heard of from renowned passive-aggressive grumpy bastard Scott Locklin (sorry Scott, but you are so reliably objectionable that I am always going to need to put a disclaimer on links to you; why do you refer to female scientists as “this woman”?):

The essential idea is that a “conformity function” exists. Effectively you are constructing a sort of multivariate cumulative distribution function for your machine learning gizmo using the conformity function. Such CDFs exist for classical stuff like ARIMA and linear regression under the correct circumstances; CP brings the idea to machine learning in general, and to models like ARIMA when the standard parametric confidence intervals won’t work.
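The “multivariate CDF” intuition is easiest to see in split conformal regression: hold out a calibration set, score it with a nonconformity function (here plain absolute residuals), and use the appropriate quantile of those scores as a prediction-interval radius. A toy sketch with an invented stand-in model:

```python
import math
import random

random.seed(0)
# Toy data: y = 2x + noise. The "model" below is just the known slope;
# in practice you would fit any regressor on a separate training split.
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2 * x + random.gauss(0, 1) for x in xs]

def predict(x):
    return 2 * x  # hypothetical fitted model

# Nonconformity scores on the calibration set: absolute residuals.
scores = sorted(abs(y - predict(x)) for x, y in zip(xs, ys))

alpha = 0.1                                        # target 90% coverage
n = len(scores)
k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
q = scores[k]                                      # conformal quantile

def interval(x):
    """Prediction interval with ~(1 - alpha) coverage under exchangeability."""
    return predict(x) - q, predict(x) + q
```

The only distributional assumption is exchangeability of calibration and test points; the model itself can be arbitrarily wrong and the intervals simply get wider.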

Cosma Shalizi recommends Samii’s Conformal Inference Tutorial and Lei et al. (2017), because he felt Vovk, Gammerman, and Shafer (2005) was badly written. Maybe Shafer’s tutorial is good? (Shafer and Vovk 2008). Modern takes in Alvarsson et al. (2021); Zeni, Fontana, and Vantini (2020) and A Tutorial on Conformal Prediction plus accompanying video (Angelopoulos and Bates 2022).

Emmanuel Candès’ NeurIPS keynote on Conformal Prediction in 2022 was good.

Question: how does conformal prediction work under dataset shift (Tibshirani et al. 2019; Barber et al. 2023)?

Alvarsson, Jonathan, Staffan Arvidsson McShane, Ulf Norinder, and Ola Spjuth. 2021. “Predicting With Confidence: Using Conformal Prediction in Drug Discovery.” *Journal of Pharmaceutical Sciences* 110 (1): 42–49.

Angelopoulos, Anastasios N., and Stephen Bates. 2022. “A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification.” arXiv.

———. 2023. “Conformal Prediction: A Gentle Introduction.” *Foundations and Trends® in Machine Learning* 16 (4): 494–591.

Barber, Rina Foygel, Emmanuel J. Candes, Aaditya Ramdas, and Ryan J. Tibshirani. 2023. “Conformal Prediction Beyond Exchangeability.” arXiv.

Barber, Rina Foygel, Emmanuel J. Candès, Aaditya Ramdas, and Ryan J. Tibshirani. 2021. “Predictive Inference with the Jackknife+.” *The Annals of Statistics* 49 (1): 486–507.

Bastani, Osbert, Varun Gupta, Christopher Jung, Georgy Noarov, Ramya Ramalingam, and Aaron Roth. 2022. “Practical Adversarial Multivalid Conformal Prediction.” arXiv.

Card, Dallas, Michael Zhang, and Noah A. Smith. 2019. “Deep Weighted Averaging Classifiers.” In *Proceedings of the Conference on Fairness, Accountability, and Transparency*, 369–78.

Efron, Bradley. 2021. “Resampling Plans and the Estimation of Prediction Error.” *Stats* 4 (4): 1091–1115.

Fontana, Matteo, Gianluca Zeni, and Simone Vantini. 2023. “Conformal Prediction: A Unified Review of Theory and New Challenges.” *Bernoulli* 29 (1): 1–23.

Gibbs, Isaac, and Emmanuel Candès. 2022. “Conformal Inference for Online Prediction with Arbitrary Distribution Shifts.” arXiv.

Hu, Yuge, Joseph Musielewicz, Zachary W Ulissi, and Andrew J Medford. 2022. “Robust and Scalable Uncertainty Estimation with Conformal Prediction for Machine-Learned Interatomic Potentials.” *Machine Learning: Science and Technology* 3 (4): 045028.

Lei, Jing, Max G’Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. 2017. “Distribution-Free Predictive Inference For Regression.” arXiv.

Norinder, Ulf, Lars Carlsson, Scott Boyer, and Martin Eklund. 2014. “Introducing Conformal Prediction in Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain Determination.” *Journal of Chemical Information and Modeling* 54 (6): 1596–1603.

Romano, Yaniv, Evan Patterson, and Emmanuel Candes. 2019. “Conformalized Quantile Regression.” In *Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Shafer, Glenn, and Vladimir Vovk. 2008. “A Tutorial on Conformal Prediction.” *Journal of Machine Learning Research* 9 (12): 371–421.

Tibshirani, Ryan J, Rina Foygel Barber, Emmanuel Candes, and Aaditya Ramdas. 2019. “Conformal Prediction Under Covariate Shift.” In *Advances in Neural Information Processing Systems*. Vol. 32. Curran Associates, Inc.

Vovk, Vladimir, Alex Gammerman, and Glenn Shafer. 2005. *Algorithmic Learning in a Random World*. Springer Science & Business Media.

Vovk, Vladimir, Ilia Nouretdinov, and Alexander Gammerman. 2009. “On-Line Predictive Linear Regression.” *The Annals of Statistics* 37 (3): 1566–90.

Zeni, Gianluca, Matteo Fontana, and Simone Vantini. 2020. “Conformal Prediction: A Unified Review of Theory and New Challenges.” *arXiv:2005.07972 [Cs, Econ, Stat]*, May.

Optimising an objective defined as a weighted sum of multiple objectives of unknown weights can be difficult. Useful in multi task learning, for example, or in weighting regularisation in regression, including neural nets.

HT Cheng Soon Ong for pointing out Jonas Degrave and Ira Korshunova’s illustrated explanation of a tricky thing, Why machine learning algorithms are hard to tune (and the fix). His summary:

Machine learning hyperparameters are hard to tune. One way to think of why it is hard is that it is a Pareto front of multiple objectives. One way to solve that problem is to look at Lagrange multipliers, as proposed by a paper in 1988 (Platt and Barr 1987).

A follow-up post describes how we can make machine learning algorithms tunable.
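The Pareto-front framing can be made concrete with two toy objectives: scalarise them with a weight, minimise, and sweep the weight. Each weight recovers one Pareto-optimal point, which is exactly why committing to a fixed weighted sum amounts to committing, blindly, to one point on the front. All names below are invented for illustration.

```python
# Two competing toy objectives, with minima at x = 0 and x = 1 respectively.
def f1(x):
    return x ** 2

def f2(x):
    return (x - 1) ** 2

def argmin_weighted(w, lo=-2.0, hi=3.0, steps=10001):
    """Brute-force 1-D minimiser of the scalarised objective w*f1 + (1-w)*f2."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda x: w * f1(x) + (1 - w) * f2(x))

# Sweeping the weight from 0 to 1 traces out (part of) the Pareto front.
front = [(f1(x), f2(x)) for x in (argmin_weighted(w / 10) for w in range(11))]
```

For these convex objectives the scalarised minimiser is x = 1 - w, so the sweep covers the whole front; with non-convex objectives a weighted sum can miss parts of it, which is one of the failure modes the Lagrange-multiplier view addresses.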

Das, Indraneel, and John E. Dennis. 1997. “A Closer Look at Drawbacks of Minimizing Weighted Sums of Objectives for Pareto Set Generation in Multicriteria Optimization Problems.” *Structural Optimization* 14 (1): 63–69.

Jakob, Wilfried, and Christian Blume. 2014. “Pareto Optimization or Cascaded Weighted Sum: A Comparison of Concepts.” *Algorithms* 7 (1): 166–85.

Kim, Il Yong, and O. L. De Weck. 2006. “Adaptive Weighted Sum Method for Multiobjective Optimization: A New Method for Pareto Front Generation.” *Structural and Multidisciplinary Optimization* 31 (2): 105–16.

Kim, Il Yong, and Oliver L. De Weck. 2005. “Adaptive Weighted-Sum Method for Bi-Objective Optimization: Pareto Front Generation.” *Structural and Multidisciplinary Optimization* 29 (2): 149–58.

Marler, R., and Jasbir Arora. 2010. “The Weighted Sum Method for Multi-Objective Optimization: New Insights.” *Structural and Multidisciplinary Optimization* 41 (6): 853–62.

Platt, John C., and Alan H. Barr. 1987. “Constrained Differential Optimization.” In *Proceedings of the 1987 International Conference on Neural Information Processing Systems*, 612–21. NIPS’87. Cambridge, MA, USA: MIT Press.

Ryu, Jong-hyun, Sujin Kim, and Hong Wan. 2009. “Pareto Front Approximation with Adaptive Weighted Sum Method in Multiobjective Simulation Optimization.” In *Proceedings of the 2009 Winter Simulation Conference (WSC)*, 623–33. IEEE.

Doing inference where the probability metric measuring the discrepancy between some target distribution and the implied inferential distribution is an optimal-transport one. Frequently intractable, but neat when we can get it. Sometimes we can get there by estimating (the gradients of) an actual OT loss, or even the transport maps implying that loss.

Wasserstein GANs and OT GANs (Salimans et al. 2018) are argued to perform an approximate optimal transport inference, indirectly.

See e.g. J. H. Huggins et al. (2018a, 2018b) for a particular Bayes posterior approximation using OT.

Daniel Daza, in Approximating Wasserstein distances with PyTorch, touches upon Fatras et al. (2020):

Optimal transport distances are powerful tools to compare probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large scale datasets. To overcome this challenge, practitioners compute these distances on minibatches i.e., they average the outcome of several smaller optimal transport problems. We propose in this paper an analysis of this practice, which effects are not well understood so far. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators, gradients and a concentration bound around the expectation, but also with defects such as loss of distance property.
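The bias that the abstract alludes to is easy to reproduce. In one dimension the Wasserstein distance between two equal-size samples has a closed form (sort both and compare), so plain numpy suffices to show that averaging minibatch distances overestimates the full-sample distance even when the two distributions are identical. An illustrative sketch, not the paper’s experiment:

```python
import numpy as np

def w1_1d(x, y):
    # 1-D Wasserstein-1 between equal-size empirical measures:
    # sort both samples and average the absolute differences.
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # two samples from the *same* distribution,
y = rng.normal(size=10_000)  # so the population distance is zero

full = w1_1d(x, y)  # small, and shrinking as the sample size grows

# The minibatch practice: average the outcome of many small OT problems.
m, batches = 64, 100
mb = np.mean([
    w1_1d(rng.choice(x, m, replace=False), rng.choice(y, m, replace=False))
    for _ in range(batches)
])
# mb comes out markedly larger than full: the implicit bias of minibatching.
```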

I learned about it from Bai et al. (2023) via Cheng Soon Ong:

Comparing K (probability) measures requires the pairwise calculation of transport-based distances, which, despite the significant recent computational speed-ups, remains to be relatively expensive. To address this problem, W. Wang et al. (2013) proposed the Linear Optimal Transport (LOT) framework, which linearizes the 2-Wasserstein distance utilizing its weak Riemannian structure. In short, the probability measures are embedded into the tangent space at a fixed reference measure (e.g., the measures’ Wasserstein barycenter) through a logarithmic map. The Euclidean distances between the embedded measures then approximate the 2-Wasserstein distance between the probability measures. The LOT framework is computationally attractive as it only requires the computation of one optimal transport problem per input measure, reducing the otherwise quadratic cost to linear. Moreover, the framework provides theoretical guarantees on convexifying certain sets of probability measures […], which is critical in supervised and unsupervised learning from sets of probability measures.
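The construction is easiest to see in one dimension, where the Monge map to any reference measure is built from quantile functions, so the LOT embedding of a measure is just a vector of its quantiles and the approximation is exact. A numpy sketch of that special case (in general each embedding costs one OT solve):

```python
import numpy as np

# Embed each 1-D measure as its quantile function evaluated on a fixed
# grid of levels: the tangent-space coordinates at the reference measure.
def lot_embed(sample, levels):
    return np.quantile(sample, levels)

levels = (np.arange(200) + 0.5) / 200  # midpoint quantile levels
rng = np.random.default_rng(1)
measures = [rng.normal(loc=mu, size=5000) for mu in (0.0, 1.0, 3.0)]
E = [lot_embed(s, levels) for s in measures]  # one embedding per measure

# After the one-off embeddings, every pairwise 2-Wasserstein distance is
# a cheap Euclidean distance between embedding vectors.
def w2_embedded(ei, ej):
    return np.sqrt(np.mean((ei - ej) ** 2))

# Shifted normals: the distances track the mean shifts (about 1 and 3 here).
```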

Optimal Transport Tools (OTT) (Cuturi et al. 2022), a toolbox for all things Wasserstein (documentation):

The goal of OTT is to provide sturdy, versatile and efficient optimal transport solvers, taking advantage of JAX features, such as JIT, auto-vectorization and implicit differentiation.

A typical OT problem has two ingredients: a pair of weight vectors `a` and `b` (one for each measure), with a ground cost matrix that is either directly given, or derived as the pairwise evaluation of a cost function on pairs of points taken from two measures. The main design choice in OTT comes from encapsulating the cost in a `Geometry` object, and [bundling] it with a few useful operations (notably kernel applications). The most common geometry is that of two clouds of vectors compared with the squared Euclidean distance, as illustrated in the example below:

```
import jax
import jax.numpy as jnp
from ott.tools import transport

# Sample two point clouds and their weights.
rngs = jax.random.split(jax.random.PRNGKey(0), 4)
n, m, d = 12, 14, 2
x = jax.random.normal(rngs[0], (n, d)) + 1
y = jax.random.uniform(rngs[1], (m, d))
a = jax.random.uniform(rngs[2], (n,))
b = jax.random.uniform(rngs[3], (m,))
a, b = a / jnp.sum(a), b / jnp.sum(b)

# Compute the coupling via the Sinkhorn algorithm.
ot = transport.solve(x, y, a=a, b=b)
P = ot.matrix
```

The call to `sinkhorn` above works out the optimal transport solution by storing its output. The transport matrix can be instantiated using those optimal solutions and the `Geometry` again. That transport matrix links each point from the first point cloud to one or more points from the second, as illustrated below. To be more precise, the `sinkhorn` algorithm operates on the `Geometry`, taking into account weights `a` and `b`, to solve the OT problem, produce a named tuple that contains two optimal dual potentials `f` and `g` (vectors of the same size as `a` and `b`), the objective `reg_ot_cost` and a log of the `errors` of the algorithm as it converges, and a `converged` flag.
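For intuition about those outputs, here is a tiny log-domain Sinkhorn in plain numpy that returns the same sort of objects: dual potentials `f` and `g`, a `reg_ot_cost`, an error log and a `converged` flag. A sketch of the algorithm, not OTT’s implementation:

```python
import numpy as np

def lse(z, axis):
    # Numerically stable log-sum-exp along an axis.
    zmax = z.max(axis=axis, keepdims=True)
    return np.squeeze(zmax, axis) + np.log(np.exp(z - zmax).sum(axis=axis))

def sinkhorn(C, a, b, eps=0.5, iters=2000, tol=1e-6):
    # Entropic OT with coupling P_ij = exp((f_i + g_j - C_ij) / eps).
    f, g = np.zeros_like(a), np.zeros_like(b)
    errors, converged = [], False
    for _ in range(iters):
        # Alternating dual updates; each enforces one marginal exactly.
        f = eps * np.log(a) - eps * lse((g[None, :] - C) / eps, axis=1)
        g = eps * np.log(b) - eps * lse((f[:, None] - C) / eps, axis=0)
        P = np.exp((f[:, None] + g[None, :] - C) / eps)
        err = np.abs(P.sum(axis=1) - a).sum()  # row-marginal violation
        errors.append(err)
        if err < tol:
            converged = True
            break
    reg_ot_cost = f @ a + g @ b  # dual objective at convergence
    return f, g, P, reg_ot_cost, errors, converged

# Two tiny point clouds with a squared-Euclidean ground cost.
rng = np.random.default_rng(0)
x_pts, y_pts = rng.uniform(size=(5, 2)), rng.uniform(size=(7, 2))
C = ((x_pts[:, None, :] - y_pts[None, :, :]) ** 2).sum(-1)
a, b = np.full(5, 1 / 5), np.full(7, 1 / 7)
f, g, P, cost, errors, converged = sinkhorn(C, a, b)
```

Smaller `eps` sharpens the coupling towards the unregularized plan but slows convergence, which is why the error log and convergence flag matter in practice.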

POT: Python Optimal Transport (Rémi Flamary et al. 2021):

This open source Python library provides several solvers for optimization problems related to Optimal Transport for signal and image processing and machine learning.

Website and documentation: https://PythonOT.github.io/

Source Code (MIT): https://github.com/PythonOT/POT

POT provides the following generic OT solvers (links to examples):

- OT Network Simplex solver for the linear program / Earth Mover’s Distance.
- Conditional gradient and Generalized conditional gradient for regularized OT.
- Entropic regularization OT solver with the Sinkhorn-Knopp algorithm, stabilized version, greedy Sinkhorn and Screening Sinkhorn.
- Bregman projections for Wasserstein barycenter, convolutional barycenter and unmixing.
- Sinkhorn divergence and entropic regularization OT from empirical data.
- Debiased Sinkhorn barycenters (Sinkhorn divergence barycenter).
- Smooth optimal transport solvers (dual and semi-dual) for KL and squared L2 regularizations.
- Weak OT solver between empirical distributions.
- Non-regularized Wasserstein barycenters with LP solver (only small scale).
- Gromov-Wasserstein distances and GW barycenters (exact and regularized), differentiable using gradients from Graph Dictionary Learning.
- Fused Gromov-Wasserstein distances solver and FGW barycenters.
- Stochastic solver and differentiable losses for large-scale optimal transport (semi-dual problem and dual problem).
- Sampled solver of Gromov-Wasserstein for large-scale problems with any loss function.
- Non-regularized free-support Wasserstein barycenters.
- One-dimensional unbalanced OT with KL relaxation and barycenter \[10, 25\]. Also exact unbalanced OT with KL and quadratic regularization, and the regularization path of UOT.
- Partial Wasserstein and Gromov-Wasserstein (exact and entropic formulations).
- Sliced Wasserstein \[31, 32\] and max-sliced Wasserstein, which can be used for gradient flows.
- Graph Dictionary Learning solvers.
- Several backends for easy use of POT with PyTorch/JAX/NumPy/CuPy/TensorFlow arrays.

POT provides the following Machine Learning related solvers:

- Optimal transport for domain adaptation with group-lasso regularization, Laplacian regularization and a semi-supervised setting.
- Linear OT mapping and Joint OT mapping estimation.
- Wasserstein Discriminant Analysis (requires autograd + pymanopt).
- JCPOT algorithm for multi-source domain adaptation with target shift.

Some other examples are available in the documentation.
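Several of the solvers above exploit the fact that one-dimensional OT is trivial. The sliced Wasserstein distance, for instance, averages 1-D distances over random projections; a self-contained numpy sketch (not POT’s implementation), assuming equal-size point clouds:

```python
import numpy as np

def sliced_w2(x, y, n_proj=200, seed=0):
    # Project both clouds onto random unit directions, solve the trivial
    # 1-D problem on each slice (sort and compare), then average.
    # Assumes the two clouds have equally many, equally weighted points.
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_proj, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ theta.T, axis=0)  # shape (n_points, n_proj)
    py = np.sort(y @ theta.T, axis=0)
    return np.sqrt(np.mean((px - py) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 3))
y = rng.normal(size=(500, 3)) + np.array([2.0, 0.0, 0.0])
# sliced_w2(x, x) is exactly zero; the shifted pair gives a clearly
# positive distance that grows with the shift.
```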

The GeomLoss library provides efficient GPU implementations for:

- Kernel norms (also known as Maximum Mean Discrepancies).
- Hausdorff divergences, which are positive definite generalizations of the Chamfer-ICP loss and are analogous to log-likelihoods of Gaussian Mixture Models.
- Debiased Sinkhorn divergences, which are affordable yet positive and definite approximations of Optimal Transport (Wasserstein) distances.

It is hosted on GitHub and distributed under the permissive MIT license.

GeomLoss functions are available through the custom PyTorch layers `SamplesLoss`, `ImagesLoss` and `VolumesLoss`, which allow you to work with weighted point clouds (of any dimension), density maps and volumetric segmentation masks.

Rigollet and Weed (2018):

We give a statistical interpretation of entropic optimal transport by showing that performing maximum-likelihood estimation for Gaussian deconvolution corresponds to calculating a projection with respect to the entropic optimal transport distance.

Thomas Viehmann’s An efficient implementation of the Sinkhorn algorithm for the GPU is a PyTorch CUDA extension (Viehmann 2019).

Marco Cuturi’s course notes on OT include a 400-page slide deck.

Agueh, Martial, and Guillaume Carlier. 2011.“Barycenters in the Wasserstein Space.”*SIAM Journal on Mathematical Analysis* 43 (2): 904–24.

Alaya, Mokhtar Z., Maxime Berar, Gilles Gasso, and Alain Rakotomamonjy. 2019.“Screening Sinkhorn Algorithm for Regularized Optimal Transport.”*Advances in Neural Information Processing Systems* 32.

Altschuler, Jason, Jonathan Niles-Weed, and Philippe Rigollet. n.d.“Near-Linear Time Approximation Algorithms for Optimal Transport via Sinkhorn Iteration,” 11.

Ambrogioni, Luca, Umut Güçlü, Yagmur Güçlütürk, Max Hinne, Eric Maris, and Marcel A. J. van Gerven. 2018.“Wasserstein Variational Inference.” In*Proceedings of the 32Nd International Conference on Neural Information Processing Systems*, 2478–87. NIPS’18. USA: Curran Associates Inc.

Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savare. 2008.*Gradient Flows: In Metric Spaces and in the Space of Probability Measures*. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel.

Angenent, Sigurd, Steven Haker, and Allen Tannenbaum. 2003.“Minimizing Flows for the Monge-Kantorovich Problem.”*SIAM Journal on Mathematical Analysis* 35 (1): 61–97.

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. 2017.“Wasserstein Generative Adversarial Networks.” In*International Conference on Machine Learning*, 214–23.

Arora, Sanjeev, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. 2017.“Generalization and Equilibrium in Generative Adversarial Nets (GANs).”*arXiv:1703.00573 [Cs]*, March.

Bachoc, Francois, Alexandra Suvorikova, David Ginsbourger, Jean-Michel Loubes, and Vladimir Spokoiny. 2019.“Gaussian Processes with Multidimensional Distribution Inputs via Optimal Transport and Hilbertian Embedding.”*arXiv:1805.00753 [Stat]*, April.

Bai, Yikun, Ivan Medri, Rocio Diaz Martin, Rana Muhammad Shahroz Khan, and Soheil Kolouri. 2023.“Linear Optimal Partial Transport Embedding.”

Benamou, Jean-David. 2021.“Optimal Transportation, Modelling and Numerical Simulation.”*Acta Numerica* 30 (May): 249–325.

Benamou, Jean-David, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. 2014.“Iterative Bregman Projections for Regularized Transportation Problems.”*arXiv:1412.5154 [Math]*, December.

Berg, Rianne van den, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. 2018.“Sylvester Normalizing Flows for Variational Inference.” In*UAI18*.

Bishop, Adrian N., and Arnaud Doucet. 2014.“Distributed Nonlinear Consensus in the Space of Probability Measures.”*IFAC Proceedings Volumes*, 19th IFAC World Congress, 47 (3): 8662–68.

Blanchet, Jose, Lin Chen, and Xun Yu Zhou. 2018.“Distributionally Robust Mean-Variance Portfolio Selection with Wasserstein Distances.”*arXiv:1802.04885 [Stat]*, February.

Blanchet, Jose, Arun Jambulapati, Carson Kent, and Aaron Sidford. 2018.“Towards Optimal Running Times for Optimal Transport.”*arXiv:1810.07717 [Cs]*, October.

Blanchet, Jose, Yang Kang, and Karthyek Murthy. 2016.“Robust Wasserstein Profile Inference and Applications to Machine Learning.”*arXiv:1610.05627 [Math, Stat]*, October.

Blanchet, Jose, Karthyek Murthy, and Nian Si. 2019.“Confidence Regions in Wasserstein Distributionally Robust Estimation.”*arXiv:1906.01614 [Math, Stat]*, June.

Blondel, Mathieu, Vivien Seguy, and Antoine Rolet. 2018.“Smooth and Sparse Optimal Transport.” In*AISTATS 2018*.

Boissard, Emmanuel. 2011.“Simple Bounds for the Convergence of Empirical and Occupation Measures in 1-Wasserstein Distance.”*Electronic Journal of Probability* 16 (none).

Bonneel, Nicolas. n.d.“Displacement Interpolation Using Lagrangian Mass Transport,” 11.

Bonnotte, Nicolas. 2012.“From Knothe’s Rearrangement to Brenier’s Optimal Transport Map.” arXiv.

Canas, Guillermo D., and Lorenzo Rosasco. 2012.“Learning Probability Measures with Respect to Optimal Transport Metrics.”*arXiv:1209.1077 [Cs, Stat]*, September.

Carlier, Guillaume, Marco Cuturi, Brendan Pass, and Carola Schoenlieb. 2017.“Optimal Transport Meets Probability, Statistics and Machine Learning,” 9.

Carlier, Guillaume, Alfred Galichon, and Filippo Santambrogio. 2008.“From Knothe’s Transport to Brenier’s Map and a Continuation Method for Optimal Transport.” arXiv.

Chizat, Lenaic, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. 2017.“Scaling Algorithms for Unbalanced Transport Problems.”*arXiv:1607.05816 [Math]*, May.

Chu, Casey, Jose Blanchet, and Peter Glynn. 2019.“Probability Functional Descent: A Unifying Perspective on GANs, Variational Inference, and Reinforcement Learning.” In*ICML*.

Corenflos, Adrien, James Thornton, George Deligiannidis, and Arnaud Doucet. 2021.“Differentiable Particle Filtering via Entropy-Regularized Optimal Transport.”*arXiv:2102.07850 [Cs, Stat]*, June.

Coscia, Michele. 2020.“Generalized Euclidean Measure to Estimate Network Distances,” 11.

Courty, Nicolas, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. 2016.“Optimal Transport for Domain Adaptation.”*arXiv:1507.00504 [Cs]*, June.

Cuturi, Marco. 2013.“Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances.” In*Advances in Neural Information Processing Systems 26*.

Cuturi, Marco, and Arnaud Doucet. 2014.“Fast Computation of Wasserstein Barycenters.” In*International Conference on Machine Learning*, 685–93. PMLR.

Cuturi, Marco, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul. 2022.“Optimal Transport Tools (OTT): A JAX Toolbox for All Things Wasserstein.”*arXiv Preprint arXiv:2201.12324*.

Fatras, Kilian, Younes Zine, Rémi Flamary, Remi Gribonval, and Nicolas Courty. 2020.“Learning with Minibatch Wasserstein : Asymptotic and Gradient Properties.” In*Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics*, 2131–41. PMLR.

Fernholz, Luisa Turrin. 1983.*von Mises calculus for statistical functionals*. Lecture Notes in Statistics 19. New York: Springer.

Feydy, Jean, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. 2019.“Interpolating Between Optimal Transport and MMD Using Sinkhorn Divergences.” In*Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics*, 2681–90. PMLR.

Flamary, Rémi, Nicolas Courty, Alexandre Gramfort, Mokhtar Z. Alaya, Aurélie Boisbunon, Stanislas Chambon, Laetitia Chapel, et al. 2021.“POT: Python Optimal Transport.”*Journal of Machine Learning Research* 22 (78): 1–8.

Flamary, Rémi, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. 2018.“Wasserstein Discriminant Analysis.”*Machine Learning* 107 (12): 1923–45.

Flamary, Remi, Alain Rakotomamonjy, Nicolas Courty, and Devis Tuia. n.d.“Optimal Transport with Laplacian Regularization,” 10.

Frogner, Charlie, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. 2015.“Learning with a Wasserstein Loss.” In*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2053–61. Curran Associates, Inc.

Gao, Rui, and Anton J. Kleywegt. 2022.“Distributionally Robust Stochastic Optimization with Wasserstein Distance.” arXiv.

Garbuno-Inigo, Alfredo, Franca Hoffmann, Wuchen Li, and Andrew M. Stuart. 2020.“Interacting Langevin Diffusions: Gradient Structure and Ensemble Kalman Sampler.”*SIAM Journal on Applied Dynamical Systems* 19 (1): 412–41.

Genevay, Aude, Marco Cuturi, Gabriel Peyré, and Francis Bach. 2016.“Stochastic Optimization for Large-Scale Optimal Transport.” In*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3432–40. Curran Associates, Inc.

Genevay, Aude, Gabriel Peyré, and Marco Cuturi. 2017.“Learning Generative Models with Sinkhorn Divergences.”*arXiv:1706.00292 [Stat]*, October.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014.“Generative Adversarial Nets.” In*Advances in Neural Information Processing Systems 27*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–80. NIPS’14. Cambridge, MA, USA: Curran Associates, Inc.

Gozlan, Nathael, and Christian Léonard. 2010.“Transport Inequalities. A Survey.”*arXiv:1003.3852 [Math]*, March.

Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017.“Improved Training of Wasserstein GANs.”*arXiv:1704.00028 [Cs, Stat]*, March.

Guo, Xin, Johnny Hong, Tianyi Lin, and Nan Yang. 2017.“Relaxed Wasserstein with Applications to GANs.”*arXiv:1705.07164 [Cs, Stat]*, May.

Huggins, Jonathan H., Trevor Campbell, Mikołaj Kasprzak, and Tamara Broderick. 2018a.“Scalable Gaussian Process Inference with Finite-Data Mean and Variance Guarantees.”*arXiv:1806.10234 [Cs, Stat]*, June.

———. 2018b.“Practical Bounds on the Error of Bayesian Posterior Approximations: A Nonasymptotic Approach.”*arXiv:1809.09505 [Cs, Math, Stat]*, September.

Huggins, Jonathan H., Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick. 2019.“Practical Posterior Error Bounds from Variational Objectives.”*arXiv:1910.04102 [Cs, Math, Stat]*, October.

Huggins, Jonathan, Ryan P Adams, and Tamara Broderick. 2017.“PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference.” In*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 3611–21. Curran Associates, Inc.

Khan, Gabriel, and Jun Zhang. 2022.“When Optimal Transport Meets Information Geometry.”*Information Geometry*, June.

Kim, Jin W., and Prashant G. Mehta. 2019.“An Optimal Control Derivation of Nonlinear Smoothing Equations,” April.

Léonard, Christian. 2014.“A Survey of the Schrödinger Problem and Some of Its Connections with Optimal Transport.”*Discrete & Continuous Dynamical Systems - A* 34 (4): 1533.

Liu, Huidong, Xianfeng Gu, and Dimitris Samaras. 2018.“A Two-Step Computation of the Exact GAN Wasserstein Distance.” In*International Conference on Machine Learning*, 3159–68.

Liu, Qiang, and Dilin Wang. 2019.“Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In*Advances In Neural Information Processing Systems*.

Louizos, Christos, and Max Welling. 2017.“Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In*PMLR*, 2218–27.

Magyar, Jared C., and Malcolm S. Sambridge. 2022.“The Wasserstein Distance as a Hydrological Objective Function.” Preprint. Catchment hydrology/Mathematical applications.

Mahdian, Saied, Jose Blanchet, and Peter Glynn. 2019.“Optimal Transport Relaxations with Application to Wasserstein GANs.”*arXiv:1906.03317 [Cs, Math, Stat]*, June.

Mallasto, Anton, Augusto Gerolin, and Hà Quang Minh. 2021.“Entropy-Regularized 2-Wasserstein Distance Between Gaussian Measures.”*Information Geometry*, August.

Marzouk, Youssef, Tarek Moselhy, Matthew Parno, and Alessio Spantini. 2016.“Sampling via Measure Transport: An Introduction.” In*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1:1–41. Cham: Springer Heidelberg.

Maurya, Abhinav. 2018.“Optimal Transport in Statistical Machine Learning : Selected Review and Some Open Questions.” In.

Minh, Hà Quang. 2022.“Finite Sample Approximations of Exact and Entropic Wasserstein Distances Between Covariance Operators and Gaussian Processes.”*SIAM/ASA Journal on Uncertainty Quantification*, February, 96–124.

Mohajerin Esfahani, Peyman, and Daniel Kuhn. 2018.“Data-Driven Distributionally Robust Optimization Using the Wasserstein Metric: Performance Guarantees and Tractable Reformulations.”*Mathematical Programming* 171 (1): 115–66.

Montavon, Grégoire, Klaus-Robert Müller, and Marco Cuturi. 2016.“Wasserstein Training of Restricted Boltzmann Machines.” In*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 3711–19. Curran Associates, Inc.

Ostrovski, Georg, Will Dabney, and Remi Munos. n.d.“Autoregressive Quantile Networks for Generative Modeling,” 10.

Panaretos, Victor M., and Yoav Zemel. 2019.“Statistical Aspects of Wasserstein Distances.”*Annual Review of Statistics and Its Application* 6 (1): 405–31.

Perrot, Michaël, Nicolas Courty, Rémi Flamary, and Amaury Habrard. n.d.“Mapping Estimation for Discrete Optimal Transport,” 9.

Peyré, Gabriel, and Marco Cuturi. 2019.*Computational Optimal Transport*. Vol. 11.

Peyré, Gabriel, Marco Cuturi, and Justin Solomon. 2016.“Gromov-Wasserstein Averaging of Kernel and Distance Matrices.” In*International Conference on Machine Learning*, 2664–72. PMLR.

Redko, Ievgen, Nicolas Courty, Rémi Flamary, and Devis Tuia. 2019.“Optimal Transport for Multi-Source Domain Adaptation Under Target Shift.” In*The 22nd International Conference on Artificial Intelligence and Statistics*, 849–58. PMLR.

Rezende, Danilo Jimenez, and Shakir Mohamed. 2015.“Variational Inference with Normalizing Flows.” In*International Conference on Machine Learning*, 1530–38. ICML’15. Lille, France: JMLR.org.

Rigollet, Philippe, and Jonathan Weed. 2018.“Entropic Optimal Transport Is Maximum-Likelihood Deconvolution.” arXiv.

Rustamov, Raif M. 2021.“Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.”*Stat* 10 (1): e329.

Salimans, Tim, Han Zhang, Alec Radford, and Dimitris Metaxas. 2018.“Improving GANs Using Optimal Transport.” arXiv.

Sambridge, Malcolm, Andrew Jackson, and Andrew P Valentine. 2022.“Geophysical Inversion and Optimal Transport.”*Geophysical Journal International* 231 (1): 172–98.

Santambrogio, Filippo. 2015.*Optimal Transport for Applied Mathematicians*. Edited by Filippo Santambrogio. Progress in Nonlinear Differential Equations and Their Applications. Cham: Springer International Publishing.

Schmitzer, Bernhard. 2019.“Stabilized Sparse Scaling Algorithms for Entropy Regularized Transport Problems.”*arXiv:1610.06519 [Cs, Math]*, February.

Solomon, Justin, Fernando de Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. 2015.“Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains.”*ACM Transactions on Graphics* 34 (4): 66:1–11.

Spantini, Alessio, Daniele Bigoni, and Youssef Marzouk. 2017.“Inference via Low-Dimensional Couplings.”*Journal of Machine Learning Research* 19 (66): 2639–709.

Taghvaei, Amirhossein, and Prashant G. Mehta. 2021.“An Optimal Transport Formulation of the Ensemble Kalman Filter.”*IEEE Transactions on Automatic Control* 66 (7): 3052–67.

Verdinelli, Isabella, and Larry Wasserman. 2019.“Hybrid Wasserstein Distance and Fast Distribution Clustering.”*Electronic Journal of Statistics* 13 (2): 5088–5119.

Viehmann, Thomas. 2019.“Implementation of Batched Sinkhorn Iterations for Entropy-Regularized Wasserstein Loss.” arXiv.

Wang, Prince Zizhuang, and William Yang Wang. 2019.“Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*, 284–94. Minneapolis, Minnesota: Association for Computational Linguistics.

Wang, Wei, Dejan Slepčev, Saurav Basu, John A. Ozolek, and Gustavo K. Rohde. 2013.“A Linear Optimal Transportation Framework for Quantifying and Visualizing Variations in Sets of Images.”*International Journal of Computer Vision* 101 (2): 254–69.

Zhang, Rui, Christian Walder, Edwin V. Bonilla, Marian-Andrei Rizoiu, and Lexing Xie. 2020.“Quantile Propagation for Wasserstein-Approximate Gaussian Processes.” In*Proceedings of NeurIPS 2020*.

Zhu, B., J. Jiao, and D. Tse. 2020.“Deconstructing Generative Adversarial Networks.”*IEEE Transactions on Information Theory* 66 (11): 7155–79.

An Integral probability metric at the intersection of reproducing kernel methods, dependence tests and probability metrics, where we use a kernel embedding (typically an RKHS embedding) to cleverly measure differences between probability distributions.

Can be estimated from samples only, which is neat.
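To make “estimated from samples only” concrete, here is the unbiased U-statistic estimator of squared MMD with a Gaussian kernel in a few lines of numpy; a sketch, with the canonical treatment in Gretton, Borgwardt, et al. (2012):

```python
import numpy as np

def mmd2_unbiased(x, y, bandwidth=1.0):
    # Unbiased U-statistic estimate of squared MMD with a Gaussian kernel.
    def k(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    Kxx, Kyy, Kxy = k(x, x), k(y, y), k(x, y)
    n, m = len(x), len(y)
    # Diagonal terms are dropped so same-point pairs do not bias the estimate.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(400, 2))
y = rng.normal(size=(400, 2)) + 1.0  # shifted law
# The estimate hovers around zero for samples from the same law and is
# clearly positive for the shifted pair.
```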

A mere placeholder. For a thorough treatment see the canonical references (Gretton et al. 2008; Gretton, Borgwardt, et al. 2012).

Arthur Gretton, Dougal Sutherland and Wittawat Jitkrittum’s presentation: Interpretable Comparison of Distributions and Models.

Danica Sutherland’s explanation is IMO magnificent.

Pierre Alquier’s post Universal estimation with Maximum Mean Discrepancy (MMD) shows how to use MMD in a robust nonparametric estimator.

Gaël Varoquaux’s introduction, Comparing distributions: Kernels estimate good representations, l1 distances give good tests, is friendly and illustrated, and based on Scetbon and Varoquaux (2019).

The results of Husain (2020) connect IPMs to transport metrics, regularisation theory, and classification.

Feydy et al. (2019) connects MMD to optimal transport losses.

Arbel et al. (2019) also looks pertinent and has some connections to Wasserstein gradient flows.

Hmm. See Gretton, Sriperumbudur, et al. (2012).

MMD is included in the ITE toolbox (estimators).

The GeomLoss library provides efficient GPU implementations for:

- Kernel norms (also known as Maximum Mean Discrepancies).
- Hausdorff divergences, which are positive definite generalizations of the Chamfer-ICP loss and are analogous to log-likelihoods of Gaussian Mixture Models.
- Debiased Sinkhorn divergences, which are affordable yet positive and definite approximations of Optimal Transport (Wasserstein) distances.

It is hosted on GitHub and distributed under the permissive MIT license.

GeomLoss functions are available through the custom PyTorch layers `SamplesLoss`, `ImagesLoss` and `VolumesLoss`, which allow you to work with weighted point clouds (of any dimension), density maps and volumetric segmentation masks.

Arbel, Michael, Anna Korba, Adil Salim, and Arthur Gretton. 2019.“Maximum Mean Discrepancy Gradient Flow.” In*Proceedings of the 33rd International Conference on Neural Information Processing Systems*, 32:6484–94. Red Hook, NY, USA: Curran Associates Inc.

Arras, Benjamin, Ehsan Azmoodeh, Guillaume Poly, and Yvik Swan. 2017.“A Bound on the 2-Wasserstein Distance Between Linear Combinations of Independent Random Variables.”*arXiv:1704.01376 [Math]*, April.

Blanchet, Jose, Lin Chen, and Xun Yu Zhou. 2018.“Distributionally Robust Mean-Variance Portfolio Selection with Wasserstein Distances.”*arXiv:1802.04885 [Stat]*, February.

Dellaporta, Charita, Jeremias Knoblauch, Theodoros Damoulas, and François-Xavier Briol. 2022.“Robust Bayesian Inference for Simulator-Based Models via the MMD Posterior Bootstrap.”*arXiv:2202.04744 [Cs, Stat]*, February.

Feydy, Jean, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouve, and Gabriel Peyré. 2019.“Interpolating Between Optimal Transport and MMD Using Sinkhorn Divergences.” In*Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics*, 2681–90. PMLR.

Gretton, Arthur, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012.“A Kernel Two-Sample Test.”*The Journal of Machine Learning Research* 13 (1): 723–73.

Gretton, Arthur, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J Smola. 2008.“A Kernel Statistical Test of Independence.” In*Advances in Neural Information Processing Systems 20: Proceedings of the 2007 Conference*. Cambridge, MA: MIT Press.

Gretton, Arthur, Bharath Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, and Kenji Fukumizu. 2012.“Optimal Kernel Choice for Large-Scale Two-Sample Tests.” In*Proceedings of the 25th International Conference on Neural Information Processing Systems*, 1205–13. NIPS’12. Red Hook, NY, USA: Curran Associates Inc.

Hamzi, Boumediene, and Houman Owhadi. 2021.“Learning Dynamical Systems from Data: A Simple Cross-Validation Perspective, Part I: Parametric Kernel Flows.”*Physica D: Nonlinear Phenomena* 421 (July): 132817.

Husain, Hisham. 2020.“Distributional Robustness with IPMs and Links to Regularization and GANs.”*arXiv:2006.04349 [Cs, Stat]*, June.

Jitkrittum, Wittawat, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, and Arthur Gretton. 2017.“A Linear-Time Kernel Goodness-of-Fit Test.” In*Advances in Neural Information Processing Systems*. Vol. 30. Curran Associates, Inc.

Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, Arthur Gretton, and Bernhard Schölkopf. 2014.“Kernel Mean Shrinkage Estimators.”*arXiv:1405.5505 [Cs, Stat]*, May.

Muandet, Krikamol, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. 2017.“Kernel Mean Embedding of Distributions: A Review and Beyond.”*Foundations and Trends® in Machine Learning* 10 (1-2): 1–141.

Nishiyama, Yu, and Kenji Fukumizu. 2016.“Characteristic Kernels and Infinitely Divisible Distributions.”*The Journal of Machine Learning Research* 17 (1): 6240–67.

Pfister, Niklas, Peter Bühlmann, Bernhard Schölkopf, and Jonas Peters. 2018.“Kernel-Based Tests for Joint Independence.”*Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 80 (1): 5–31.

Rustamov, Raif M. 2021.“Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.”*Stat* 10 (1): e329.

Scetbon, Meyer, and Gael Varoquaux. 2019.“Comparing Distributions:\(\ell_1\) Geometry Improves Kernel Two-Sample Testing.” In*Advances in Neural Information Processing Systems 32*, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, 12306–16. Curran Associates, Inc.

Schölkopf, Bernhard, Krikamol Muandet, Kenji Fukumizu, and Jonas Peters. 2015.“Computing Functions of Random Variables via Reproducing Kernel Hilbert Space Representations.”*arXiv:1501.06794 [Cs, Stat]*, January.

Sejdinovic, Dino, Bharath Sriperumbudur, Arthur Gretton, and Kenji Fukumizu. 2012.“Equivalence of Distance-Based and RKHS-Based Statistics in Hypothesis Testing.”*The Annals of Statistics* 41 (5): 2263–91.

Smola, Alex, Arthur Gretton, Le Song, and Bernhard Schölkopf. 2007.“A Hilbert Space Embedding for Distributions.” In*Algorithmic Learning Theory*, edited by Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, 13–31. Lecture Notes in Computer Science 4754. Springer Berlin Heidelberg.

Song, Le, Kenji Fukumizu, and Arthur Gretton. 2013.“Kernel Embeddings of Conditional Distributions: A Unified Kernel Framework for Nonparametric Inference in Graphical Models.”*IEEE Signal Processing Magazine* 30 (4): 98–111.

Song, Le, Jonathan Huang, Alex Smola, and Kenji Fukumizu. 2009.“Hilbert Space Embeddings of Conditional Distributions with Applications to Dynamical Systems.” In*Proceedings of the 26th Annual International Conference on Machine Learning*, 961–68. ICML ’09. New York, NY, USA: ACM.

Sriperumbudur, B. K., A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. 2008.“Injective Hilbert Space Embeddings of Probability Measures.” In*Proceedings of the 21st Annual Conference on Learning Theory (COLT 2008)*.

Sriperumbudur, Bharath K., Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert R. G. Lanckriet. 2012.“On the Empirical Estimation of Integral Probability Metrics.”*Electronic Journal of Statistics* 6: 1550–99.

Sriperumbudur, Bharath K., Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet. 2010.“Hilbert Space Embeddings and Metrics on Probability Measures.”*Journal of Machine Learning Research* 11 (April): 1517−1561.

Strobl, Eric V., Kun Zhang, and Shyam Visweswaran. 2017.“Approximate Kernel-Based Conditional Independence Tests for Fast Non-Parametric Causal Discovery.”*arXiv:1702.03877 [Stat]*, February.

Szabo, Zoltan, and Bharath K. Sriperumbudur. 2017.“Characteristic and Universal Tensor Product Kernels.”*arXiv:1708.08157 [Cs, Math, Stat]*, August.

Tolstikhin, Ilya O, Bharath K. Sriperumbudur, and Bernhard Schölkopf. 2016.“Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels.” In*Advances in Neural Information Processing Systems 29*, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 1930–38. Curran Associates, Inc.

Zhang, Kun, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2012.“Kernel-Based Conditional Independence Test and Application in Causal Discovery.”*arXiv:1202.3775 [Cs, Stat]*, February.

Zhang, Qinyi, Sarah Filippi, Arthur Gretton, and Dino Sejdinovic. 2016.“Large-Scale Kernel Methods for Independence Testing.”*arXiv:1606.07892 [Stat]*, June.

Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how various probability metrics come out for them.

Since the “difficult” part of the problem is the distance between the covariances, this often ends up being the same as, or at least closely related to, the question of matrix norms, where the matrices in question are the positive (semi-)definite covariance/precision matrices.

A useful analytic result about Wasserstein-2 distance, i.e. \(W_2(\mu;\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\) for \(X\sim\mu\), \(Y\sim\nu\). Two Gaussians may be related thus (Givens and Shortt 1984):\[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]

In the centred case this is even simpler:\[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
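The Givens–Shortt formula is straightforward to evaluate numerically. A minimal numpy/scipy sketch (the function name `gaussian_w2` is mine):

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, Sigma1, mu2, Sigma2):
    """Wasserstein-2 distance between N(mu1, Sigma1) and N(mu2, Sigma2),
    via the Givens-Shortt closed form."""
    s1h = np.real(sqrtm(Sigma1))                  # Sigma_1^{1/2}
    cross = np.real(sqrtm(s1h @ Sigma2 @ s1h))    # (Sigma_1^{1/2} Sigma_2 Sigma_1^{1/2})^{1/2}
    d2 = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2) \
        + np.trace(Sigma1 + Sigma2 - 2.0 * cross)
    return float(np.sqrt(max(d2, 0.0)))           # clamp tiny negative round-off
```

Sanity checks: equal covariances reduce the distance to \(\Vert\mu_1-\mu_2\Vert_2\), and in the centred isotropic case \(\Sigma_i=\sigma_i^2 I_d\) the formula gives \(d^2 = d(\sigma_1-\sigma_2)^2\).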

Pulled from Wikipedia:

\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]

In the centred case this reduces to

\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
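This, too, is easy to evaluate numerically; a minimal numpy sketch (the function name `gaussian_kl` is mine), using log-determinants for numerical stability:

```python
import numpy as np

def gaussian_kl(mu1, Sigma1, mu2, Sigma2):
    """KL( N(mu1, Sigma1) || N(mu2, Sigma2) ) for k-dimensional Gaussians."""
    k = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = np.asarray(mu2) - np.asarray(mu1)
    _, logdet1 = np.linalg.slogdet(Sigma1)   # ln det Sigma_1
    _, logdet2 = np.linalg.slogdet(Sigma2)   # ln det Sigma_2
    return 0.5 * (np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff
                  - k + logdet2 - logdet1)
```

For example, \(D_{\text{KL}}(\mathcal{N}(0,1)\parallel\mathcal{N}(1,1))=\tfrac12\).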

Djalil defines the Hellinger *distance*\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigl(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}\]
via the Hellinger *affinity*\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\,\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2\mathrm{A}(\mu,\nu).\]
For Gaussians it apparently turns out that\[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigl(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr).\]

In multiple dimensions:\[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det\bigl(\frac{\Sigma_1+\Sigma_2}{2}\bigr)^{1/2}} \exp\Bigl(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr).\]
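The multivariate affinity can be computed in log-space to dodge determinant under/overflow; a minimal numpy sketch (the function name `gaussian_hellinger` is mine), returning \(\mathrm{H}=\sqrt{2-2\mathrm{A}}\) per the convention above:

```python
import numpy as np

def gaussian_hellinger(mu1, Sigma1, mu2, Sigma2):
    """Hellinger distance H = sqrt(2 - 2A) between two multivariate Gaussians,
    computing the affinity A in log-space."""
    Sigma_bar = 0.5 * (Sigma1 + Sigma2)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    _, logdet2 = np.linalg.slogdet(Sigma2)
    _, logdet_bar = np.linalg.slogdet(Sigma_bar)
    diff = np.asarray(mu1) - np.asarray(mu2)
    log_A = (0.25 * (logdet1 + logdet2) - 0.5 * logdet_bar
             - diff @ np.linalg.inv(Sigma1 + Sigma2) @ diff / 4.0)
    return float(np.sqrt(2.0 - 2.0 * np.exp(log_A)))
```

With \(\Sigma_1=\Sigma_2=I\) the affinity reduces to \(\exp(-\Vert\Delta m\Vert^2/8)\), which makes a handy test case.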

High speed introduction to the geometry of distances: Moakher and Batchelor (2006), which discusses the problems of comparing covariance matrices on a natural manifold.

The geodesic distance between\(\mathbf{P}\) and\(\mathbf{Q}\) in\(\mathcal{P}(n)\) is given by Lang (1999):\[ \mathrm{d}_R(\mathbf{P}, \mathbf{Q})=\left\|\log \left(\mathbf{P}^{-1} \mathbf{Q}\right)\right\|_F=\left[\sum_{i=1}^n \log ^2 \lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\right]^{1 / 2}, \] where\(\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right), 1 \leq i \leq n\) are the eigenvalues of the matrix\(\mathbf{P}^{-1} \mathbf{Q}\). Because\(\mathbf{P}^{-1} \mathbf{Q}\) is similar to\(\mathbf{P}^{-1 / 2} \mathbf{Q} \mathbf{P}^{-1 / 2}\), the eigenvalues\(\lambda_i\left(\mathbf{P}^{-1} \mathbf{Q}\right)\) are all positive and hence [this] is well defined for all\(\mathbf{P}\) and\(\mathbf{Q}\) of\(\mathcal{P}(n)\).
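In code, the eigenvalues of \(\mathbf{P}^{-1}\mathbf{Q}\) are exactly those of the symmetric generalized eigenproblem \(\mathbf{Q}v=\lambda\mathbf{P}v\), so we never need to form \(\mathbf{P}^{-1}\) explicitly. A minimal scipy sketch (the function name `spd_geodesic` is mine):

```python
import numpy as np
from scipy.linalg import eigh

def spd_geodesic(P, Q):
    """Affine-invariant geodesic distance on the SPD manifold:
    sqrt( sum_i log^2 lambda_i(P^{-1} Q) )."""
    # Generalized symmetric eigenproblem Q v = lambda P v gives lambda_i(P^{-1} Q).
    lam = eigh(Q, P, eigvals_only=True)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))
```

Note the distance is symmetric, since the eigenvalues of \(\mathbf{Q}^{-1}\mathbf{P}\) are the reciprocals of those of \(\mathbf{P}^{-1}\mathbf{Q}\) and \(\log^2\) kills the sign flip.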

With the Gaussian kernel, there is a closed-form expression for the distance of an RV from the standard Gaussian (Rustamov 2021).

The computation of the MMD requires specifying a positive-definite kernel; in this paper we always assume it to be the Gaussian RBF kernel of width\(\gamma\), namely,\(k(x, y)=e^{-\|x-y\|^2 /\left(2 \gamma^2\right)}\). Here,\(x, y \in \mathbb{R}^d\), where\(d\) is the dimension of the code/latent space, and we use\(\|\cdot\|\) to denote the\(\ell_2\) norm. The population MMD can be most straightforwardly computed via the formula (Gretton et al. 2012):\[ \operatorname{MMD}^2(P, Q)=\mathbb{E}_{x, x^{\prime} \sim P}\left[k\left(x, x^{\prime}\right)\right]-2 \mathbb{E}_{x \sim P, y \sim Q}[k(x, y)]+\mathbb{E}_{y, y^{\prime} \sim Q}\left[k\left(y, y^{\prime}\right)\right] \]

[U]sing the sample\(Q_n=\left\{z_i\right\}_{i=1}^n\) of size\(n\), we replace the last two terms by the sample average and the U-statistic respectively to obtain the unbiased estimator:\[ \operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\mathbb{E}_{x, x^{\prime} \sim \mathcal{N}_d}\left[k\left(x, x^{\prime}\right)\right]-\frac{2}{n} \sum_{i=1}^n \mathbb{E}_{x \sim \mathcal{N}_d}\left[k\left(x, z_i\right)\right]+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n k\left(z_i, z_j\right) \] Our main result is the following proposition whose proof can be found in Appendix C: Proposition. The expectations in the expression above can be computed analytically to yield the formula:\[ \operatorname{MMD}_u^2\left(\mathcal{N}_d, Q_n\right)=\left(\frac{\gamma^2}{2+\gamma^2}\right)^{d / 2}-\frac{2}{n}\left(\frac{\gamma^2}{1+\gamma^2}\right)^{d / 2} \sum_{i=1}^n e^{-\frac{\left\|z_i\right\|^2}{2\left(1+\gamma^2\right)}}+\frac{1}{n(n-1)} \sum_{i=1}^n \sum_{j \neq i}^n e^{-\frac{\left\|z_i-z_j\right\|^2}{2 \gamma^2}} \]
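Rustamov's closed form is cheap to evaluate; a minimal numpy sketch (the function name `mmd2_vs_standard_gaussian` is mine):

```python
import numpy as np

def mmd2_vs_standard_gaussian(Z, gamma):
    """Unbiased MMD^2 between N(0, I_d) and a sample Z of shape (n, d),
    for the RBF kernel of width gamma, via Rustamov's closed form."""
    n, d = Z.shape
    # E_{x,x'~N} k(x,x'), computed analytically
    term1 = (gamma**2 / (2.0 + gamma**2)) ** (d / 2.0)
    sq_norms = np.sum(Z**2, axis=1)
    # -(2/n) sum_i E_{x~N} k(x, z_i), computed analytically
    term2 = (2.0 / n) * (gamma**2 / (1.0 + gamma**2)) ** (d / 2.0) \
        * np.sum(np.exp(-sq_norms / (2.0 * (1.0 + gamma**2))))
    # U-statistic over the sample: mean of k(z_i, z_j) over i != j
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    K = np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * gamma**2))
    term3 = (K.sum() - n) / (n * (n - 1))  # drop the diagonal, k(z_i, z_i) = 1
    return float(term1 - term2 + term3)
```

For a large sample actually drawn from \(\mathcal{N}_d\) the estimate hovers near zero, and grows once the sample is shifted away from the origin.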

Extending this to arbitrary Gaussians, or analytic Gaussians, is left as an exercise.

Intuitively, we might prefer kernels other than the RBF when comparing Gaussians specifically; in particular, the RBF is “wasteful” in that it controls moments of all orders, whereas for Gaussians we only care about the first two moments (mean and covariance).

Givens, Clark R., and Rae Michael Shortt. 1984.“A Class of Wasserstein Metrics for Probability Distributions.”*The Michigan Mathematical Journal* 31 (2): 231–40.

Gretton, Arthur, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012.“A Kernel Two-Sample Test.”*The Journal of Machine Learning Research* 13 (1): 723–73.

Lang, Serge. 1999.*Fundamentals of Differential Geometry*. Vol. 191. Graduate Texts in Mathematics. New York, NY: Springer.

Magnus, Jan R., and Heinz Neudecker. 2019.*Matrix differential calculus with applications in statistics and econometrics*. 3rd ed. Wiley series in probability and statistics. Hoboken (N.J.): Wiley.

Meckes, Elizabeth. 2009.“On Stein’s Method for Multivariate Normal Approximation.” In*High Dimensional Probability V: The Luminy Volume*, 153–78. Beachwood, Ohio, USA: Institute of Mathematical Statistics.

Minka, Thomas P. 2000.*Old and new matrix algebra useful for statistics*.

Moakher, Maher, and Philipp G. Batchelor. 2006.“Symmetric Positive-Definite Matrices: From Geometry to Applications and Visualization.” In*Visualization and Processing of Tensor Fields*, edited by Joachim Weickert and Hans Hagen, 285–98. Berlin, Heidelberg: Springer Berlin Heidelberg.

Omladič, Matjaž, and Peter Šemrl. 1990.“On the Distance Between Normal Matrices.”*Proceedings of the American Mathematical Society* 110 (3): 591–96.

Petersen, Kaare Brandt, and Michael Syskind Pedersen. 2012.“The Matrix Cookbook.”

Rustamov, Raif M. 2021.“Closed-Form Expressions for Maximum Mean Discrepancy with Applications to Wasserstein Auto-Encoders.”*Stat* 10 (1): e329.

Takatsu, Asuka. 2008.“On Wasserstein Geometry of the Space of Gaussian Measures,” January.

Zhang, Yufeng, Wanwei Liu, Zhenbang Chen, Ji Wang, and Kenli Li. 2022.“On the Properties of Kullback-Leibler Divergence Between Multivariate Gaussian Distributions.” arXiv.

Reparameterization trick.
A trick where we cleverly transform RVs, keeping track of their Jacobians, to sample from tricky target distributions via a “nice” source distribution.
Useful in e.g. variational inference, especially autoencoders,
for density estimation in probabilistic deep learning.
Pairs well with normalizing flows to get powerful target distributions. Storchastic credits pathwise gradients to Glasserman and Ho (1991) as *perturbation analysis*.

- Shakir Mohamed,Machine Learning Trick of the Day (4): Reparameterisation Tricks:

Suppose we want the gradient of an expectation of a smooth function\(f\) :\[ \nabla_\theta \mathbb{E}_{p(z ; \theta)}[f(z)]=\nabla_\theta \int p(z ; \theta) f(z) d z \] […] This gradient is often difficult to compute because the integral is typically unknown and the parameters\(\theta\), with respect to which we are computing the gradient, are of the distribution\(p(z ; \theta)\).

Now we suppose that we know some function\(g\) such that for some easy distribution\(p(\epsilon)\),\(z | \theta=g(\epsilon, \theta)\). Now we can try to estimate the gradient of the expectation by Monte Carlo:

\[ \nabla_\theta \mathbb{E}_{p(z ; \theta)}[f(z)]=\mathbb{E}_{p(\epsilon)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \] Let's derive this expression and explore the implications of it for our optimisation problem. One-liners give us a transformation from a distribution\(p(\epsilon)\) to another\(p(z)\), thus the differential area (mass of the distribution) is invariant under the change of variables. This property implies that:\[ p(z)=\left|\frac{d \epsilon}{d z}\right| p(\epsilon) \Longrightarrow|p(z) d z|=|p(\epsilon) d \epsilon| \] Re-expressing the troublesome stochastic optimisation problem using random variate reparameterisation, we find:\[ \begin{aligned} & \nabla_\theta \mathbb{E}_{p(z ; \theta)}[f(z)]=\nabla_\theta \int p(z ; \theta) f(z) d z \\ = & \nabla_\theta \int p(\epsilon) f(z) d \epsilon=\nabla_\theta \int p(\epsilon) f(g(\epsilon, \theta)) d \epsilon \\ = & \nabla_\theta \mathbb{E}_{p(\epsilon)}[f(g(\epsilon, \theta))]=\mathbb{E}_{p(\epsilon)}\left[\nabla_\theta f(g(\epsilon, \theta))\right] \end{aligned} \]

- Yuge Shi’s variational inference tutorial is a tour of cunning reparameterisation gradient tricks, written for her paper Shi et al. (2019). She punts some details to Mohamed et al. (2020), which in turn tells me that this adventure continues at reparameterization gradients and Monte Carlo gradient estimation. See also Figurnov, Mohamed, and Mnih (2018), Devroye (2006), and Jankowiak and Obermeyer (2018).
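The derivation above fits in a few lines of code. A minimal numpy sketch of the pathwise estimator for a Gaussian, \(z=g(\epsilon,\theta)=\mu+\sigma\epsilon\) with \(\epsilon\sim\mathcal{N}(0,1)\), and the test function \(f(z)=z^2\), for which \(\mathbb{E}[f(z)]=\mu^2+\sigma^2\) gives known gradients \(\partial_\mu=2\mu\), \(\partial_\sigma=2\sigma\):

```python
import numpy as np

# Reparameterize: z = g(eps, theta) = mu + sigma * eps, eps ~ N(0, 1), so the
# gradient w.r.t. theta = (mu, sigma) passes through the deterministic map g.
rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(100_000)
z = mu + sigma * eps                 # reparameterized samples of N(mu, sigma^2)
# For f(z) = z^2, df/dz = 2z; chain rule through g gives the pathwise gradients:
grad_mu = np.mean(2.0 * z)           # dz/dmu = 1      -> should approach 2*mu
grad_sigma = np.mean(2.0 * z * eps)  # dz/dsigma = eps -> should approach 2*sigma
```

The Monte Carlo averages converge to \(2\mu=3.0\) and \(2\sigma=1.4\), which is precisely the swap of \(\nabla_\theta\) and \(\mathbb{E}_{p(\epsilon)}\) that the derivation licenses.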

Cunning reparameterization maps with desirable properties for nonparametric inference. See normalizing flows.

See transport maps.

Ambrosio, Luigi, Nicola Gigli, and Giuseppe Savare. 2008.*Gradient Flows: In Metric Spaces and in the Space of Probability Measures*. 2nd ed. Lectures in Mathematics. ETH Zürich. Birkhäuser Basel.

Bamler, Robert, and Stephan Mandt. 2017.“Structured Black Box Variational Inference for Latent Time Series Models.”*arXiv:1707.01069 [Cs, Stat]*, July.

Berg, Rianne van den, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. 2018.“Sylvester Normalizing Flows for Variational Inference.” In*UAI18*.

Caterini, Anthony L., Arnaud Doucet, and Dino Sejdinovic. 2018.“Hamiltonian Variational Auto-Encoder.” In*Advances in Neural Information Processing Systems*.

Charpentier, Bertrand, Oliver Borchert, Daniel Zügner, Simon Geisler, and Stephan Günnemann. 2022.“Natural Posterior Network: Deep Bayesian Uncertainty for Exponential Family Distributions.”*arXiv:2105.04471 [Cs, Stat]*, March.

Chen, Changyou, Chunyuan Li, Liqun Chen, Wenlin Wang, Yunchen Pu, and Lawrence Carin. 2017.“Continuous-Time Flows for Efficient Inference and Density Estimation.”*arXiv:1709.01179 [Stat]*, September.

Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018.“Neural Ordinary Differential Equations.” In*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc.

Devroye, Luc. 2006.“Chapter 4 Nonuniform Random Variate Generation.” In*Simulation*, edited by Shane G. Henderson and Barry L. Nelson, 13:83–121. Handbooks in Operations Research and Management Science. Elsevier.

Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. 2016.“Density Estimation Using Real NVP.” In*Advances In Neural Information Processing Systems*.

Figurnov, Mikhail, Shakir Mohamed, and Andriy Mnih. 2018.“Implicit Reparameterization Gradients.” In*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 441–52. Curran Associates, Inc.

Glasserman, Paul, and Yu-Chi Ho. 1991.*Gradient Estimation Via Perturbation Analysis*. Springer Science & Business Media.

Grathwohl, Will, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. 2018.“FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models.”*arXiv:1810.01367 [Cs, Stat]*, October.

Huang, Chin-Wei, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018.“Neural Autoregressive Flows.”*arXiv:1804.00779 [Cs, Stat]*, April.

Jankowiak, Martin, and Fritz Obermeyer. 2018.“Pathwise Derivatives Beyond the Reparameterization Trick.” In*International Conference on Machine Learning*, 2235–44.

Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016.“Improving Variational Inference with Inverse Autoregressive Flow.” In*Advances in Neural Information Processing Systems 29*. Curran Associates, Inc.

Kingma, Diederik P., Tim Salimans, and Max Welling. 2015.“Variational Dropout and the Local Reparameterization Trick.” In*Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2*, 2575–83. NIPS’15. Cambridge, MA, USA: MIT Press.

Kingma, Diederik P., and Max Welling. 2014.“Auto-Encoding Variational Bayes.” In*ICLR 2014 Conference*.

Kingma, Durk P, and Prafulla Dhariwal. 2018.“Glow: Generative Flow with Invertible 1x1 Convolutions.” In*Advances in Neural Information Processing Systems 31*, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 10236–45. Curran Associates, Inc.

Koehler, Frederic, Viraj Mehta, and Andrej Risteski. 2020.“Representational Aspects of Depth and Conditioning in Normalizing Flows.”*arXiv:2010.01155 [Cs, Stat]*, October.

Lin, Wu, Mohammad Emtiyaz Khan, and Mark Schmidt. 2019.“Stein’s Lemma for the Reparameterization Trick with Exponential Family Mixtures.”*arXiv:1910.13398 [Cs, Stat]*, October.

Louizos, Christos, and Max Welling. 2017.“Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In*PMLR*, 2218–27.

Lu, You, and Bert Huang. 2020.“Woodbury Transformations for Deep Generative Flows.” In*Advances in Neural Information Processing Systems*. Vol. 33.

Marzouk, Youssef, Tarek Moselhy, Matthew Parno, and Alessio Spantini. 2016.“Sampling via Measure Transport: An Introduction.” In*Handbook of Uncertainty Quantification*, edited by Roger Ghanem, David Higdon, and Houman Owhadi, 1:1–41. Cham: Springer Heidelberg.

Massaroli, Stefano, Michael Poli, Michelangelo Bin, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. 2020.“Stable Neural Flows.”*arXiv:2003.08063 [Cs, Math, Stat]*, March.

Mohamed, Shakir, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. 2020.“Monte Carlo Gradient Estimation in Machine Learning.”*Journal of Machine Learning Research* 21 (132): 1–62.

Ng, Tin Lok James, and Andrew Zammit-Mangion. 2020.“Non-Homogeneous Poisson Process Intensity Modeling and Estimation Using Measure Transport.”*arXiv:2007.00248 [Stat]*, July.

Papamakarios, George. 2019.“Neural Density Estimation and Likelihood-Free Inference.” The University of Edinburgh.

Papamakarios, George, Iain Murray, and Theo Pavlakou. 2017.“Masked Autoregressive Flow for Density Estimation.” In*Advances in Neural Information Processing Systems 30*, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc.

Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2021.“Normalizing Flows for Probabilistic Modeling and Inference.”*Journal of Machine Learning Research* 22 (57): 1–64.

Pfau, David, and Danilo Rezende. 2020.“Integrable Nonparametric Flows.” In, 7.

Ran, Zhi-Yong, and Bao-Gang Hu. 2017.“Parameter Identifiability in Statistical Machine Learning: A Review.”*Neural Computation* 29 (5): 1151–1203.

Rezende, Danilo Jimenez, and Shakir Mohamed. 2015.“Variational Inference with Normalizing Flows.” In*International Conference on Machine Learning*, 1530–38. ICML’15. Lille, France: JMLR.org.

Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2015.“Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” In*Proceedings of ICML*.

Rippel, Oren, and Ryan Prescott Adams. 2013.“High-Dimensional Probability Estimation with Deep Density Models.”*arXiv:1302.5125 [Cs, Stat]*, February.

Ruiz, Francisco J. R., Michalis K. Titsias, and David M. Blei. 2016.“The Generalized Reparameterization Gradient.” In*Advances In Neural Information Processing Systems*.

Shi, Yuge, N. Siddharth, Brooks Paige, and Philip H. S. Torr. 2019.“Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models.”*arXiv:1911.03393 [Cs, Stat]*, November.

Spantini, Alessio, Ricardo Baptista, and Youssef Marzouk. 2022.“Coupling Techniques for Nonlinear Ensemble Filtering.”*SIAM Review* 64 (4): 921–53.

Spantini, Alessio, Daniele Bigoni, and Youssef Marzouk. 2017.“Inference via Low-Dimensional Couplings.”*Journal of Machine Learning Research* 19 (66): 2639–709.

Tabak, E. G., and Cristina V. Turner. 2013.“A Family of Nonparametric Density Estimation Algorithms.”*Communications on Pure and Applied Mathematics* 66 (2): 145–64.

Tabak, Esteban G., and Eric Vanden-Eijnden. 2010.“Density Estimation by Dual Ascent of the Log-Likelihood.”*Communications in Mathematical Sciences* 8 (1): 217–33.

Wang, Prince Zizhuang, and William Yang Wang. 2019.“Riemannian Normalizing Flow on Variational Wasserstein Autoencoder for Text Modeling.” In*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*, 284–94. Minneapolis, Minnesota: Association for Computational Linguistics.

Xu, Ming, Matias Quiroz, Robert Kohn, and Scott A. Sisson. 2018.“Variance Reduction Properties of the Reparameterization Trick.”*arXiv:1809.10330 [Cs, Stat]*, September.

Yang, Yunfei, Zhen Li, and Yang Wang. 2021.“On the Capacity of Deep Generative Networks for Approximating Distributions.”*arXiv:2101.12353 [Cs, Math, Stat]*, January.

Zahm, Olivier, Paul Constantine, Clémentine Prieur, and Youssef Marzouk. 2018.“Gradient-Based Dimension Reduction of Multivariate Vector-Valued Functions.”*arXiv:1801.07922 [Math]*, January.

Zhang, Xin, and Andrew Curtis. 2021.“Bayesian Geophysical Inversion Using Invertible Neural Networks.”*Journal of Geophysical Research: Solid Earth* 126 (7): e2021JB022320.