Welcome to the probability inequality mines!
Concentration inequalities quantify the situation where something in your process (measurement, estimation) means that you can be pretty sure that a whole bunch of your probability mass is concentrated somewhere in particular.
As undergraduates we run into central limit theorems, but there are many more diverse ways we can keep track of our probability, or at least most of it. This idea is a basic workhorse in univariate probability, and turns out to be yet more essential in multivariate matrix probability, as seen in matrix factorisation, compressive sensing, PAC-bounds and suchlike.
Overviews include
For any nonnegative random variable \(X,\) and \(t>0\) \[ \mathbb{P}\{X \geq t\} \leq \frac{\mathbb{E} X}{t} \] Corollary: if \(\phi\) is a strictly monotonically increasing nonnegative-valued function then for any random variable \(X\) and real number \(t\) \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\{\phi(X) \geq \phi(t)\} \leq \frac{\mathbb{E} \phi(X)}{\phi(t)} \]
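A quick Monte Carlo sanity check of Markov’s inequality (my own sketch, not from any reference implementation): for an Exponential(1) variable, compare the empirical tail probability against \(\mathbb{E}X/t\).

```python
# Monte Carlo check of Markov's inequality P{X >= t} <= E[X] / t
# for X ~ Exponential(1), which is nonnegative with E[X] = 1.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)

for t in [1.0, 2.0, 5.0]:
    empirical = (x >= t).mean()      # estimate of P{X >= t}
    markov = x.mean() / t            # Markov upper bound
    print(f"t={t}: P(X>=t) ~ {empirical:.4f} <= {markov:.4f}")
```

The bound is loose (the true tail is \(e^{-t}\)), which is the usual price of such generality.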
Taking \(\phi(x)=e^{s x}\) where \(s>0,\) for any random variable \(X,\) and any \(t>0,\) we have \[ \mathbb{P}\{X \geq t\}=\mathbb{P}\left\{e^{s X} \geq e^{s t}\right\} \leq \frac{\mathbb{E} e^{s X}}{e^{s t}} \] Once again, we choose \(s\) to make the bound as tight as possible.
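For a concrete example of tuning \(s\) (my own sketch): for \(X\sim\mathcal{N}(0,1)\) we have \(\mathbb{E}e^{sX}=e^{s^{2}/2}\), so the bound is \(\exp(s^{2}/2-st)\), minimised at \(s=t\) to give \(\mathbb{P}\{X\geq t\}\leq e^{-t^{2}/2}\).

```python
# Chernoff bound for a standard Gaussian: minimise exp(s^2/2 - s*t)
# over a grid of s > 0, and compare with the analytic optimum at s = t.
import numpy as np

def chernoff_bound(t, s_grid):
    # tightest bound over the grid of candidate s values
    return np.exp(s_grid ** 2 / 2 - s_grid * t).min()

t = 3.0
s_grid = np.linspace(0.01, 10.0, 1000)
numeric = chernoff_bound(t, s_grid)
analytic = np.exp(-t ** 2 / 2)       # optimum at s = t
print(numeric, analytic)
```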
🏗
Let \(g: \mathcal{X}^{n} \rightarrow \mathbb{R}\) be a real-valued measurable function of \(n\) variables. Efron-Stein inequalities concern the difference between the random variable \(Z=g\left(X_{1}, \ldots, X_{n}\right)\) and its expected value \(\mathbb{E} Z\) when \(X_{1}, \ldots, X_{n}\) are arbitrary independent random variables.
Write \(\mathbb{E}_{i}\) for the expectation taken over the variable \(X_{i}\) alone, holding the others fixed; that is, \(\mathbb{E}_{i} Z=\mathbb{E}\left[Z \mid X_{1}, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n}\right].\) Then \[ \operatorname{Var}(Z) \leq \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-\mathbb{E}_{i} Z\right)^{2}\right] \]
Now, let \(X_{1}^{\prime}, \ldots, X_{n}^{\prime}\) be an independent copy of \(X_{1}, \ldots, X_{n}\), and define \[ Z_{i}^{\prime}=g\left(X_{1}, \ldots, X_{i}^{\prime}, \ldots, X_{n}\right). \] Alternatively, then, \[ \operatorname{Var}(Z) \leq \frac{1}{2} \sum_{i=1}^{n} \mathbb{E}\left[\left(Z-Z_{i}^{\prime}\right)^{2}\right] \] Nothing here seems to constrain the variables to be real-valued, merely the function \(g\); nonetheless these inequalities apparently do not work for matrix variables as written. You need the matrix Efron-Stein results for that.
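Here is a numerical illustration of the exchangeable-pair version (my own sketch): take \(Z=\max(X_1,\dots,X_n)\) for uniform \(X_i\) and check that the empirical variance sits below the bound.

```python
# Check Var(Z) <= 1/2 * sum_i E[(Z - Z_i')^2] for Z = max of n uniforms,
# where Z_i' resamples coordinate i from an independent copy.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 200_000
X = rng.uniform(size=(reps, n))
Xp = rng.uniform(size=(reps, n))      # independent copy X'
Z = X.max(axis=1)

bound = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = Xp[:, i]               # replace coordinate i only
    Zi = Xi.max(axis=1)
    bound += 0.5 * ((Z - Zi) ** 2).mean()

print(Z.var(), bound)                 # Var(Z) = 5/252 ~ 0.0198, below the bound
```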
🏗
For the Gaussian distribution. Filed there, perhaps?
🏗
Let us copy from wikipedia:
Heuristically: if we pick \(N\) complex numbers \(x_1,\dots,x_N \in\mathbb{C}\), and add them together, each multiplied by jointly independent random signs \(\pm 1\), then the expected value of the sum’s magnitude is close to \(\sqrt{|x_1|^{2}+ \cdots + |x_N|^{2}}\).
Let \(\{\varepsilon_n\}_{n=1}^N\) be i.i.d. random variables with \(P(\varepsilon_n=\pm1)=\frac12\) for \(n=1,\ldots, N\), i.e., a sequence with the Rademacher distribution. Let \(0<p<\infty\) and let \(x_1,\ldots,x_N \in \mathbb{C}\). Then
\[ A_p \left( \sum_{n=1}^N |x_n|^2 \right)^{1/2} \leq \left(\operatorname{E} \left|\sum_{n=1}^N \varepsilon_n x_n\right|^p \right)^{1/p} \leq B_p \left(\sum_{n=1}^N |x_n|^2\right)^{1/2} \]
for some constants \(A_p,B_p>0\). It is a simple matter to see that \(A_p = 1\) when \(p \ge 2\), and \(B_p = 1\) when \(0 < p \le 2\).
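A Monte Carlo illustration (my own sketch, with an arbitrary vector \(x\)): at \(p=2\) the moment equals \(\|x\|_2\) exactly, and the \(p=1\) and \(p=4\) moments bracket it.

```python
# Khintchine in action: compare (E|sum eps_n x_n|^p)^{1/p} with ||x||_2
# for Rademacher signs eps_n and a fixed vector x.
import numpy as np

rng = np.random.default_rng(2)
x = np.array([3.0, 4.0, 12.0])        # ||x||_2 = 13
eps = rng.choice([-1.0, 1.0], size=(500_000, x.size))
sums = eps @ x                        # random signed sums

l2 = np.sqrt((x ** 2).sum())
moments = {}
for p in [1.0, 2.0, 4.0]:
    moments[p] = (np.abs(sums) ** p).mean() ** (1 / p)
    print(f"p={p}: moment {moments[p]:.3f}, l2 norm {l2:.1f}")
```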
🏗
If we fix our interest to matrices in particular, some fun things arise. See Matrix concentration inequalities
Concentration inequalities for matrix-valued random variables.
J. A. Tropp (2015) summarises:
In recent years, random matrices have come to play a major role in computational mathematics, but most of the classical areas of random matrix theory remain the province of experts. Over the last decade, with the advent of matrix concentration inequalities, research has advanced to the point where we can conquer many (formerly) challenging problems with a page or two of arithmetic.
Are these related?
Nikhil Srivastava’s Discrepancy, Graphs, and the Kadison-Singer Problem has an interesting example of bounds via discrepancy theory (and only indirectly probability). D. Gross (2011) is also readable, and gives results for matrices over the complex field.
As discussed in, e.g. Paulin, Mackey, and Tropp (2016).
Let \(\mathbf{X} \in \mathbb{H}^{d}\) be a random matrix. For all \(t>0\) \[ \mathbb{P}\{\|\mathbf{X}\| \geq t\} \leq \inf _{p \geq 1} t^{-p} \cdot \mathbb{E}\|\mathbf{X}\|_{S_{p}}^{p} \] Furthermore, \[ \mathbb{E}\|\mathbf{X}\| \leq \inf _{p \geq 1}\left(\mathbb{E}\|\mathbf{X}\|_{S_{p}}^{p}\right)^{1 / p}. \]
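A numerical illustration of the second bound (my own sketch): draw random symmetric Gaussian matrices and compare the mean operator norm against the Schatten-\(p\) moment bounds.

```python
# E||X|| <= (E ||X||_{S_p}^p)^{1/p}: check on random symmetric matrices.
# ||X||_{S_p} is the Schatten p-norm, the l_p norm of the singular values.
import numpy as np

rng = np.random.default_rng(3)
d, reps = 8, 2000

def schatten(a, p):
    s = np.linalg.svd(a, compute_uv=False)
    return (s ** p).sum() ** (1 / p)

mats = []
for _ in range(reps):
    g = rng.normal(size=(d, d))
    mats.append((g + g.T) / 2)        # random symmetric matrix

op = np.mean([np.linalg.norm(a, 2) for a in mats])
bounds = {}
for p in [2, 4, 8, 16]:
    bounds[p] = np.mean([schatten(a, p) ** p for a in mats]) ** (1 / p)
    print(f"p={p}: E||X|| ~ {op:.2f} <= {bounds[p]:.2f}")
```

The empirical inequality holds deterministically here: per sample the operator norm is below every Schatten norm, and the power-mean inequality handles the averaging.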
TBC.
The classical Efron-Stein inequalities are simple; the matrix ones, not so much.
e.g. Paulin, Mackey, and Tropp (2016).
Classic terminals, if you must, with thoughtful UI. macOS only.
🏗 It has many features.
tmux integration.
Python API.
Little utilities that do useful things like scp files from a remote host to your local folder.
(I cannot help but feel superstitiously that a python API for a terminal is a security hole.)
If you are using other OSes, read on.
simple terminal aims to have fewer lines of code than anything else and as few extraneous features as possible.
If you are worried that your current terminal doesn’t use enough RAM, you can use hyper, which is a javascript app version of a terminal. It’s not too bad for one of these web-technology desktop apps built on electron or similar, although that is a low bar. It has lots of sexy features and nice graphics, to compensate for the obviously hefty RAM usage.
Weird quirk 1: It does not support dragging files into the terminal, which pretty much every alternative does. qweasd1’s hyper-drop-file extension enables support.
hyper install hyper-drop-file
Weird quirk 2: Anything which looks remotely like a URL in the terminal becomes a link, which the terminal will aggressively open if you click on it or even drag over it. This is incredibly annoying and slightly dangerous.
Apparently this behaviour has become configurable now, and you can put webLinksActivationKey: ctrl
in your config file to only open links on Ctrl-Click.
Tilix is a terminal emulator that Gnome people tend to like. It has consistent keyboard shortcuts, tiles (but tiles terminals only) and integrates into the Gnome Experience.
Kovid Goyal made a terminal with C inner loops and python UI extensibility, called kitty. It’s not famous, but probably worth checking out, since Kovid is a powerhouse of feature-packed development. In fact, it has too many features, and I’m kind of afraid of how fragile it looks. macOS/Unices.
terminator seems to be an acceptable default option for a pure native GNOME app without many frills.
Alacritty is a GPU-accelerated terminal emulator that aims to draw text really fast. If that was your primary problem, fear not. (Also, what is your world like, that it was?)
terminus supports some HTML graphics, and appears to work-ish.
TermKit, a designerly, graphics-friendly terminal, aims to reinvent terminal protocols! It has a vision statement! However, it’s dead in the water.
Some cool features
- Smart token-based input with inline autocomplete and automatic escaping
- Rich output for common tasks and formats, using MIME types + sniffing
- Asynchronous views for background / parallel tasks
- Full separation between front/back-end
TermKit is not a…
- …Web application. It runs as a regular desktop app.
- …Scripting language like PowerShell or bash. It focuses on executing commands only.
- …Full terminal emulator. It does not aim to e.g. host ‘vim’.
- …Reimplementation of the Unix toolchain. It replaces and/or enhances built-in commands and wraps external tools.
Various notes on fun uses for one’s body, and maintenance thereof. For workplace harm minimisation see ergonomics.
GMB workouts are fun, as are their guides, e.g. handstands. I’m currently trying to work through their muscle up tutorial.
Which apps are good for tracking exercise?
JEFIT, YAYOG, Strong. I’m currently enjoying Fitbod (affiliate link) which uses some kind of basic but effective regression modeling to suggest optimal workouts. That they want to call this “AI” should not dissuade you; we all need to do unsavoury things to make a living.
The Australian Sweat Bathing association advocates for sauna culture in Australia.
A hot bath has benefits similar to exercise, so they claim.
Best public sauna in Sydney: North Sydney Olympic Pool at Milsons Point. Hot, great view, cheap. But there is no relaxing area, and you have to wear your swimmers. However, it is closed 😭 until 2023.
Bondi Icebergs is a similar situation but an even better view and a much worse sauna design.
Pricier, but naked-friendly for private rental: Nature’s Energy Newtown.
I would like an economical, well-supported laptop for doing Linux/FreeBSD/NetBSD stuff. OK, I said at the start “FreeBSD/NetBSD/linux”, but let us be real: all the hardware support is for linux and I am not convinced that bootable FreeBSD laptops are common. I simply do not want to be snarked by BSD evangelists. I could have also claimed I wanted GNU Hurd support, but hey now. Anyway, my goal is to spend less of my life being my own tech support team.
So, Linux laptops.
Thunderbolt 3+ support for an external GPU would be nice. I am less interested in internal GPUs these days; that turned out to be a giant PITA.
The classic go-to linux laptop. Are they still… working? Is that a thing? I checked out of that question ages ago. Sounds like linux-compatibility is contested, but certain hardware configurations do support linux, AFAICT the less fashionable ones. Plus side: can be purchased from local vendor in Australia.
System76 laptops now look less ugly than last time I looked, when their aesthetic was “cereal boxes for cylons.” See, e.g., the Oryx, their hefty GPU model. Like the Razer gear, the fact that there is no company presence in Australia means that taking on such machines is a risk if they need warranty service or weird parts. System76 are the creators of Pop!OS which has various nice features for my own workflow. The darter pro appears to be the model that has enough thunderbolt support for eGPUs without arsing about having internal GPUs.
Introducing the Framework Laptop
Today, we are excited to unveil our first product: the Framework Laptop, a thin, lightweight, high-performance 13.5" notebook that can be upgraded, customized, and repaired in ways that no other notebook can.
We're here to prove that designing products to last doesn't require sacrificing performance, quality, or style. The Framework Laptop meets or beats the best of what's in the category:
the Framework Laptop offers unparalleled options to upgrade, customize and repair:
- Our Expansion Card system makes adapters a thing of the past, letting you choose exactly the ports you want and which side of the notebook you want them on. With four bays, you can select from USB-C, USB-A, HDMI, DisplayPort, MicroSD, ultra fast storage, a high-end headphone amp, and more.
- Along with socketed storage, WiFi, and two slots of memory, the entire mainboard can be swapped to boost performance as we launch updated versions with new CPU generations.
- High-use parts like the battery, screen, keyboard, and color-customizable magnetic-attach bezel are easy to replace. QR codes on each item take you directly to guides and the listing in our web store.
- In addition to releasing new upgrades regularly, we're opening up the ecosystem to enable a community of partners to build and sell compatible modules through the Framework Marketplace.
Sounds like they are targeting Windows especially. I wonder how well it runs linux? I also wonder whether it is swimming against the tide of verifiable hardware in being so interchangeable.
A long, intermittently-tolerably-successful project of mine: keeping the Razer Blade 2017 behaving itself as a linux machine.
Since some OS update, random USB devices have been failing to resume after suspend. OTOH the machine frequently fails to suspend, and sometimes will run at full temperature with the lid closed and everything unplugged, which puts you at risk of data loss, hardware damage, and also your house burning down.
Also the local Australian service agent (who is licensed by Razer but nothing to do with them really) ghosted me last time I tried to exchange cash for hardware repair. I worked it out myself with parts from ebay.
No longer recommended.
I hear that Dell does linux laptops, but they are shy about that in Australia. Dell xps 13 conspicuously does not mention Linux support, and my colleague who uses them has occasional firmware update dramas.
Librem is a linux laptop that claims hardware as open as possible, i.e. more open than System76, who at least aim for open firmware but make some compromises. Librem aims to make fewer compromises.
They have at least one competitor.
MNT Reform is an ARM laptop with swappable parts that is even more open than librem. OTOH it also seems to be a kind of brick whose tech specs are as mundane as its ideological specs are impressive. Is it essentially an overgrown raspberry pi?
Using statistical or machine learning approaches to solve PDEs, and maybe even to perform inference through them. There are various approaches here, which I will document on an ad hoc basis as I need them. No claim is made to completeness.
There are presumably many approaches to ML learning of PDEs.
A classic is to learn a PDE on a grid of values and treat it as a classic convnet regression, and indeed the dynamical treatment of neural nets takes that to an extreme. For various practical reasons I would like to avoid assuming a grid. For one thing, grid systems are memory intensive and need expensive GPUs. For another, it is hard to integrate observations at multiple resolutions into a gridded data system. For a third, the research field of image prediction is too crowded for easy publications.
A grid-free approach is graph networks that learn a topology and interaction system. Nothing wrong with this idea per se, but it does not seem the most compelling approach for my domain of spatiotemporal prediction, where we already know the topology and can avoid all the complicated bits of this approach. So this I will also ignore for now.
Two other approaches will be handled here:
One learns an entire network which defines a function mapping from output-space coordinates to solution values (Raissi, Perdikaris, and Karniadakis 2019). This is the annoyingly-named implicit representation trick, which comes out very simply and elegantly, although it has some limitations.
Another method, used in networks like Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b), learns a PDE propagator or operator which produces a representation of the entire solution surface in terms of some basis functions (e.g. a Fourier basis).
So! Let us examine some methods which use each of these last two approaches.
This might seem weird if you are used to typical ML research. Unlike the usual neural network setting, we do not start with a statistical inference problem, where we must learn an unknown prediction function from data; rather, we have a partially or completely known function (a PDE solver) that we are trying to approximate with a more convenient substitute (a neural approximation to that PDE solver).
That approximant is, IMO, not that exciting as a PDE solver in itself. Probably you could have implemented your reference PDE solver on the GPU, or tweaked it a little, and got a faster PDE solver, so the speed benefit is not so useful.
However, I would like it if the reference solvers were easier to differentiate through, and to construct posteriors with: what you might call tomography, or an inverse problem. Enabling advanced tomography is what I would like to do here; but first we need to approximate the operator… or do we? In fact, if I already know the PDE operator and am implementing it in any case, I could avoid this learning step and simply implement the PDE solver using an off-the-shelf differentiable solver, which should have few disadvantages.
That said, let us now assume that we are learning to approximate a PDE, for whatever reason. In my case it is that I am required to match an industry-standard black-box solver, which is a common reason.
There are several approaches to learning the dynamics of a PDE solver for given parameters.
TODO: Harmonise the notation used in this section with the Fourier neural operator section; right now they match the papers’ notation but not each other.
This body of literature encompasses both “DeepONet” (operator learning) and “PINN” (physics informed neural nets) approaches. Distinctions TBD.
Archetypally, the PINN. Recently these have been hip (Raissi, Perdikaris, and Karniadakis 2017b, 2017a; Raissi, Perdikaris, and Karniadakis 2019; Yang, Zhang, and Karniadakis 2020; Zhang, Guo, and Karniadakis 2020; Zhang et al. 2019). Zhang et al. (2019) credits Lagaris, Likas, and Fotiadis (1998) with originating the idea in 1998, so I suppose this is not super fresh.
Let me introduce the basic “forward” PINN setup as given in Raissi, Perdikaris, and Karniadakis (2019):
In the basic model we have the following problem \[ u_t +\mathcal{N}[u;\eta]=0,\ x \in \Omega,\ t \in[0, T] \] where \(u(t, x)\) denotes the latent (hidden) solution. We assume the differential operator \(\mathcal{N}\) is parameterised by some \(\eta\), which for now we take to be known and suppress. We also assume we have some observations from the true PDE solutions, presumably simulated, or else analytically tractable enough to be given in closed form. The latter case is presumably for benchmarking, as it makes this entire approach pointless AFAICS if the analytic version is easy.
We define the residual network \(f(t, x)\) to be given by the left-hand side of the above, \[ f:=u_t +\mathcal{N}[u], \] and proceed by approximating \(u(t, x)\) with a deep neural network \(u(t, x;\theta)\), which induces a second network \(f(t, x;\theta).\)
The approximation is data-driven, with a sample set \(\mathcal{S}\) from a run of the PDE solver, \[ \mathcal{S}=\left\{ \left\{ u( {t}_u^{(i)}, {x}_u^{(i)}) \right\}_{i=1}^{N_{u}}, \left\{ f(t_{f}^{(i)}, x_{f}^{(i)}) \right\}_{i=1}^{N_{f}} \right\}. \]
\(u(t, x;\theta)\) and \(f(t, x;\theta)\) share parameters \(\theta\) (but differ in output). This seems to be a neural implicit representation-style approach, where we learn functions on coordinates. Each parameter set for the simulator to be approximated is a new dataset, and training examples are pointwise-sampled from the solution.
We train by minimising a combined loss, \[ \mathcal{L}(\theta)=\operatorname{MSE}_{u}(\theta)+\operatorname{MSE}_{f}(\theta) \] where \[ \operatorname{MSE}_{u}=\frac{1}{N_{u}} \sum_{i=1}^{N_{u}}\left|u\left(t_{u}^{(i)}, x_{u}^{(i)}\right)-u^{(i)}\right|^{2} \] and \[ \operatorname{MSE}_{f}=\frac{1}{N_{f}} \sum_{i=1}^{N_{f}}\left|f\left(t_{f}^{(i)}, x_{f}^{(i)}\right)\right|^{2} \] Loss \(\operatorname{MSE}_{u}\) corresponds to the initial and boundary data while \(\operatorname{MSE}_{f}\) enforces the structure imposed by the defining differential operator at a finite set of collocation points. This trick allows us to learn while nearly enforcing a conservation law.
The key insight is that if we are elbows-deep in a neural network framework anyway, we already have access to automatic differentiation, so differential operations over the input field are basically free.
An example is illustrative. Here is the reference TensorFlow interpretation from Raissi, Perdikaris, and Karniadakis (2019) for Burgers’ equation. In one space dimension, Burgers’ equation with Dirichlet boundary conditions reads \[ \begin{array}{l} u_t +u u_{x}-(0.01 / \pi) u_{x x}=0, \quad x \in[-1,1], \quad t \in[0,1] \\ u(0, x)=-\sin (\pi x) \\ u(t,-1)=u(t, 1)=0 \end{array} \] We define \(f(t, x)\) to be given by \[ f:=u_t +u u_{x}-(0.01 / \pi) u_{x x} \]
The python implementation of these two parts is essentially a naïve transcription of those equations.
def net_u(t, x):
    # u(t, x; theta): neural approximation of the solution
    u = neural_net(tf.concat([t, x], 1), weights, biases)
    return u

def net_f(t, x):
    # f(t, x; theta): the PDE residual, via automatic differentiation
    u = net_u(t, x)
    u_t = tf.gradients(u, t)[0]
    u_x = tf.gradients(u, x)[0]
    u_xx = tf.gradients(u_x, x)[0]
    f = u_t + u * u_x - (0.01 / np.pi) * u_xx
    return f
Because the outputs are parameterised by coordinates, the built-in autodiff does all the work.
The authors summarise the resulting network topology so:
What has this gained us? So far, we have acquired a model which can, the authors assert, solve deterministic PDEs, which is nothing we could not do before. We have sacrificed any guarantee that our method will in fact do well on data from outside our observations. Also, I do not understand how I can plug alternative initial or boundary conditions into this. There is no data input, as such, at inference time, merely coordinates. On the other hand, the authors assert that this is faster and more stable than traditional solvers. It has the nice feature that the solution is continuous in its arguments; there is no grid. As far as NN things go, it has some weird and refreshing properties: it is simple, has small data, and has few tuning parameters.
But! What if we don’t know the parameters of the PDE? Assume the differential operator has a parameter \(\eta\) which is not in fact known. \[ u_t +\mathcal{N}[u;\eta]=0,\ x \in \Omega,\ t \in[0, T] \] The trick, as far as I can tell, is simply to include \(\eta\) among the trainable parameters: \[ f:=u_t+\mathcal{N}[u;\eta] \] and proceed by approximating \(u(t, x;\theta,\eta)\) with a deep neural network, which induces the network \(f(t, x;\theta,\eta).\) Everything else proceeds as before.
Fine; now what? Two obvious challenges from where I am sitting.
Zhang et al. (2019) address point 2 via chaos expansions to handle the PDE emulation as a stochastic process regression, which apparently gives us estimates of parametric and process uncertainty. All diagrams in this section come from that paper.
The extended model adds a random noise parameter \(k(x ; \omega)\):
\[\begin{array}{c} \mathcal{N}_{x}[u(x ; \omega) ; k(x ; \omega)]=0, \quad x \in \mathcal{D}, \quad \omega \in \Omega \\ \text { B.C. } \quad \mathcal{B}_{x}[u(x ; \omega)]=0, \quad x \in \Gamma \end{array}\]The randomness in this could indicate a random coupling term, or uncertainty in some parameter of the model. Think of a Gaussian process prior over the forcing term of the PDE.
We sample this noise parameter also and augment the data set with it, over \(N\) distinct realisations, giving a data set like this:
\[ \mathcal{S}=\left\{ \left\{ k( {t}_u^{(i)}, {x}_u^{(i)}; \omega_{s}) \right\}_{i=1}^{N_{u}}, \left\{ u( {t}_u^{(i)}, {x}_u^{(i)}; \omega_{s}) \right\}_{i=1}^{N_{u}}, \left\{ f(t_{f}^{(i)}, x_{f}^{(i)}) \right\}_{i=1}^{N_{f}} \right\}_{s=1}^{N}. \]
Note that I have kept the time variable explicit (unlike the paper) to match the previous section, but it gets cluttered if we continue to do this, so let us suppress \(t\) hereafter and make it just another axis of a multidimensional \(x\).
So now we approximate \(k\). Why? AFAICT that is because we are going to make a polynomial basis for \(\xi\) which means that we will want few dimensions.
We let \(K\) be the \(N_{k} \times N_{k}\) covariance matrix for the sensor measurements on \(k,\) i.e., \[ K_{i, j}=\operatorname{Cov}\left(k^{(i)}, k^{(j)}\right) \] We take an eigendecomposition of \(K\). Let \(\lambda_{l}\) and \(\phi_{l}\) denote the \(l\)-th largest eigenvalue and its associated normalized eigenvector of \(K\). Then we have \[ K=\Phi \Lambda \Phi^{T} \] where \(\boldsymbol{\Phi}=\left[\phi_{1}, \phi_{2}, \ldots, \phi_{N_{k}}\right]\) is an orthonormal matrix and \(\boldsymbol{\Lambda}=\operatorname{diag}\left(\lambda_{1}, \lambda_{2}, \ldots, \lambda_{N_{k}}\right)\) is a diagonal matrix. Let \(\boldsymbol{k}_{s}=\left[k_{s}^{(1)}, k_{s}^{(2)}, \ldots, k_{s}^{\left(N_{k}\right)}\right]^{T}\) be the results of the \(k\) measurements of the \(s\)-th snapshot, and let \(\boldsymbol{k}_0=\mathbb{E}\boldsymbol{k}\). Then \[ \boldsymbol{\xi}_{s}=\sqrt{\boldsymbol{\Lambda}}^{-1}\boldsymbol{\Phi}^{T}\left(\boldsymbol{k}_{s}-\boldsymbol{k}_{0}\right) \] is a whitened, i.e. uncorrelated, random vector, and hence \(\boldsymbol{k}_{s}\) can be rewritten as a reduced-dimensional expansion \[ \boldsymbol{k}_{s} \approx \boldsymbol{k}_{0}+\boldsymbol{\Phi}^{M}\sqrt{\boldsymbol{\Lambda}^{M}} \boldsymbol{\xi}_{s}^{M}, \quad M<N_{k} \] where \(\boldsymbol{\Phi}^{M}\) and \(\boldsymbol{\Lambda}^{M}\) retain only the top \(M\) eigenpairs. We fix \(M\ll N_k\) and suppress it hereafter.
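In code, the whitening step looks something like this (my own numpy sketch with synthetic snapshots; all names here are mine): eigendecompose the empirical covariance, keep the top \(M\) eigenpairs, and map each snapshot to uncorrelated coordinates \(\xi\).

```python
# Whitening k-sensor snapshots via the eigendecomposition of their
# covariance, then reconstructing from the top M modes.
import numpy as np

rng = np.random.default_rng(4)
N, Nk, M = 5000, 10, 3

# synthetic snapshots with an (exactly) rank-M covariance structure
A = rng.normal(size=(Nk, M))
k_snap = rng.normal(size=(N, M)) @ A.T + 1.0   # (N snapshots, Nk sensors)

k0 = k_snap.mean(axis=0)
K = np.cov(k_snap.T)
lam, Phi = np.linalg.eigh(K)                   # ascending eigenvalues
lam, Phi = lam[::-1], Phi[:, ::-1]             # sort descending

# whitened coordinates: Lambda^{-1/2} Phi^T (k_s - k_0), top M modes
xi = (k_snap - k0) @ Phi[:, :M] / np.sqrt(lam[:M])
print(np.cov(xi.T).round(3))                   # identity: uncorrelated

# reduced-dimension reconstruction k_s ~ k_0 + Phi_M Lambda_M^{1/2} xi_s
k_hat = k0 + (xi * np.sqrt(lam[:M])) @ Phi[:, :M].T
print(np.abs(k_hat - k_snap).max())            # tiny: the data is rank M
```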
Now we have approximated away the correlated \(\omega\) noise in favour of this \(\xi\), of which we have finite-dimensional representations. \[k\left(x_{k}^{(i)} ; \omega_{s}\right) \approx k_{0}\left(x_{k}^{(i)}\right)+\sum_{l=1}^{M} \sqrt{\lambda_{l}} k_{l}\left(x_{k}^{(i)}\right) \xi_{s, l}, \quad M<N_{k}\] Note that this is defined only at the observation points, though.
Next is where we use the chaos expansion trick to construct an interpolant. Suppose the measure of RV \(\xi\) is \(\rho\). We approximate this unknown measure by its empirical measure \(\nu_{S}\). \[ \rho(\boldsymbol{\xi}) \approx \nu_{S}(\boldsymbol{\xi})=\frac{1}{N} \sum_{\boldsymbol{\xi}_{s} \in S} \delta_{\xi_{s}}(\boldsymbol{\xi}) \] where \(\delta_{\xi_{s}}\) is the Dirac measure.
We construct a polynomial basis which is orthogonal with respect to the inner product associated to this measure, specifically \[\begin{aligned} \langle \phi, \psi\rangle &:= \int \phi(x)\psi(x)\rho(x)\mathrm{d}x\\ &\approx \int \phi(x)\psi(x)\nu_{S}(x)\mathrm{d}x \end{aligned}\]
OK, so we construct an orthonormal polynomial basis \(\left\{\psi_{\alpha}(\boldsymbol{\xi})\right\}_{\alpha=0}^{P}\) via a Gram-Schmidt orthogonalization process. With the polynomial basis \(\left\{\psi_{\alpha}(\boldsymbol{\xi})\right\}\) we can write a function \(g(x ; \boldsymbol{\xi})\) in the form of the aPC expansion, \[ g(x ; \boldsymbol{\xi})=\sum_{\alpha=0}^{P} g_{\alpha}(x) \psi_{\alpha}(\boldsymbol{\xi}) \] where each \(g_{\alpha}(x)\) is calculated by \[ g_{\alpha}(x)=\frac{1}{N} \sum_{s=1}^{N} \psi_{\alpha}\left(\boldsymbol{\xi}_{s}\right) g\left(x ; \boldsymbol{\xi}_{s}\right). \]
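The Gram-Schmidt construction against the empirical measure is mechanical; here is a one-dimensional numpy sketch of my own (sample-based inner product, monomial starting basis), which approximately recovers the normalised Hermite polynomials when \(\xi\) is Gaussian.

```python
# Arbitrary polynomial chaos in 1-D: orthonormalise monomials under the
# empirical inner product <phi, psi> = (1/N) sum_s phi(xi_s) psi(xi_s).
import numpy as np

rng = np.random.default_rng(5)
xi = rng.normal(size=20_000)          # samples of the whitened variable
P = 3                                 # maximum polynomial degree

V = np.stack([xi ** a for a in range(P + 1)], axis=1)   # monomials at samples
Q = np.empty_like(V)
for a in range(P + 1):
    v = V[:, a].copy()
    for b in range(a):
        # subtract the projection onto each earlier basis polynomial
        v -= (Q[:, b] * V[:, a]).mean() * Q[:, b]
    Q[:, a] = v / np.sqrt((v ** 2).mean())              # normalise

G = (Q.T @ Q) / len(xi)
print(G.round(3))                     # identity: orthonormal under nu_S
```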
So we are going to pick \(g\) to be some quantity of interest in our sim; in fact, we will take it to be two separate quantities, \(u\) and \(k\).
Then, we can approximate \(k\) and \(u\) at the \(s\) -th snapshot by \[ \tilde{k}\left(x ; \omega_{s}\right)=\widehat{k_{0}}(x)+\sum_{i=1}^{M} \sqrt{\lambda_{i}} \widehat{k_{i}}(x) \xi_{s, i} \] and \[ \tilde{u}\left(x ; \omega_{s}\right)=\sum_{\alpha=0}^{P} \widehat{u_{\alpha}}(x) \psi_{\alpha}\left(\boldsymbol{\xi}_{s}\right). \]
We construct two networks here, one producing the \(\widehat{k_{i}}\) coefficients and one producing the \(\widehat{u_{\alpha}}\) coefficients.
The resulting network topology is
For concreteness, here is the topology for an example problem \(\mathcal{N}:=-\frac{\mathrm{d}}{\mathrm{d} x}\left(k(x ; \omega) \frac{\mathrm{d}}{\mathrm{d} x} u\right)-f\):
At inference time we take observations of \(k\), calculate the whitened \(\xi\), then use the chaos expansion representation to calculate the values at unobserved locations. The networks are trained by minimising the combined loss \[ \mathcal{L}\left(\mathcal{S}\right)=\operatorname{MSE}_{u}+\operatorname{MSE}_{k}+\operatorname{MSE}_{f} \] where \[ \begin{array}{l} \operatorname{MSE}_{u}=\frac{1}{N N_{u}} \sum_{s=1}^{N} \sum_{i=1}^{N_{u}}\left[\left(\tilde{u}\left(x_{u}^{(i)} ; \omega_{s}\right)-u\left(x_{u}^{(i)} ; \omega_{s}\right)\right)^{2}\right] \\ \operatorname{MSE}_{k}=\frac{1}{N N_{k}} \sum_{s=1}^{N} \sum_{i=1}^{N_{k}}\left[\left(\tilde{k}\left(x_{k}^{(i)} ; \omega_{s}\right)-k\left(x_{k}^{(i)} ; \omega_{s}\right)\right)^{2}\right] \end{array} \] and \[ \operatorname{MSE}_{f}=\frac{1}{N N_{f}} \sum_{s=1}^{N} \sum_{i=1}^{N_{f}}\left[\left(\mathcal{N}_{x}\left[\tilde{u}\left(x_{f}^{(i)} ; \omega_{s}\right) ; \tilde{k}\left(x_{f}^{(i)} ; \omega_{s}\right)\right]\right)^{2}\right] \]
After all that I would describe this as a method to construct a stochastic PDE with the desired covariance structure.
OK, all that was very complicated. Presuming the neural networks are perfect, we have a good estimate of the distribution of random parameters and random output of a stochastic PDE evaluated over the whole surface from partial discrete measurements.
How do we estimate the uncertainty introduced by the neural net? Dropout.
Further questions:
Learning to solve a known PDE using a neural network was interesting but it left us somehow divorced from the inference problem of learning dynamics. Perhaps it would be nice to learn an operator that projects estimates of current states forward.
Zongyi Li blogs a neat trick here: We use Fourier transforms to capture resolution-invariant and non-local behaviour in PDE forward-solvers. There is a bouquet of papers designed to leverage this (Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. 2020b, 2020a; Li, Kovachki, Azizzadenesheli, Liu, Stuart, et al. 2020).
See also nostalgebraist who has added some interesting commentary and also fixed up a typo for me (Thanks!).
Learning a PDE operator is naturally expressed by a layer which acts on functions, and which is indifferent to how those functions are discretized, in particular the scale at which they are discretized. Because this is neural-net land we can be a certain type of sloppy, and we don’t worry overmuch about the bandwidth of the functions we approximate; we keep some conservative number of low harmonics from the Fourier transform, use some fairly arbitrary non-Fourier maps too, and conduct the whole thing in a “high”-dimensional space with a vague relationship to the original problem, without worrying too much about what it means.
Code is at zongyi-li/fourier_neural_operator.
The basic concept is heuristic, and the main trick is doing lots of accounting of dimensions, and being OK with an arbitrary-dimensional Fourier transform \(\mathcal{F}:(D \rightarrow \mathbb{R}^{d}) \to (D \rightarrow \mathbb{C}^{d})\), which, applied to a function \(f: D \rightarrow \mathbb{R}^{d}\), looks like this:
\[\begin{aligned} (\mathcal{F} f)_{j}(k) &=\int_{D} f_{j}(x) e^{-2 i \pi\langle x, k\rangle} \mathrm{d} x \\ \left(\mathcal{F}^{-1} f\right)_{j}(x) &=\int_{D} f_{j}(k) e^{2 i \pi\langle x, k\rangle} \mathrm{d} k \end{aligned}\] for each dimension \(j=1, \ldots, d\).
We notionally use the property that convolutions become multiplications under Fourier transform, which motivates the use of convolutions to construct our operators. Good.
Problem set up: We assume that there is a map \(G^{\dagger}\) which arises from the solution of a PDE. Let \(D \subset \mathbb{R}^{d}\) be a bounded, open set which is the argument set of our PDE solutions. We consider the input space \(\mathcal{A}=\mathcal{A}\left(D ; \mathbb{R}^{d_{a}}\right)\) of \(D\to\mathbb{R}^{d_{a}}\) functions, and the solution space \(\mathcal{U}=\mathcal{U}\left(D ; \mathbb{R}^{d_{u}}\right)\) of \(D\to\mathbb{R}^{d_{u}}\) functions. They are both Banach spaces. \(G^{\dagger}: \mathcal{A} \rightarrow \mathcal{U}\) is a map taking input functions to solution functions, which we learn from function pairs \(\left\{a_{j}, u_{j}\right\}_{j=1}^{N}\) where \(a_{j}\in \mathcal{A}\) and \(u_{j}=G^{\dagger}(a_{j})+\epsilon_{j}\), where \(\epsilon_{j}\) is a corrupting noise. Further, we observe these functions only at certain points \(\{x_{1},\dots,x_{n}\}\subset D.\)
We approximate \(G^{\dagger}\) by choosing good parameters \(\theta^{\dagger} \in \Theta\) from the parameter space in some parametric family of maps \(G\left(\cdot, \theta\right)\) so that \(G\left(\cdot, \theta^{\dagger}\right)=G_{\theta^{\dagger}} \approx G^{\dagger}\). Specifically we minimise some cost functional \(C: \mathcal{U} \times \mathcal{U} \rightarrow \mathbb{R}\) to find \[ \min _{\theta \in \Theta} \mathbb{E}_{a \sim \mu}\left[C\left(G(a, \theta), G^{\dagger}(a)\right)\right]. \]
For the neural Fourier operator in particular we assume that \(G_{\theta}\) has a particular iterative form, i.e. \(G= Q \circ V_{T} \circ V_{T-1} \circ \ldots \circ V_{1} \circ P\). We introduce a new space \(\mathcal{V}=\mathcal{V}\left(D ; \mathbb{R}^{d_{v}}\right)\). \(P\) is a map \(\mathcal{A}\to\mathcal{V}\) and \(Q\) is a map \(\mathcal{V}\to\mathcal{U}\), and each \(V_{t}\) is a map \(\mathcal{V}\to\mathcal{V}\). Each of \(P\) and \(Q\) is ‘local’ in that they depend only upon pointwise evaluations of the function, e.g. for \(a\in\mathcal{A}\), \((Pa)(x)=p(a(x))\) for some \(p:\mathbb{R}^{d_{a}}\to\mathbb{R}^{d_{v}}\). Each intermediate function \(v_{t}\) takes values in \(\mathbb{R}^{d_{v}}\). As a rule we are assuming \(d_{v}>d_{a}>d_{u}.\) \(V_{t}\) is not local. In fact, we define \[ (V_{t}v)(x):=\sigma\left(W v(x)+\left(\mathcal{K}(a ; \phi) v\right)(x)\right), \quad \forall x \in D \] where \(\mathcal{K}(\cdot;\phi): \mathcal{A} \rightarrow \mathcal{L}\left(\mathcal{V}, \mathcal{V}\right)\). This map is parameterized by \(\phi \in \Theta_{\mathcal{K}}\). \(W: \mathbb{R}^{d_{v}} \rightarrow \mathbb{R}^{d_{v}}\) is a linear transformation, and \(\sigma: \mathbb{R} \rightarrow \mathbb{R}\) is a local, component-wise, non-linear activation function. \(\mathcal{K}(a ; \phi)\) is a kernel integral transformation, by which is meant \[ \left(\mathcal{K}(a ; \phi) v\right)(x):=\int_{D} \kappa_{\phi}(x, a(x), y, a(y)) v(y) \mathrm{d} y, \quad \forall x \in D \] where \(\kappa_{\phi}: \mathbb{R}^{2\left(d+d_{a}\right)} \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) is some mapping parameterized by \(\phi \in \Theta_{\mathcal{K}}\).
Anyway, it looks like this:
We immediately throw out the dependence on \(a\) in the kernel definition replacing it with \[\kappa_{\phi}(x, a(x), y, a(y)) := \kappa_{R}(x-y)\] so that the integral operator becomes a convolution. This convolution can be calculated cheaply in Fourier space, which suggests we may as well define and calculate it also in Fourier space. Accordingly, the real work happens when they define the Fourier integral operator \[ \left(\mathcal{K}(\phi) v\right)(x)=\mathcal{F}^{-1}\left(R \cdot\left(\mathcal{F} v\right)\right)(x) \quad \forall x \in D \] where \(R\) is the Fourier transform of a periodic function \(\kappa: D \rightarrow \mathbb{R}^{d_{v} \times d_{v}}\) parameterized by \(\phi \in \Theta_{\mathcal{K}}\). Checking our units here, we have that \(\left(\mathcal{F} v\right):D \to \mathbb{C}^{d_{v}}\) and \(R (k): D \to \mathbb{C}^{d_{v} \times d_{v}}\). In practice, since we can work with a Fourier series rather than a continuous transform, we will choose \(k\in\{0,1,2,\dots,k_{\text{max}}\}\) and then \(R\) can be represented by a tensor \(\mathbb{C}^{k_{\text{max}}\times d_{v}\times d_{v}}.\) NB - my calculations occasionally came out differing from the versions the authors gave in the paper with regards to the dimensionality of the spaces. Not quite sure who is right here. Caveat emptor. We can use a different \(W\) and \(R\) for each iteration if we want, say \(\{W_t,R_t\}_{1\leq t \leq T}\). So, the parameters of each of these, plus those of the maps \(P,Q\) comprise the parameters of the whole process.
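To make the shapes concrete, here is a minimal numpy sketch of one Fourier layer on a regularly sampled 1-D domain. The function name, the ReLU choice, and the real-FFT convention are my assumptions, not the paper's:

```python
import numpy as np

def fourier_layer(v, R, W, k_max):
    """One spectral-convolution layer, sketched: v -> relu(W v + F^{-1}(R . F v)).

    v : (n, d_v) real array, the function v sampled at n grid points
    R : (k_max, d_v, d_v) complex array of learned spectral weights
    W : (d_v, d_v) real array, the pointwise linear map
    """
    n, _ = v.shape
    v_hat = np.fft.rfft(v, axis=0)             # Fourier coefficients, (n//2+1, d_v)
    out_hat = np.zeros_like(v_hat)
    # Keep only the lowest k_max modes, each transformed by its own matrix R[k].
    out_hat[:k_max] = np.einsum('kij,kj->ki', R, v_hat[:k_max])
    conv = np.fft.irfft(out_hat, n=n, axis=0)  # back to physical space, (n, d_v)
    return np.maximum(0.0, v @ W.T + conv)     # pointwise nonlinearity
```

Stacking \(T\) of these between the pointwise lifting map \(P\) and projection \(Q\) gives the iterative form above; because the learned weights live on Fourier modes rather than grid points, the same \(R\) and \(W\) can be applied on a grid of any resolution.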
Anyway, every step in this construction is differentiable in those parameters, and some of them can even be optimised by FFTs and so on, so we are done with the setup, and have an operator that can be learned from data. Is it any good? Empirically the authors report that it is fast and precise, but I have yet to try it myself.
Quibble: They use the term ‘resolution-invariant’ loosely. They do not actually prove that things are resolution invariant per se; it is not even clear what that would specifically mean in this context (no attempt is made to prove Nyquist conditions or other sampling-theoretic properties). What is clear is that there is an obvious interpretation of their solution as a continuous operator, in the sense that it can be evaluated at arbitrary points for the same computational cost as evaluating it at the training points. Thus there is a sense in which it does not depend upon the resolution of the training set, in that we don’t need to rescale the coefficients of the solution to evaluate the functions our operator produces at unobserved coordinates. You can, in a certain sense, treat many network structures as discrete approximations to PDE operators with a certain resolution (at least, deep Resnets with ReLU activations have such an interpretation, presumably others) and then use resampling methods to evaluate them at a different resolution, which is a more laborious process that potentially gets a similar result — see the notebook on deep learning as dynamical system for an example of doing that.
Next quibble: Why Fourier transforms, instead of a different basis? The Fourier transforms as used in this paper are truncated Fourier series, which amounts to an assumption that the basis functions are periodic, which in turn amounts to assuming that the domain of the functions is toroidal, or, I suppose, a box with uniform boundary conditions. This seems like an implausible restriction. Li, Kovachki, Azizzadenesheli, Liu, Bhattacharya, et al. (2020b) argue that this does not matter because the edge effects can be more-or-less ignored and that things will still work OK because of the “local linear projection” part, which is … fine, I guess? It is a little too vague, though. Would alternative bases fix that, or just be more of an annoying PITA? It is suggestive that some other basis expansion might be nicer. Would there be anything to gain from learning a basis for the expansion? In fact the physics informed neural networks mentioned on this page learn basis functions, although they make a slightly different use of them.
Also, this uses a lot of the right buzzwords to sound like a kernel trick approach, and one can’t help but feel there might be a natural Gaussian process regression formulation with a nice Bayesian interpretation. But I can’t see an obvious way of making that fly; the operator passes through a lot of nonlinear functions and thus the pushforward measure will get gnarly, let alone the correlation kernel. I suppose it could possibly be approximated as ‘just’ another deep Gaussian process with some tactical assumptions, or infinite-width asymptotics.
Lu, Jin, and Karniadakis (2020) is from the people who brought you PINN, above. The setup is related, but AFAICT differs in a few ways.
The authors argue they have found a good topology for this
A DeepONet consists of two sub-networks, one for encoding the input function at a fixed number of sensors \(x_i, i = 1, \dots, m\) (branch net), and another for encoding the locations for the output functions (trunk net).
This addresses some problems with generalisation that make the PINN setup seem unsatisfactory; in particular we can change the inputs, or project arbitrary inputs forward.
The boundary conditions and input points appear to stay fixed though, and inference of the unknowns is still vexed.
Tomography through PDEs.
Suppose I have a PDE, possibly with some unknown parameters in the driving equation. All being equal I can do not too badly at approximating that with the tools already mentioned. What if I wish to simultaneously infer some unknown inputs, and in fact learn to solve for such inputs? Then we consider it as an inverse problem. The methods so far do not solve that kind of problem, but if we squint at them a little we might hope they do better.
Liu, Yeo, and Lu (2020) is one approach, generalizing the approach of F. Sigrist, Künsch, and Stahel (2015b) to a system with noise.
OK, suppose I am keen to make another method again that will do clever things to augment the potential of PDE solvers. To this end it would be nice to have a PDE solver that is not a completely black box but which you can interrogate for useful gradients. Obviously all PDE solvers use derivative information, but only some of them expose that to users so that we can use them as an implicit step in machine learning algorithms.
OTOH, there is a lot of stuff that needs doing in PDEs that maybe we would like a more full-featured library to solve. That is why PDE libraries are a thing. A common thing. A massively common thing which is endlessly re-implemented. Some of the solvers listed at PDE solvers do expose the information we want; and some have been re-implemented inside machine learning frameworks. How about I list some of each?
DeepXDE is the reference solver for PINN and DeepONet.
Use DeepXDE if you need a deep learning library that
- solves forward and inverse partial differential equations (PDEs) via physics-informed neural network (PINN),
- solves forward and inverse integro-differential equations (IDEs) via PINN,
- solves forward and inverse fractional partial differential equations (fPDEs) via fractional PINN (fPINN),
- approximates functions from multi-fidelity data via multi-fidelity NN (MFNN),
- approximates nonlinear operators via deep operator network (DeepONet),
- approximates functions from a dataset with/without constraints.
You might need to moderate your expectations a little. This is an impressive library, but as covered above, some of the types of problems that it can solve are more limited than you might hope upon reading the description. Think of it as a neural network library that handles certain PDEs and you will not go too far astray.
ADCME is suitable for conducting inverse modeling in scientific computing; specifically, ADCME targets physics informed machine learning, which leverages machine learning techniques to solve challenging scientific computing problems. The purpose of the package is to:
- provide differentiable programming framework for scientific computing based on TensorFlow automatic differentiation (AD) backend;
- adapt syntax to facilitate implementing scientific computing, particularly for numerical PDE discretization schemes;
- supply missing functionalities in the backend (TensorFlow) that are important for engineering, such as sparse linear algebra, constrained optimization, etc.
Applications include
- physics informed machine learning (a.k.a., scientific machine learning, physics informed learning, etc.)
- coupled hydrological and full waveform inversion
- constitutive modeling in solid mechanics
- learning hidden geophysical dynamics
- parameter estimation in stochastic processes
The package inherits the scalability and efficiency from the well-optimized backend TensorFlow. Meanwhile, it provides access to incorporate existing C/C++ codes via the custom operators. For example, some functionalities for sparse matrices are implemented in this way and serve as extendable “plugins” for ADCME.
TenFEM offers a small selection of differentiable FEM solvers for Tensorflow.
JuliaFEM is an umbrella organisation supporting julia-backed FEM solvers. The documentation is tricksy, but check out the examples: Analyses and solvers · JuliaFEM.jl. I assume these are all differentiable since that is a selling point of the SciML.jl ecosystem they spring from.
Trixi.jl is a numerical simulation framework for hyperbolic conservation laws written in Julia. A key objective for the framework is to be useful to both scientists and students. Therefore, next to having an extensible design with a fast implementation, Trixi is focused on being easy to use for new or inexperienced users, including the installation and postprocessing procedures.
FEniCS also seems to be a friendly PDE solver, albeit lacking GPU support. However, it does have an interface to pytorch, barkm/torch-fenics.
“Sparse simulator” Taichi is presumably also able to solve PDEs? 🤷🏼♂️ If so that would be nifty because it is also differentiable. I suspect it is more of a graph network approach.
\(P\), the size of the basis, depends on the highest allowed polynomial order \(r\) in \(\psi_{\alpha}(\boldsymbol{\xi}),\) following the formula \[ P+1=\frac{(r+M) !}{r ! M !}. \]↩︎
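That count is just a binomial coefficient, \(P+1=\binom{r+M}{M}\), which we can sanity-check (the function name is mine):

```python
from math import comb

def pce_basis_size(r, M):
    """Number of basis terms P+1 for maximum total polynomial order r
    in M variables: (r+M)!/(r! M!) = C(r+M, M)."""
    return comb(r + M, M)

# e.g. order 3 in 2 variables: 5!/(3! 2!) = 10 basis polynomials
assert pce_basis_size(3, 2) == 10
```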
A rationalist is a person who, upon reading about the social nature of human cognition, forms a social group designed to do cognition.
This is why I am interested in the output of this community/low-key social experiment. Even though it is dunderheaded and vexatious like all communities, it seems to bat somewhat above average in terms of the proportion of the chatter that is useful and/or true and/or interesting.
To participate feels slightly like participating in academia, but with a different, lower bar for entry.
This avoids some false negatives.
People who have non academic careers can be involved and bring perspective from outside the ivory tower.
It also leads to false positives.
Many aspirational community members who are not as good at being clever as they are at adopting the trappings thereof remain chattering away in the name of free speech.
Academia does a better job at removing such people, performance-managing them out of the ivory tower, or possibly shoving them off it, or sticking them on a workplace safety committee with ill-defined scope.
This is not a spot to evangelise for the rationalist community, which I am too cool to consider myself part of because I only join communities with meta- or post- as their prefixes. But they do come up with some fun stuff, with which I am broadly sympathetic and upon which I keep tabs.
The rationalist path to reason has definite parallels with the induction process into a cult, although notionally a cult wherein the inner truths you unlock are distinguished in certain ways
But back to the parallels: There is a sacred text and much commentary by the learned. This is not a criticism, per se. This community is all about knowingly sticking levers into the weird crevices of the human mind, including religious ones.
Exactly like every other community online, the rationalist community labours under the burden of being judged by the self-selected, most attention-grabbingly grating of the people who claim to be part of it. For all the vaunted claims that the rationalists have fostered healthy norms of robust intellectual debate and so on, their comments are a mess just like everyone else’s. This is empirically verifiable. It is a recommended experience to try to contribute to the discussion in the comment threads dangling from, say, some Scott Alexander article. One with the typical halo of erudition, one that hits all the right notes of making you feel smart because you had that “a-HA” feeling and nodded along to it. “Oh!” you might think to yourself, “this intellectual ship would be propelled further out into the oceans of truth if I stuck my oar in, and other people who are, as I, elevated enough to read this blog, they will see my cleverness in turn, and we will together row ourselves onward, like brain-vikings in an intellectual longboat.”
You might think that, possibly with a better metaphor if you are a superior person.
But I’ll lay you odds of ~~4:1~~ 5:2 against anything fun happening when you put this to experimental test.
More likely thereafter you will find yourself in the usual lightly-moderated internet dogpile of people straw-manning and talking past each other in their haste to enact the appearance of healthy norms of thoughtful, robust debate, mostly without the more onerous labour of doing thoughtful, robust debate.
No mistake, I think some useful and interesting debates have come out of card-carrying rationalists.
Even, occasionally, from the comment threads.
Just not often, between all the facile value-signalling and people claiming other people’s value systems are religions.
I doubt that the bloggers who host these blogs would themselves argue otherwise, or even find it surprising, but you know, it still seems to startle neophytes and journalists.
It is likely that the modest odds of a good debate are nonetheless better than the baseline of extremely tiny odds elsewhere on the internet.
Occasionally I feel that rationalists set themselves up with a difficult task in this regard. The preponderance of reader surveys and comment engagement seems to indicate that rationalists are prone to regarding people who show up and fill out a survey as community members. This leads to a classic online social movement problem, which might be summarised by Ben Sixsmith’s discussion of online-community-as-religion, with a few edits:
Participation in online communities requires far less personal commitment than those of real life. And commitment has often cloaked hypocrisy. Men could play the role of ~~God-fearing family men~~ rationalists in public, for example, while ~~cheating on their wives and abusing their kids~~ failing to participate in prediction markets. Being a respectable member of their community depended, to a great extent, on being a ~~family~~ rational man, but being a respectable member of online ~~right-wing~~ rationalist communities depends only on endorsing the concept.
This might be a feature rather than a bug for their community design goals. They do, after all, want to include the viewpoints of people who are excluded from other online communities of discourse. And there are systems to surface higher-grade morsels from the thought gumbo. Vehicles such as LessWrong surface some quality content.
Long story short, the rationalist corner of the internet is still full of social climbing, facile virtue signalling, trolling and general foolishness. If we insist on judging communities en masse though, which we do, the bar is presumably not whether they are all angels, or every community must be damned, but whether they do better than the (toxic, atavistic) baseline. Perhaps this experiment is attaining the very best human beings can do on the internet, and this shambling nerdy quagmire has the highest admixture proportion of quality thought possible from an open, self-policing community.
A concrete example with actual names and events is illustrative. Here are some articles on that theme.
Gideon Lewis-Kraus, Slate Star Codex and Silicon Valley’s War Against the Media
Scott Aaronson, grand anticlimax: the New York Times on Scott Alexander
Cade Metz in the New York Times, Why Slate Star Codex is Silicon Valley’s safe space
Elizabeth Spiers, Slate Star Clusterfuck attempts to put this in the NYT-versus-Silicon-Valley perspective:
This demand for unalloyed positivity is exacerbated by a reactionary grievance culture in some corners of the tech industry that interprets critique as persecution, in part because of a widespread belief that good intentions exculpate bad behavior. Why be critical of people who are just trying to change the world?
Tanner Greer did a round-up: The Framers and the Framed: Notes On the Slate Star Codex Controversy
There is some invective on the theme of renewing journalism more broadly:
The actual story here IMO, if we ignore the particulars of the effects upon these protagonists for a moment, is that the internet of dunks grates against the internet of debate, if the latter is a thing except in our imaginations.
That said, N-Gate’s burn raises a fun point:
just because the comments section of some asshole’s blog happens to be a place where technolibertarians cross-pollinate with white supremacists, says Hackernews, doesn’t mean it’s fair to focus on that instead of on how smart that blog’s readership has convinced itself it is. So smart, in fact, that to criticize them at all is tantamount to an admission that you’re up to something. This sort of censorship, concludes Hackernews, should never have been allowed to be published.
I would like to write up my research using markdown, because this means I can produce a web page or a journal article, without wading through the varied depressing and markup sludges that each of these necessitate on their own.
Well, writing it in markdown is a vexing alternative to such sludge that nearly works! Which is more than most things do, so I recommend it despite the vague miasma of pragmatic compromise that hangs over it, as the alternative is an uncompromising choice of dire crapbasket.
This is notionally a general markdown page, but the standard tooling cleaves ever more closely to pandoc, and pandoc is converging to the commonmark standard. Ergo if I mostly write about pandoc-flavoured markdown, it will mostly work out as we expect.
pandoc
As close as we get to a reference markdown implementation.
I install pandoc via homebrew. If you are using RStudio, you already have it installed. You can access it by putting it on your path. For macOS this looks like
export PATH=$PATH:/Applications/RStudio.app/Contents/MacOS/pandoc
conda, the python package manager, will obediently install it for you also. The default version is ancient, though. Use the conda-forge version:
conda install -c conda-forge pandoc
You can also install it via, e.g. a linux package manager, but this is not recommended as it tends to be an even more elderly version and the improvements in recent pandoc versions are great. You could also compile it from source, but this is laborious because it is written in Haskell, a semi-obscure language with hefty installation requirements of its own. There are probably other options, but I don’t know them.
pandoc tricks
John MacFarlane’s pandoc tricks are the canonical tricks, as John MacFarlane is the boss of pandoc, which is nearly the same as being the boss of markdown.
You want fancy mathematical macros, or a latex preamble? Something more elaborate still?
Modify a template to include a custom preamble, e.g. for custom document type. Here’s how you change that globally:
pandoc -D latex > ~/.pandoc/templates/default.latex
Or locally:
pandoc -D latex > template.latex
pandoc --template=template.latex …
If you only want some basic macros a document type alteration is probably overkill. Simply prepend a header file
pandoc -H _macros.tex chapter_splitting.md -o chapter_splitting.pdf
NB Pandoc will expand basic LaTeX macros all by itself, even in HTML output.
There are many other pandoc template tricks.
As discussed also in my citation guide, I use pandoc-citeproc. See also the relevant bit of the pandoc manual.
Cross references are supported by pandoc-crossref or some combination of pandoc-fignos, pandoc-eqnos etc.
You invoke that with the following flags (order important):
pandoc -F pandoc-crossref -F pandoc-citeproc file.md -o file.html
The resulting syntax is
$$ x^2 $$ {#eq:label}
for labels and, for references,
@fig:label
@eq:label
@tbl:label
or
[@fig:label1;@fig:label2;…]
[@eq:label1;@eq:label2;…]
[@tbl:label1;@tbl:label2;…]
etc.
Annoyingly, RMarkdown, while still using pandoc AFAICT, does this slightly differently:
See equation \@ref(eq:linear)
\begin{equation}
a + bx = c (\#eq:linear)
\end{equation}
Citations can either be rendered by pandoc itself or passed through to some BibTeX nightmare, if you regard the modern tendency to support diacritics and other non-English typography as an insidious plot by malevolent agencies.
Citekeys per default look like BibTeX, and indeed BibTeX citations seem to pass through.
\cite{heyns_foo_2014,heyns_bar_2015}
They are rendered in the output by an in-built pandoc filter, which is installed separately:
The preferred pandoc-citeproc format seems to be something with an @ sign and/or occasional square brackets:
Blah blah [see @heyns_foo_2014, pp. 33-35; also @heyns_bar_2015, ch. 1].
But @heyns_baz_2016 says different things again.
This is how you output it.
# Using the CSL transform
pandoc -F pandoc-citeproc --csl=apa.csl --bibliography=bibliography.bib \
-o document.pdf document.md
# or using biblatex and the traditionalist workflow.
pandoc --biblatex --bibliography=bibliography.bib \
-o document.tex document.md
latexmk document
If you want your reference section numbered, you need some magic:
## References
::: {#refs}
:::
Too many types of table to choose from. I usually find the pipe tables easiest since they don’t need me to align text. They look like
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
| 12 | 12 | 12 | 12 |
| 123 | 123 | 123 | 123 |
| 1 | 1 | 1 | 1 |
panflute-filters is a bunch of useful filters:
pandoc-figures
pandoc-tables
pandoc-algorithms
pandoc-tex
The scripting API includes Haskell, an embedded lua interpreter, SDKs for other languages, and a free massage voucher, probably. The intermediate representation can be serialised to JSON, so you can use any language that handles JSON if you are especially passionate about some other language, e.g. python, or any text data processing trick.
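As a toy illustration of the JSON route (stdlib only; the walker and the `shout` transform are my own, not part of pandoc):

```python
import json

def walk(node, action):
    """Recursively apply `action` to every dict element of a pandoc JSON AST."""
    if isinstance(node, dict):
        node = action(node)
        return {k: walk(v, action) for k, v in node.items()}
    if isinstance(node, list):
        return [walk(x, action) for x in node]
    return node

def shout(elem):
    # Toy transformation: uppercase every Str element.
    if elem.get('t') == 'Str':
        return {'t': 'Str', 'c': elem['c'].upper()}
    return elem

# A fragment of pandoc's JSON representation of the paragraph "hello world":
para = json.loads('{"t":"Para","c":[{"t":"Str","c":"hello"},'
                  '{"t":"Space"},{"t":"Str","c":"world"}]}')
# To use for real, read the AST from stdin, write it to stdout, and pipe:
#   pandoc -t json in.md | python myfilter.py | pandoc -f json -o out.md
```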
Mostly, the trick is remembering the right flags for conversion to markdown:
xclip --out -selection clipboard |
pandoc -f latex -t markdown+tex_math_single_backslash \
--atx-headers | \
xclip -selection clipboard &
Pandoc’s reStructuredText reader exists but is not great. One option is to go via HTML, e.g.
rst2html.py --math-output=MathJax document.rst | pandoc -f html -t markdown -
This will mangle your mathematical equations.
Or, this will mangle your links and headings:
pandoc -f rst -t markdown document.rst
There are also reST-specific converters which circumvent some of these snares. A python option leveraging the reST infrastructure is rst_to_md:
This writer lets you convert reStructuredText documents to Markdown with Docutils. The package includes a writer and translator along with a command-line tool for doing conversions.
This was originally developed to support Sixty North’s publication efforts, so it may have behaviors that are specific to those needs. However, it should be generally useful for rst-to-md conversion.
pip install git+https://github.com/sixty-north/rst_to_md
rst-to-md module_1.rst > chapter_1.md
It was missing some needful things, e.g. math markup support. Nonetheless, rst_to_md has the right approach. As evidence, I added math support in my own fork. It took 15 minutes.
Too simple? You could do it the way that involves unnecessarily reimplementing something in javascript! rst2mdown is restructuredtext for node.js. I will not be trying this for myself.
Books and theses.
Tom Pollard’s PhD thesis shows you how to plug all these bits together. Mat Lipson’s fork makes this work for my university, UNSW Sydney. Chester Ismay’s Thesisdown does it for Rmarkdown, which was adapted for UNSW by James Goldie.
Manubot is a workflow and set of tools for the next generation of scholarly publishing. Write your manuscript in markdown, track it with git, automatically convert it to .html, .pdf, or .docx, and deploy it to your destination of choice.
knitr is the R-based entrant in the scientific workbook race, combining the code that creates your data analysis with the text, keeping them in sync in perpetuity.
It is frequently used in the form of RMarkdown which supports the markdown input format instead of e.g. LaTeX. I most often use it in the form of blogdown, which is the engine that drives this blog. There are several pieces in this toolchain with a complicated relationship, but the user can ignore most of this complexity. The result is such fanciness as automatically rendering and caching graphs, an interactive notebook UI, nearly first-class support for python and julia plus mediocre support for other languages. Here are some guides:
For an intro to the various way to build this into a full reproducible research workflow, via, say, a scientific workbook see the excellent reproducible analysis workshop.
Multi-part documents are supported, even for LaTeX. The keyword is ‘child’ documents.
To execute it from the command-line you do
R -e "rmarkdown::render('script.Rmd', output_file='output.html')"
There are quirks depending on what markup format you use.
LaTeX is just LaTeX.
Markdown (via the RMarkdown package) does not have a unified standard for all the bells and whistles you might want. You can include graphics via native markdown or via R itself. The latter is more powerful, if more circuitous, doing e.g. automatic resizing.
RMarkdown equation references are supported but weird.
See equation \@ref(eq:linear)
\begin{equation}
a + bx = c (\#eq:linear)
\end{equation}
Some miscellaneous tips:
- Want to tweak the pandoc output by e.g. adding extensions? There are two ways: Markdown variants and the md_extensions option.
- tufte is an Edward-Tufte-compliant stylesheet (cran link). tint is an alternate version.
- pander is recommended for the same purpose; it targets pandoc.
- RStudio has intimate RMarkdown integration. AFAICT nothing else supports interactive editing nearly as well, though; Yihui shows you how to make it work.
AFAICT stencila is a hosted GUI for reproducible research in RMarkdown. USD39/month.
Citations are laundered through either biblatex or pandoc-citeproc. Configuration is via blogdown header.
A placeholder for learning on curved spaces. Not discussed: learning OF the curvature of spaces.
Learning where there is an a priori manifold seems to also be a usage here? For example the manifold of positive definite matrices is treated in depth in Chikuse (2003).
See the work of, e.g.
Manifold optimisation implementations:
There are at least two textbooks online:
The unholy offspring of Fisher information and differential geometry, about which I know little except that it sounds like it should be intuitive. It is probably synonymous with some of the other items on this page if I could sort out all this terminology. See information geometry.
You can also discuss Hamiltonian Monte Carlo in this setting. I will not.
Girolami et al. discuss Langevin Monte Carlo in this context.
See natural gradients.
Albert Tarantola’s framing, from his manuscript. How does it relate to information geometry? I don’t know yet. Haven’t had time to read. Also not a common phrasing, which is a danger sign.
A placeholder.
I’d like to know how good the results are getting in this area, and how general across people/technologies etc. How close are we to the point that someone can put an arbitrary individual in some kind of tomography machine and say what they are thinking without pre-training or priming?
The instruments we have are blunt. Consider, could a neuroscientist even understand a microprocessor? (Jonas and Kording 2017) What hope is there of brains?
TODO: discuss the infamously limp state of fMRI inference, problem of multiple testing in correlated fields etc.
TBC.
Assuming you can get information out of your instruments, can you decode something meaningful? Marcel Just et al. do a lot of this. It for sure leads to fun press releases, e.g. CMU Scientists Harness “Mind Reading” Technology to Decode Complex Thoughts, but I need time to see the details to understand how much progress they are making towards the science-fiction version (Wang, Cherkassky, and Just 2017).
Researchers watch video images people are seeing decoded from their fMRI brain scans in near-real-time. If you want to have a crack at this yourself, you might check out Katja Seeliger’s mind reading datasets.
More intrusively, in rats… Real-time readouts of location memory:
by recording the electrical activity of groups of neurons in key areas of the brain they could read a rat’s thoughts of where it was, both after it actually ran the maze and also later when it would dream of running the maze in its sleep
How best should learning mechanisms store and retrieve memories? Important in reinforcement learning. Implicit in recurrent networks. One of the chief advantages of neural Turing machines. A great apparent success of transformers.
But, as my colleague Tom Blau points out, perhaps best considered as a topic in its own right.
A function \(g: \mathbb{R}^{d}\setminus\{0\} \rightarrow \mathbb{R}\) is radial if there is a function \(k : \mathbb{R}^+ \rightarrow \mathbb{R}\) such that \[g(\mathbf{x})=k(\|\mathbf{x}\|),\,\mathbf{x}\in\mathbb{R}^d\setminus\{0\}.\]
Radial functions are connected to [dot product](./kernel_zoo.html#dot-product) kernels, in that dot product kernels are a special case of radial functions. Working out whether a given radial function is also a dot-product kernel, i.e. whether it is positive definite, is not trivial. Smola, Óvári, and Williamson (2000) find rules for functions constrained to a sphere based on a Legendre basis decomposition. Alternatively you can use the constructive approach of the Schaback-Wu transform algebra, which promises to preserve positive-definiteness under certain operations. Both these approaches get quite onerous except in special cases.
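One cheap, if inconclusive, check is numerical (my own sketch, unrelated to either paper): evaluate the Gram matrix of the radial function at random points and look at its spectrum. A negative eigenvalue falsifies positive definiteness; no finite sample can prove it.

```python
import numpy as np

def radial_gram(k, X):
    """Gram matrix K[i, j] = k(||x_i - x_j||) for a radial function k."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return k(D)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
# The Gaussian radial function is positive definite in every dimension,
# so its Gram spectrum should be (numerically) nonnegative.
eigs = np.linalg.eigvalsh(radial_gram(lambda r: np.exp(-r**2), X))
```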
How to work with radial functions
A classic transform for dealing with general radial functions: \[(\mathcal{H}_{\nu }k)(s):=\int _{0}^{\infty }k(r)J_{\nu }(sr)\,r\,\mathrm{d} r.\] Nearly simple. Easy in special cases. Otherwise horrible. TBC.
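In the horrible general case we can at least fall back on quadrature. A brute-force sketch (the truncation at `r_max` and the names are my choices), checked against the classical fact that the Gaussian \(k(r)=e^{-r^2/2}\) is a fixed point of the order-0 Hankel transform:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import j0

def hankel0(k, s, r_max=50.0):
    """Order-0 Hankel transform of the radial profile k, by brute quadrature."""
    val, _ = quad(lambda r: k(r) * j0(s * r) * r, 0.0, r_max, limit=200)
    return val

# The Gaussian is its own order-0 Hankel transform: H_0[e^{-r^2/2}](s) = e^{-s^2/2}.
for s in (0.5, 1.0, 2.0):
    assert abs(hankel0(lambda r: np.exp(-r**2 / 2), s) - np.exp(-s**2 / 2)) < 1e-6
```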
So here is a weird rabbit hole I went down; it concerns a cute algebra over radial function integrals that turns out to be not that useful for the kinds of problems you face, but comes out very nicely if you, e.g. have very particular function structures, or really want to know that you have preserved positive-definiteness in your functions.
Here I am going to try to understand (Robert Schaback and Wu 1996), who handle the multivariate Fourier transforms and convolutions of radial functions through univariate integrals, which are a kind of warped Hankel transform. This is a good trick if it works, because this special case is relevant to, e.g. isotropic stationary kernels. They tweak the definition of the radial functions. Specifically, they call a function \(g: \mathbb{R}^{d}\setminus\{0\} \rightarrow \mathbb{R}\) radial if there is a function \(f: \mathbb{R}^+ \rightarrow \mathbb{R}\) such that \[g(x)=f(\|x\|_2^2/2),\,x\in\mathbb{R}^d\setminus\{0\}.\] This relates to the classic version by \(k(\sqrt{2s})=f(s).\)
(Robert Schaback and Wu 1996) is one of those articles where the notation is occasionally ambiguous: it would have been useful to mark which variables are vectors and which scalars, and to avoid overloading definitions. Also they recycle function names: watch out for \(f,\) \(g\) and \(I\) doing double duty. They use the following convention for a Fourier transform: \[\mathcal{F}_{d}g(\omega) := \hat{g}(\omega):=(2 \pi)^{-d / 2} \int_{\mathbb{R}^{d}} g(x) \mathrm{e}^{-\mathrm{i} \omega^{\top} x} \mathrm{~d} x\] and \[\mathcal{F}^{-1}_{d}g(x) := \check{g}(x):=(2 \pi)^{-d / 2} \int_{\mathbb{R}^{d}} g(\omega) \mathrm{e}^{+\mathrm{i} \omega^{\top} x} \mathrm{~d} \omega\] for \(g \in L_{1}\left(\mathbb{R}^{d}\right).\)
Now if \(g(x)=f\left(\frac{1}{2}\|x\|^{2}\right)\) is a radial function, then the \(d\)-variate Fourier transform is \[\begin{aligned} \hat{g}(\omega) &=\|\omega\|_{2}^{-(d-2)/2} \int_{0}^{\infty} f\left(\frac{1}{2} s^{2}\right) s^{d / 2} J_{(d-2)/2}\left(s \cdot\|\omega\|_{2}\right) \mathrm{d} s \\ &=\int_{0}^{\infty} f\left(\frac{1}{2} s^{2}\right)\left(\frac{1}{2} s^{2}\right)^{(d-2)/ 2}\left(\frac{1}{2} s \cdot\|\omega\|_{2}\right)^{-(d-2) / 2} J_{(d-2) / 2}\left(s \cdot\|\omega\|_{2}\right) s \mathrm{~d} s \\ &=\int_{0}^{\infty} f\left(\frac{1}{2} s^{2}\right)\left(\frac{1}{2} s^{2}\right)^{(d-2) / 2} H_{(d-2)/ 2}\left(\frac{1}{2} s^{2} \cdot \frac{1}{2}\|\omega\|_{2}^{2}\right) s \mathrm{~d} s \end{aligned}\] with the functions \(J_{\nu}\) and \(H_{\nu}\) defined by \[\left(\frac{1}{2} z\right)^{-\nu} J_{\nu}(z)=H_{\nu}\left(\frac{1}{4} z^{2}\right)=\sum_{k=0}^{\infty} \frac{\left(-z^{2} / 4\right)^{k}}{k ! \Gamma(k+\nu+1)}=\frac{{}_0F_{1}\left(\nu+1 ;-z^{2} / 4\right)}{\Gamma(\nu+1)}\] for \(\nu>-1\). (The two-argument form \({}_0F_{1}(\nu+1; -z^{2}/4)\) is the confluent hypergeometric limit function, with its empty upper parameter list suppressed.) If we substitute \(t=\frac{1}{2} s^{2},\) we find \[\begin{aligned} \hat{g}(\omega)&=\int_{0}^{\infty} f(t) t^{(d-2) / 2} H_{(d-2)/2}\left(t \cdot \frac{1}{2}\|\omega\|^{2}\right) \mathrm{d} t \\ &=:\left(F_{\frac{d-2}{2}} f\right)\left(\|\omega\|^{2} / 2\right) \end{aligned}\] with the general operator \[\begin{aligned} \left(F_{\nu} f\right)(r) &:=\int_{0}^{\infty} f(t) t^{\nu} H_{\nu}(t r) \mathrm{d} t. \end{aligned}\]
\(F_{\frac{d-2}{2}}\) is an operator giving the 1-dimensional representation of the \(d\)-dimensional radial Fourier transform of some radial function \(g(x)=f(\|x\|_2^2/2)\) in terms of the radial parameterization \(f\). Note that this parameterization in terms of squared radius is useful in making the mathematics come out nicely, but it is not longer very much like a Fourier transform. Integrating or differentiating with respect to \(r^2\) (which we can do easily) requires some chain rule usage to interpret in the original space, and we no longer have nice things like Wiener-Khintchin or Bochner theorems with respect to this Fourier-like transform. However, if we can use its various nice properties we can possibly return to the actual Fourier transform and extract the information we want.
\(J_{\nu}\) is the Bessel function of the first kind. What do we call the following? \[\begin{aligned} H_{\nu}:s &\mapsto \sum_{k=0}^{\infty} \frac{\left(-s\right)^{k}}{k ! \Gamma(k+\nu+1)}\\ &=\left(\frac{1}{\sqrt{s}}\right)^{\nu}J_{\nu}(2\sqrt{s}) \label{eq:h-as-j}.\end{aligned}\] I do not know, but it is essential to this theory, since only things which integrate nicely with \(H_{\nu}\) are tractable in this theory. We have integrals like this: for \(\nu>\mu>-1\) and all \(r, s>0\) we have \[\left(F_{\mu} H_{\nu}(s)\right)(r)=\frac{s^{-\nu}(s-r)_{+}^{\nu-\mu-1}}{\Gamma(\nu-\mu)}.\] Now, that does not quite induce a (warped) Hankel transform because of the \(\left(\frac{1}{\sqrt{s}}\right)^{\nu}\) term above, but I don’t think that changes the orthogonality of the basis functions, so possibly we can still use a Hankel transform to calculate an approximant to \(f(\sqrt{2s})\) and then transform it back.
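Whatever we call \(H_{\nu}\), it is cheap to evaluate. A quick check (helper names mine) that the power series and the Bessel-function expression above agree:

```python
import numpy as np
from scipy.special import jv, gamma

def H(nu, s, terms=60):
    """H_nu(s) via its power series: sum_k (-s)^k / (k! * Gamma(k+nu+1))."""
    k = np.arange(terms)
    return np.sum((-s) ** k / (gamma(k + 1) * gamma(k + nu + 1)))

def H_via_bessel(nu, s):
    """The same function expressed through the Bessel function J_nu."""
    return s ** (-nu / 2) * jv(nu, 2 * np.sqrt(s))

for nu in [0.0, 0.5, 2.0]:
    for s in [0.3, 1.0, 4.0]:
        assert abs(H(nu, s) - H_via_bessel(nu, s)) < 1e-10
```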
So, in \(d\) dimensions, radial functions can be made from \(H_{(d-2)/2}(s)\). Upon inspection, not many familiar things can be made out of these \(H_{\nu}.\) \(f(r)=\mathbbm{1}\{S\}(r)\) is one; \(f(r)=\exp(-r)\) is another. The others are all odd and contrived or too long to even write down, as far as I can see. Possibly approximations in terms of \(H\) functions would be useful? Up to a warp of the argument, that looks nearly like a Hankel transform.
Comparing it with the Hankel transform \[\begin{aligned} (\mathcal{H}_{\nu }f)(r) &=\int _{0}^{\infty }f(t)tJ_{\nu }(tr)\mathrm{d} t\end{aligned}\]
With this convention, and the symmetry of radial functions, we get \[F^{-1}_{\nu}=F_{\nu}.\] That is, the \(F\) pseudo-Fourier transform is its own inverse. This seems weird because of the \(r^2\) warping, but the Fourier transform is already close to its own inverse for radial functions, and if you squint you can imagine this following from the analogous property of the kinda-similar Hankel transforms.
Let \(\nu>\mu>-1.\) Then for all functions \(f: \mathbb{R}_{>0} \rightarrow \mathbb{R}\) with \[f(t) \cdot t^{\nu-\mu-1 / 2} \in L_{1}\left(\mathbb{R}^{+}\right)\] it follows that \[F_{\mu} \circ F_{\nu}=I_{\nu-\mu}\] where the integral operator \(I_{\alpha}\) is given by \[\left(I_{\alpha} f\right)(r)=\int_{0}^{\infty} f(s) \frac{(s-r)_{+}^{\alpha-1}}{\Gamma(\alpha)} \mathrm{d} s, \quad r>0, \quad \alpha>0.\] Here we have used the truncated power function \[x_{+}^{n}={\begin{cases}x^{n}&:\ x>0\\0&:\ x\leq 0.\end{cases}}\] It can be extended to \(\alpha\leq 0\) with some legwork.
But what is this operator \(I_{\alpha}\)? Some special cases/extended definitions are of interest: \[\begin{aligned} \left(I_{0} f\right)(r) &:=f(r), & & f \in C\left(\mathbb{R}_{>0}\right) \\ \left(I_{-1} f\right)(r) &:=-f^{\prime}(r), & & f \in C^{1}\left(\mathbb{R}_{>0}\right)\\ I_{-n} &:=(I_{-1})^{\circ n}, & & n>0\\ I_{-\alpha} &:=I_{n-\alpha} \circ I_{-n} & & 0<\alpha \leq n=\lceil\alpha\rceil\end{aligned}\] In general \(I_{\alpha}\) is, up to a sign change, \(\alpha\)-fold integration. Note that \(\alpha\) is not in fact restricted to integers, and we have for free all fractional derivatives and integrals encoded in its values. Neat.
If something can be made to come out nicely with respect to this integral operator \(I_{\alpha},\) especially \(\alpha\in\{-1,1/2,1\}\) then all our calculations come out easy.
We have a sweet algebra over these \(I_{\alpha}\) and \(F_{\nu}\) and their interactions: \[I_{\alpha} \circ I_{\beta} = I_{\alpha+\beta}.\] Also \[I_{\alpha} \circ F_{\nu} = F_{\nu-\alpha} \quad\text{and}\quad F_{\nu} \circ I_{\alpha} = F_{\nu+\alpha}.\] Or, rearranging, \[F_{\mu} = I_{\nu-\mu} F_{\nu} = F_{\nu} I_{\mu-\nu}.\]
We have fixed points \[I_{\alpha}(\mathrm{e}^{-r}) = \mathrm{e}^{-r}\] and \[F_{\nu}(\mathrm{e}^{-r}) = \mathrm{e}^{-r}.\]
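These fixed points are easy to verify numerically (a sketch; helper names mine). After substituting \(u=s-r\) in the definition of \(I_{\alpha}\), \((I_{\alpha} e^{-\cdot})(r)=e^{-r}\int_0^\infty e^{-u}u^{\alpha-1}/\Gamma(\alpha)\,\mathrm{d}u=e^{-r}\), which the quadrature confirms:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def I(alpha, f, r):
    """(I_alpha f)(r) = int_r^inf f(s) (s-r)^{alpha-1} / Gamma(alpha) ds, alpha > 0,
    written with the substitution u = s - r to tame the endpoint singularity."""
    val, _ = quad(lambda u: f(u + r) * u ** (alpha - 1) / gamma(alpha), 0, np.inf)
    return val

f = lambda s: np.exp(-s)
for alpha in [0.5, 1.0, 2.5]:
    for r in [0.0, 0.7, 2.0]:
        # e^{-r} is a fixed point of every I_alpha
        assert abs(I(alpha, f, r) - np.exp(-r)) < 1e-6
```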
We can use these formulae to calculate multidimensional radial Fourier transforms. With \(\mathcal{F}_{d}:=F_{\frac{d-2}{2}},\) the \(d\) variate Fourier transform written as a univariate operator on radial functions, we find \[\mathcal{F}_{n}=I_{(m-n) / 2} \mathcal{F}_{m}=\mathcal{F}_{m} I_{(n-m) / 2}\] for all space dimensions \(m, n \geq 1 .\) Recursion through dimensions can be done in steps of two via \[\mathcal{F}_{m+2}=I_{-1} \mathcal{F}_{m}=\mathcal{F}_{m} I_{1}\] and in steps of one by \[\mathcal{F}_{m+1}=I_{-1 / 2} \mathcal{F}_{m}=\mathcal{F}_{m} I_{1 / 2}\]
We have some tools for convolving multivariate radial functions by considering their univariate representations. Consider the convolution operator on radial functions \[C_{\nu}: \mathcal{S} \times \mathcal{S} \rightarrow \mathcal{S}\] defined by \[C_{\nu}(f, g)=F_{\nu}\left(\left(F_{\nu} f\right) \cdot\left(F_{\nu} g\right)\right).\] For \(\nu=\frac{d-2}{2}\) it coincides with the operator that takes \(d\)-variate convolutions of radial functions and rewrites the result in radial form. For \(\nu, \mu \in \mathbb{R}\) we have \[C_{\nu}(f, g)=I_{\mu-\nu} C_{\mu}\left(I_{\nu-\mu} f, I_{\nu-\mu} g\right)\] for all \(f, g \in \mathcal{S}.\)
For dimensions \(d \geq 1\) we have \[C_{\frac{d-2}{2}}(f, g)=I_{\frac{1-d}{2}} C_{-\frac{1}{2}}\left(I_{\frac{d-1}{2}} f, I_{\frac{d-1}{2}} g\right).\] If \(d\) is odd, the \(d\) variate convolution of radial functions becomes a derivative of a univariate convolution of integrals of \(f\) and \(g\). For instance, \[\begin{aligned} f *_{3} g &=I_{-1}\left(\left(I_{1} f\right) *_{1}\left(I_{1} g\right)\right) \\ &=-\frac{d}{d r}\left(\left(\int_{r}^{\infty} f\right) *_{1}\left(\int_{r}^{\infty} g\right)\right). \end{aligned}\]
For \(d\) even, to reduce a bivariate convolution to a univariate convolution, one needs the operations \[\left(I_{1 / 2} f\right)(r)=\int_{r}^{\infty} f(s) \frac{(s-r)^{-1 / 2}}{\Gamma(1 / 2)} \mathrm{d} s\] and the semi-derivative \[\left(I_{-1 / 2} f\right)(r)=\left(I_{1 / 2} I_{-1} f\right)(r)=-\int_{r}^{\infty} f^{\prime}(s) \frac{(s-r)^{-1 / 2}}{\Gamma(1 / 2)} \mathrm{d} s\]
Note that the operators \(I_{1}, I_{-1},\) and \(I_{1 / 2}\) are much easier to handle than the Hankel transforms \(F_{\mu}\) and \(\mathcal{F}_{m} .\) This allows simplified computations of Fourier transforms of multivariate radial functions, if the univariate Fourier transforms are known.
Now, how do we solve PDEs this way? Starting with some test function \(f_{0},\) we can define \[f_{\alpha}:=I_{\alpha} f_{0} \quad(\alpha \in \mathbb{R})\] and get a variety of integral or differential equations from application of the \(I_{\alpha}\) operators via the identities \[f_{\alpha+\beta}=I_{\beta} f_{\alpha}=I_{\alpha} f_{\beta}\] Furthermore, we can set \(g_{\nu}:=F_{\nu} f_{0}\) and get another series of equations \[\begin{array}{l} I_{\alpha} g_{\nu}=I_{\alpha} F_{\nu} f_{0}=F_{\nu-\alpha} f_{0}=g_{\nu-\alpha} \\ F_{\mu} g_{\nu}=F_{\mu} F_{\nu} f_{0}=I_{\nu-\mu} f_{0}=f_{\nu-\mu} \\ F_{\mu} f_{\alpha}=F_{\mu} I_{\alpha} f_{0}=F_{\mu+\alpha} f_{0}=g_{\mu+\alpha} \end{array}\]
For compactly supported functions, we proceed as follows: take the characteristic function \(f_{0}(r)=\chi_{[0,1]}(r)\) and get the truncated power function \[\left(I_{\alpha} f_{0}\right)(r)=\int_{0}^{1} \frac{(s-r)_{+}^{\alpha-1}}{\Gamma(\alpha)} \mathrm{d} s=\frac{(1-r)_{+}^{\alpha}}{\Gamma(\alpha+1)}=f_{\alpha}(r), \quad \alpha>0\] Now we find \[f_{\alpha}=F_{\mu} H_{\nu}\] for \(\nu-\mu=\alpha+1, \nu>\mu>-1\) and \[F_{\mu} f_{\alpha}=H_{\mu+\alpha+1}\] for \(\alpha>0, \mu>-1 .\)
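The first identity, \((I_{\alpha}\chi_{[0,1]})(r)=(1-r)_+^{\alpha}/\Gamma(\alpha+1)\), is cheap to check numerically (helper name mine):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def I_chi(alpha, r):
    """(I_alpha chi_[0,1])(r) = int_0^1 (s-r)_+^{alpha-1} / Gamma(alpha) ds."""
    if r >= 1:
        return 0.0
    val, _ = quad(lambda s: (s - r) ** (alpha - 1) / gamma(alpha), r, 1)
    return val

for alpha in [0.5, 1.0, 2.0, 3.5]:
    for r in [0.0, 0.3, 0.9, 1.2]:
        # truncated power function, per the identity above
        target = max(1 - r, 0) ** alpha / gamma(alpha + 1)
        assert abs(I_chi(alpha, r) - target) < 1e-8
```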
Apparently a whole field? See Pewsey and García-Portugués (2020).
I used to complain that the decomposition of politics into “left” and “right” was arbitrary and unhelpful and should be done away with. Now I believe that people choosing to interpret the richness of the great human project in terms of partisan categorisation is unavoidable, so I am advocating for some other axes to decompose it into instead, descriptively or prescriptively.
Presumably many, but the one I saw a seminar on was Leibon et al. (2011).
Satellite images of various stripes are reviewed here. If you just want eye candy, NASA Visible Earth is a good one. I’m fond of LANDSAT maps. Various others can be found through earth explorer.
See also Australia-specific stuff.
Also interesting is CHIRPS: Rainfall Estimates from Rain Gauge and Satellite Observations.
pangeo is an umbrella organisation providing many geospatial data tools, including a catalogue.
Extreme Weather Dataset Racah et al. (2017) includes for each year a (1460,16,768,1152) array, containing
A sum or product (or outer sum, or tensor product) of kernels is still a kernel. For other transforms YMMV.
For example, in the case of Gaussian processes, suppose that, independently,
\[\begin{aligned} f_{1} &\sim \mathcal{GP}\left(\mu_{1}, k_{1}\right)\\ f_{2} &\sim \mathcal{GP}\left(\mu_{2}, k_{2}\right) \end{aligned}\] then
\[ f_{1}+f_{2} \sim \mathcal{GP} \left(\mu_{1}+\mu_{2}, k_{1}+k_{2}\right) \] so \(k_{1}+k_{2}\) is also a kernel.
More generally, if \(k_{1}\) and \(k_{2}\) are two kernels, and \(c_{1}\), and \(c_{2}\) are two positive real numbers, then:
\[ K(x, x')=c_{1} k_{1}(x, x')+c_{2} k_{2}(x, x') \] is again a kernel. What with the multiplication as well, we note that all polynomials of kernels where the coefficients are positive are in turn kernels. (Genton 2001)
Note that the additivity in terms of kernels is not the same as additivity in terms of induced feature spaces. The induced feature map of \(k_{1}+k_{2}\) is their concatenation rather than their sum. Suppose \(\phi_{1}(x)\) gives us the feature map of \(k_{1}\) for \(x\) and likewise \(\phi_{2}(x)\).
\[\begin{aligned} k_{1}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x') \\ k_{2}(x, x') &=\phi_{2}(x)^{\top} \phi_{2}(x')\\ k_{1}(x, x')+k_{2}(x, x') &=\phi_{1}(x)^{\top} \phi_{1}(x')+\phi_{2}(x)^{\top} \phi_{2}(x')\\ &=\left[\begin{array}{c}{\phi_{1}(x)} \\ {\phi_{2}(x)}\end{array}\right]^{\top} \left[\begin{array}{c}{\phi_{1}(x')} \\ {\phi_{2}(x')}\end{array}\right] \end{aligned}\]
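A tiny numpy demonstration of this concatenation identity (the feature maps are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# two illustrative feature maps and the kernels they induce
phi1 = lambda x: np.concatenate([x, x**2])  # k1(x,y) = phi1(x).phi1(y)
phi2 = lambda x: np.sin(x)                  # k2(x,y) = phi2(x).phi2(y)

K1 = np.array([[phi1(a) @ phi1(b) for b in X] for a in X])
K2 = np.array([[phi2(a) @ phi2(b) for b in X] for a in X])

# the sum kernel corresponds to *concatenating* the feature maps
phi = lambda x: np.concatenate([phi1(x), phi2(x)])
Ksum = np.array([[phi(a) @ phi(b) for b in X] for a in X])
assert np.allclose(K1 + K2, Ksum)

# and the sum is still positive semi-definite
assert np.linalg.eigvalsh(K1 + K2).min() > -1e-10
```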
If \(k:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}\) is a kernel and \(\psi: \mathcal{X}\to\mathcal{Y}\) is any map, then the following is also a kernel
\[\begin{aligned} k_{\psi}:&\mathcal{X}\times\mathcal{X}\to\mathbb{R}\\ & (x,x')\mapsto k(\psi(x), \psi(x')) \end{aligned}\]
which apparently is now called a deep kernel. If \(k\) is stationary and \(\psi\) is invertible then this is a stationary reducible kernel.
Also if \(A\) is a positive definite operator, then of course it defines a kernel \(k_A(x,x'):=x^{\top}Ax'\)
(Genton 2001) uses the properties of covariance to construct some other nifty ones:
Let \(h:\mathcal{X}\to\mathbb{R}^{+}\) have its minimum \(h(0)=0\). Then, using the following identity for random variables
\[ \mathop{\textrm{Cov}}\left(Y_{1}, Y_{2}\right)=\left[\mathop{\textrm{Var}}\left(Y_{1}+Y_{2}\right)-\mathop{\textrm{Var}}\left(Y_{1}-Y_{2}\right)\right] / 4 \]
we find that the following is a kernel
\[ K(x, x')=\frac{1}{4}[h(x+x')-h(x-x')] \]
All these lead to various cunning combination strategies, which I may return to discuss. 🏗 Some of them are in the references. For example, (Duvenaud et al. 2013) position their work in the wider field:
There is a large body of work attempting to construct a rich kernel through a weighted sum of base kernels (e.g. (Bach 2008; Christoudias, Urtasun, and Darrell 2009)). While these approaches find the optimal solution in polynomial time, speed comes at a cost: the component kernels, as well as their hyperparameters, must be specified in advance […]
(Hinton and Salakhutdinov 2008) use a deep neural network to learn an embedding; this is a flexible approach to kernel learning but relies upon finding structure in the input density, p(x). Instead we focus on domains where most of the interesting structure is in f(x).
(Wilson and Adams 2013) derive kernels of the form \(SE \times \cos(x - x_0)\), forming a basis for stationary kernels. These kernels share similarities with \(SE \times Per\) but can express negative prior correlation, and could usefully be included in our grammar.
See (Grosse et al. 2012) for a mind-melting compositional matrix factorization diagram, constructing a search over hierarchical kernel decompositions.
Examples of existing machine learning models which fall under our framework. Arrows represent models reachable using a single production rule. Only a small fraction of the 2496 models reachable within 3 steps are shown, and not all possible arrows are shown.
(Genton 2001) defines these as kernels that have a particular structure; specifically, a kernel that can be factored into a stationary kernel \(K_2\) and a non-negative function \(K_1\) in the following way:
\[ K(\mathbf{s}, \mathbf{t})=K_{1}\left(\frac{\mathbf{s}+\mathbf{t}}{2}\right) K_{2}(\mathbf{s}-\mathbf{t}) \]
Global structure then depends on the mean location \(\frac{\mathbf{s}+\mathbf{t}}{2}\). (Genton 2001) describes some nifty spectral properties of these kernels.
Other constructions might vie for the title of “locally stationary”. To check. 🏗
See kernel warping.
What follows are some useful kernels to have in my toolkit, mostly over \(\mathbb{R}^n\) or at least some space with a metric. There are many more than I could fit here, of course. And kernels are defined over many spaces: real vectors, strings, other kernels, probability distributions etc.
For these I have freely raided David Duvenaud’s crib notes which became a thesis chapter (D. Duvenaud 2014). Also wikipedia and (Abrahamsen 1997; Genton 2001).
A popular assumption, more or less implying that no region of the process is special. In this case the kernel may be written as a function purely of the distance between observations, \[K(s,t)=K(\|s-t\|)\] for some distance \(\|\cdot\|\) between the observation coordinates. This kind of translation-invariant kernel is the default. They are conveniently analysed in terms of the Wiener-Khinchin theorem.
The kernel is a function of the inner product/dot product of the input coordinates, \[K(s,t)=K(s\cdot t).\] Hard to google for because there is the confounding fact that kernels already define inner products in some other space, so it’s an inner product defined in terms of an inner product. Such kernels are rotation invariant but not stationary. Instead of the Fourier relationships that stationary kernels have, these have a neat relationship to Legendre bases for radial functions (Smola, Óvári, and Williamson 2000), which tells you whether your dot product function does in fact define a kernel.
A really interesting dot-product kernel is the arc-cosine kernel (Cho and Saul 2009).
\[ k_{n}(\mathbf{x}, \mathbf{y})=2 \int \mathbf{d} \mathbf{w} \frac{e^{-\frac{\|\mathbf{w}\|^{2}}{2}}}{(2 \pi)^{d / 2}} \Theta(\mathbf{w} \cdot \mathbf{x}) \Theta(\mathbf{w} \cdot \mathbf{y})(\mathbf{w} \cdot \mathbf{x})^{n}(\mathbf{w} \cdot \mathbf{y})^{n} \]
The kernel result is compactly expressed in terms of the angle \(\theta\) between the inputs: \[ \theta=\cos ^{-1}\left(\frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}\right) \] Specifically, \[ k_{n}(\mathbf{x}, \mathbf{y})=\frac{1}{\pi}\|\mathbf{x}\|^{n}\|\mathbf{y}\|^{n} J_{n}(\theta) \] where \(J_{n}(\theta)\) is given by: \[ J_{n}(\theta)=(-1)^{n}(\sin \theta)^{2 n+1}\left(\frac{1}{\sin \theta} \frac{\partial}{\partial \theta}\right)^{n}\left(\frac{\pi-\theta}{\sin \theta}\right) \] The first few \(J_{n}\) are \[ \begin{array}{l} J_{0}(\theta)=\pi-\theta \\ J_{1}(\theta)=\sin \theta+(\pi-\theta) \cos \theta. \end{array} \] This recovers the ReLU activation in the infinite width limit.
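A direct implementation of the closed form (helper name mine), with two checks that follow from the formula: \(k_1(x,x)=\|x\|^2\) (since \(J_1(0)=\pi\)), and \(k_1(x,y)=\|x\|\|y\|/\pi\) for orthogonal inputs (since \(J_1(\pi/2)=1\)):

```python
import numpy as np

def arccos_kernel_n1(x, y):
    """Order-1 (ReLU) arc-cosine kernel of Cho & Saul (2009), closed form."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    cos_t = np.clip(x @ y / (nx * ny), -1.0, 1.0)
    theta = np.arccos(cos_t)
    J1 = np.sin(theta) + (np.pi - theta) * np.cos(theta)
    return nx * ny * J1 / np.pi

x = np.array([1.0, 2.0, -0.5])
# at theta = 0 we have J_1 = pi, so k(x, x) = ||x||^2
assert abs(arccos_kernel_n1(x, x) - x @ x) < 1e-12

# orthogonal inputs: theta = pi/2, J_1 = 1, so k = ||x|| ||y|| / pi
y = np.array([2.0, -1.0, 0.0])
assert x @ y == 0
target = np.linalg.norm(x) * np.linalg.norm(y) / np.pi
assert abs(arccos_kernel_n1(x, y) - target) < 1e-12
```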
Time-indexed processes are more general than a standard Wiener process. 🏗
What constraints make a covariance kernel causal?
How can we know from inspecting a kernel whether it implies an independence structure of some kind? The Wiener process and causal kernels clearly imply certain independences. Any kernel \(k(s,t)=k(s\wedge t)\) is clearly Markov. Are there more general ones? TODO: relate to kernels of bounded support. 🏗
This is the covariance kernel possessed by a standard Wiener process, a process with independent Gaussian increments. It lives on a simple index space, time \(t\geq 0\). We can read this right off the Wiener process Wikipedia page. For a standard Wiener process \(\{W_t\}_{t\geq 0},\)
\[ {\displaystyle \operatorname {cov} (W_{s},W_{t})=s \wedge t} \]
Here \(s \wedge t\) means “the minimum of \(s\) and \(t\)”.
That result is standard. From it we can immediately construct the kernel \(K(s,t):=s \wedge t\).
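On a regular grid this kernel has a transparent factorisation: the Wiener process is a cumulative sum of independent increments, so the Gram matrix of \(s \wedge t\) is \(\Delta\,CC^{\top}\), with \(C\) a lower-triangular matrix of ones. A small check (names mine):

```python
import numpy as np

n, dt = 50, 0.1
t = dt * np.arange(1, n + 1)
K = np.minimum.outer(t, t)  # K(s, t) = s ∧ t on the grid

# min(t_i, t_j) = dt * (C C^T)_{ij} with C lower-triangular ones:
# the Wiener process is a cumulative sum of independent increments
C = np.tril(np.ones((n, n)))
assert np.allclose(K, dt * C @ C.T)

# hence the Gram matrix is positive semi-definite
assert np.linalg.eigvalsh(K).min() > -1e-10
```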
A.k.a. exponentiated quadratic. Often “radial basis function” means this kernel too.
The classic, default choice; analytically convenient because it is proportional to the Gaussian density and therefore cancels out with it at opportune times.
\[k_{\textrm{SE}}(x, x') = \sigma^2\exp\left(-\frac{(x - x')^2}{2\ell^2}\right)\]
Duvenaud reckons this is everywhere but TBH I have not seen it. Included for completeness.
\[k_{\textrm{RQ}}(x, x') = \sigma^2 \left( 1 + \frac{(x - x')^2}{2 \alpha \ell^2} \right)^{-\alpha}\]
Note that \(\lim_{\alpha\to\infty} k_{\textrm{RQ}}= k_{\textrm{SE}}\).
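To see this limit concretely, a quick numeric check (the helper names `k_se` and `k_rq` are mine; unit variance and length scale assumed):

```python
import numpy as np

def k_se(d2, sigma2=1.0, ell=1.0):
    """Squared exponential kernel as a function of squared distance d2."""
    return sigma2 * np.exp(-d2 / (2 * ell**2))

def k_rq(d2, sigma2=1.0, ell=1.0, alpha=1.0):
    """Rational quadratic kernel as a function of squared distance d2."""
    return sigma2 * (1 + d2 / (2 * alpha * ell**2)) ** (-alpha)

d2 = np.linspace(0, 9, 50)  # a grid of squared distances
# as alpha -> infinity the rational quadratic approaches the SE kernel
assert np.max(np.abs(k_rq(d2, alpha=1e7) - k_se(d2))) < 1e-6
```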
The Matérn stationary (and, in the Euclidean case, isotropic) covariance function is a surprisingly convenient model for covariance. See Carl Edward Rasmussen’s Gaussian Process lecture notes for a readable explanation, or chapter 4 of his textbook (Rasmussen and Williams 2006).
\[ k_{\textrm{Mat}}(x, x')=\sigma^{2} \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2 \nu} \frac{|x - x'|}{\rho}\right)^{\nu} K_{\nu}\left(\sqrt{2 \nu} \frac{|x - x'|}{\rho}\right) \]
where \(\Gamma\) is the gamma function, \(K_{\nu}\) is the modified Bessel function of the second kind, and \(\rho,\nu> 0\).
The parameter \(\nu\) directly controls how differentiable the sample paths are. Nifty.
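A sanity check of the formula (a sketch; the helper `matern` is mine, built on SciPy’s modified Bessel function): the \(\nu=1/2\) case collapses to the exponential (Ornstein-Uhlenbeck) kernel \(\sigma^2 e^{-|x-x'|/\rho}\).

```python
import numpy as np
from scipy.special import kv, gamma

def matern(d, nu, rho=1.0, sigma2=1.0):
    """Matern kernel as a function of distance d = |x - x'|."""
    d = np.asarray(d, dtype=float)
    z = np.sqrt(2 * nu) * d / rho
    out = sigma2 * 2 ** (1 - nu) / gamma(nu) * z**nu * kv(nu, z)
    return np.where(d == 0, sigma2, out)  # limit at d = 0 is sigma2

d = np.linspace(1e-6, 5, 100)
# nu = 1/2 recovers the exponential kernel exp(-d/rho)
assert np.allclose(matern(d, nu=0.5), np.exp(-d), atol=1e-8)
```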
\[ k_{\textrm{Per}}(x, x') = \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \]
This is an example of a composed kernel, explained below.
\[\begin{aligned} k_{\textrm{LocPer}}(x, x') &= k_{\textrm{Per}}(x, x')k_{\textrm{SE}}(x, x') \\ &= \sigma^2\exp\left(-\frac{2\sin^2(\pi|x - x'|/p)}{\ell^2}\right) \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \end{aligned}\]
Obviously there are other possible localisations of a periodic kernel. This is a locally periodic kernel. NB it is not local in the sense of Genton’s local stationarity, just local in the sense that one kernel is ‘enveloped’ by another.
I just noticed the ambiguously named Integral kernel:
I’ve called the kernel the ‘integral kernel’ as we use it when we know observations of the integrals of a function, and want to estimate the function itself.
Examples include:
- Knowing how far a robot has travelled after 2, 4, 6 and 8 seconds, but wanting an estimate of its speed after 5 seconds…
- Wanting to know an estimate of the density of people aged 23, when we only have the total count for binned age ranges…
I would argue that all kernels are naturally defined in terms of integrals, but the author seems to mean something particular. I suspect I would call this a sampling kernel, but that name is also overloaded. Anyway, what is actually going on here? Where is it introduced? Possibly one of (Smith, Alvarez, and Lawrence 2018; O’Callaghan and Ramos 2011; Murray-Smith and Pearlmutter 2005).
See composing kernels.
(Sun et al. 2018; Bochner 1959; Kom Samo and Roberts 2015; Yaglom 1987) construct spectral kernels in the sense that they use the spectral representation to design the kernel and guarantee it is positive definite and stationary. You could think of this as a kind of limiting case of composing kernels with a Fourier basis. See Bochner’s theorem.
(Sun et al. 2018; Remes, Heinonen, and Kaski 2017; Kom Samo and Roberts 2015) use a generalised Bochner Theorem (Yaglom 1987) often called Yaglom’s Theorem, which does not presume stationarity. See Yaglom’s theorem.
It is not immediately clear how to use this; spectral representations are not an intuitive way of constructing things.
We usually think about compactly supported kernels in the stationary isotropic case, where we mean kernels that vanish whenever the distance between two observations \(s,t\) is larger than a certain cut-off distance \(L,\) i.e. \(\|s-t\|>L\Rightarrow K(s,t)=0\). These are great because they make the Gram matrix sparse (for example, if the cut-off is much smaller than the diameter of the observations and most observations have few covariance neighbours) and so can lead to computational efficiency even for exact inference without any special tricks. They don’t seem to be popular, though. Statisticians are generally nervous about inferring the support of a parameter, or assigning zero weight to any region of a prior without good reason, so maybe that is why.
One example is the truncated-power kernel \[ \max \left\{\left(1-\frac{\|\mathbf{s}-\mathbf{t}\|}{\tilde{\theta}}\right)^{\tilde{\nu}}, 0\right\} \] and we are handballed to Gneiting (2002b) for a bigger smörgåsbord of stationary compactly supported kernels. The Gneiting (2002b) article has a couple of methods designed to produce certain smoothness properties at boundary and origin, but mostly concerns producing compactly supported kernels via clever integral transforms.
NB if you are trying specifically to enforce sparsity here, it might be worth considering the kernel induced by a stochastic convolution.
That’s my name for them because they seem to originate in (Genton 2001).
For any non-negative function \(h:\mathcal{T}\to\mathbb{R}^+\) with \(h(\mathbf{0})=0,\) the following is a kernel:
\[ K(\mathbf{s}, \mathbf{t})=\frac{h(\mathbf{s}+\mathbf{t})-h(\mathbf{s}-\mathbf{t})}{4} \] Genton gives the example of \(h:\mathbf{x}\mapsto \mathbf{x}^{\top} \mathbf{x}=\|\mathbf{x}\|_2^2,\) which yields the kernel \[ K(\mathbf{x}, \mathbf{z})=\frac{1}{4}\left[(\mathbf{x}+\mathbf{z})^{\top}(\mathbf{x}+\mathbf{z})-(\mathbf{x}-\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z})\right]=\mathbf{x}^{\top} \mathbf{z}. \] The motivation is the identity
\[ \operatorname{Cov}\left(Y_{1}, Y_{2}\right)= \frac{\operatorname{Var}\left(Y_{1}+Y_{2}\right)-\operatorname{Var}\left(Y_{1}-Y_{2}\right)}{4}. \]
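This construction is easy to verify numerically for the \(h(\mathbf{x})=\mathbf{x}^{\top}\mathbf{x}\) example (a sketch; names mine):

```python
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: x @ x  # h(x) = ||x||_2^2, non-negative with h(0) = 0

def K(s, t):
    """Genton's variance-identity kernel built from h."""
    return (h(s + t) - h(s - t)) / 4

# for this choice of h, the construction recovers the dot-product kernel
for _ in range(100):
    x, z = rng.normal(size=(2, 4))
    assert abs(K(x, z) - x @ z) < 1e-12
```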
(D. Duvenaud 2014, chap. 2) summarises Ginsbourger et al’s work on kernels with desired symmetries / invariances. 🏗 This produces for example, the periodic kernel above, but also such cute tricks as priors over Möbius strips.
See kernel warping.
Disentangling the programs you are running from the terminal you launched them in, for recombining and tweaking.
Typically one encounters this concept via tmux, or the rather ancient screen, which do a lot of stuff at once, making it harder to explain.
The convenience in this case might be outweighed by the confusion. It was for me. For example, it seems easy to confuse terminal session management with window management-plus-terminal emulation, which is rather a different thing.
tmux, for example, does several of these jobs at once.
The combination of these things is called a terminal multiplexer, and we think about it as a way of multiplexing several terminal sessions into one network session, which is what people usually pitch to you. For me the idea of temporarily detaching processes from the launching terminal and then resuming control of them later is the killer feature. Multiple terminals and such are a side effect that I rarely use, which means that a lot of the breathless tutorials focus on features I do not need.
These concerns are nicely separated in abduco, whose architecture is much clearer.
abduco + dvtm
abduco is a tool which does strictly #1: session management.
abduco provides session management, i.e. it allows programs to be run independently from their controlling terminal. That is, programs can be detached (run in the background) and then later reattached.
i.e. after I get booted off the server running my slow job I can get back online and check on its progress later without anything going south because of my network embarrassment.
However, if I am running some interactive process I might want that session to contain a nice virtual terminal which keeps track of the state of my screen and IO and all that stuff.
dvtm is the twin to abduco that provides such niceties as clean and re-usable screen state.
Now that we have used abduco for its pedagogic clarity, let us do what everyone else does, and ignore it in favour of the slightly more confusing tmux, which is well documented and supported and ubiquitous, and thus has great practical advantages over its ideologically and pedagogically purer cousin.
tmux
tmux is functionally a combination of abduco and dvtm.
On one hand it is easy because it is so popular and thus documented everywhere, and installed most places.
On the other, it is confusing and has weird terminology, so you need all that documentation to work out what you just did.
There is a comparison on slant of the tmux and abduco ecosystems.
A neat feature of tmux is that it has a magical integration with iterm2 on macOS in “control centre” mode, which makes everything intuitive by recycling the terminal GUI for handling session management. However, this is controversial.
Anyway, here are some intros to tmux: 0, 1, 2 and a cheat sheet.
tl;dr: It creates “sessions”, which seem to be connections to a host, which contain “windows”, which are virtual terminals within that session. Both of these persist across logging out and in again.
World’s shortest introduction:

```
tmux ls          # list sessions
tmux attach -t 0 # resume session 0
```

Inside a session, `Ctrl-b c` creates a new window, and `Ctrl-b p`/`Ctrl-b n` switch to the previous/next window.
Once I had used tmux for a while I discovered I wished to scroll backwards. There are various keyboard shortcuts, and a mouse scroll mode.
If the mouse scroll mode causes things to break after quitting tmux, and now clicking on the window causes this kind of crap: 0;38;15M 0;38;15m 0;60;12M0;60;12m0;56;14M0;56;14m0;56;14M0;56;14m0;54;13M0;54;13m0;54;13M0;54;13m
… eek!
Running reset puts things right in some circumstances. Other times (in hyper) reloading the terminal from the app window is necessary (Ctrl-Shift-R).
There are various other tools in the ecosystem, e.g. tmuxinator is a config tool for tmux.
Byobu is a GPLv3 open source text-based window manager and terminal multiplexer.… Byobu now includes an enhanced profiles, convenient keybindings, configuration utilities, and toggle-able system status notifications for both the GNU Screen window manager and the more modern Tmux terminal multiplexer, and works on most Linux, BSD, and Mac distributions.
byobu uses (by default) one socket; you can designate a particular socket and share ‘write’ permission with other users, and BAM, you have multi-session multi-user work for any application that can run in a terminal.
Question: since tmux can already work over named sockets, can tmux do this without special treatment from byobu?
mtm is an ultra-minimalist tmux replacement. I have not worked out which of the various features discussed here it implements.
As far as I can tell there is some kind of standardisation battle being waged.
HTM is an in-progress (?) terminal multiplexer standard.
Eternal Terminal and possibly hyper and possibly microsoft terminal may at some point support it.
It is supposed to be ‘more open’ than tmux’s control mode, although the documentation does not seem easier to locate, so it must be considered more open in some abstract moral sense that does not necessarily have a concrete impact on outsiders.
If HTM were to make progress, there would be a system for integrating various session managers/multiplexers into the GUI of the hosting terminal emulator in a cross-platform and mutually compatible way.
This sounds nice I suppose but not nice enough to wander into a poorly explained standards quagmire of fringe technologies.
I will not complain if this technology materialises and I am blessed with the chance to free-ride upon the efforts of others by using it.
Defining Gaussian processes by convolution of noise with smoothing kernels, which is a kind of dual to defining them through covariances.
This is especially interesting because it can be made computationally convenient (we can enforce locality) and non-stationary.
A convenient representation of a GP model uses process convolutions (Barry and Hoef 1996; Dave Higdon 2002; Thiebaux and Pedder 1987). One may construct a Gaussian process \(z(\mathbf{s})\) over a region \(\mathcal{S}\) by convolving a continuous, unit variance, white noise process \(x(\mathbf{s}),\) with a smoothing kernel \(k(\mathbf{s}):\) \[ z(\mathbf{s})=\int_{\mathcal{S}} k(\mathbf{u}-\mathbf{s}) x(\mathbf{u}) d \mathbf{u} \]
If we take \(x(\mathbf{s})\) to be an intrinsically stationary process with variogram \(\gamma_{x}(\mathbf{d})=\operatorname{Var}(x(\mathbf{s})-\) \(x(\mathbf{s}+\mathbf{d}))\) the resulting variogram of the process \(z(\mathbf{s})\) is given by \[ \gamma_{z}(\mathbf{d})=\gamma_{z}^{*}(\mathbf{d})-\gamma_{z}^{*}(\mathbf{0}) \text { where } \gamma_{z}^{*}(\mathbf{q})=\int_{\mathcal{S}} \int_{\mathcal{S}} k(\mathbf{v}-\mathbf{q}) k(\mathbf{u}-\mathbf{v}) \gamma_{x}(\mathbf{u}) d \mathbf{u} d \mathbf{v} \] …With this approach, one can fix the smoothing kernel \(k(\mathbf{s})\) and then modify the spatial dependence for \(z(\mathbf{s})\) by controlling \(\gamma_{x}(\mathbf{d}) .\)
e.g. Dave Higdon, Swall, and Kern (1999); David Higdon (1998). Alternatively we can fix the driving noise and vary the smoothing kernel. TBC.
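A discretized sketch of the construction (all names mine): replace the convolution integral by a matrix product \(z = Kx\) on a grid, so \(\operatorname{Cov}(z)=KK^{\top}\) for unit-variance white noise \(x\). That matrix is automatically a valid covariance, and approximately stationary away from the boundary.

```python
import numpy as np

# grid over the region S and a Gaussian smoothing kernel matrix
n = 200
s = np.linspace(0, 10, n)
ell = 0.5
K = np.exp(-0.5 * (s[:, None] - s[None, :]) ** 2 / ell**2)

# Cov(z) = K Cov(x) K^T = K K^T for unit-variance white noise x
C = K @ K.T
assert np.allclose(C, C.T)
assert np.linalg.eigvalsh(C).min() > -1e-8  # valid covariance

# away from the boundary the implied kernel is (approximately) stationary:
i, j = n // 2, n // 2 + 5
assert abs(C[i, j] - C[i + 10, j + 10]) < 1e-6 * C[i, i]
```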
The representation of certain random fields, especially Gaussian random fields as stochastic differential equations. This is the engine that makes filtering Gaussian processes go, and is also a natural framing for probabilistic spectral analysis.
I do not have much to say right now about this, but I am using it so watch this space.
The Gauss-Markov Random Field approach.
Warning: I’m taking crib notes for myself here, so I lazily switch between signal processing filter terminology and probabilist terminology. I assume Bochner’s and Yaglom’s Theorems as comprehensible methods for analysing covariance kernels.
Let’s start with stationary kernels. We consider an SDE \(f: \mathbb{R}\to\mathbb{R}\) at stationarity, with a Wiener process as driving noise. We are concerned with deriving the parameters of the SDE such that it has a given stationary covariance function \(k\).
If there are no zeros in the spectral density, then there are no poles in the inverse transfer function, and we can model it with an all-pole SDE. This includes all the classic Matérn functions. This is covered in Hartikainen and Särkkä (2010), and Lindgren, Rue, and Lindström (2011). Worked examples starting from a discrete time formulation are given in a tutorial introduction Grigorievskiy and Karhunen (2016).
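For concreteness, here is a numerical check of the classic all-pole example, the Matérn-3/2 kernel, in state-space form. The specific matrices are my transcription of the standard result (as in Hartikainen and Särkkä 2010), so treat them as an assumption; the comparison against the closed-form kernel is the sanity check:

```python
import numpy as np
from scipy.linalg import expm

# Matérn-3/2 kernel k(τ) = σ²(1 + λ|τ|) exp(−λ|τ|), with λ = √3/ℓ,
# realised as a linear SDE dx = F x dt + L dβ, observed through H = [1, 0].
sigma2, ell = 2.0, 0.5
lam = np.sqrt(3.0) / ell

F = np.array([[0.0, 1.0],
              [-lam**2, -2.0 * lam]])
H = np.array([1.0, 0.0])
# Stationary state covariance P∞ (solves F P + P Fᵀ + L q Lᵀ = 0;
# for Matérn-3/2 it is diagonal).
P_inf = np.diag([sigma2, lam**2 * sigma2])

def k_state_space(tau):
    """Covariance recovered from the SDE: H exp(Fτ) P∞ Hᵀ, for τ ≥ 0."""
    return H @ expm(F * tau) @ P_inf @ H

def k_matern32(tau):
    """Closed-form Matérn-3/2 covariance."""
    return sigma2 * (1.0 + lam * tau) * np.exp(-lam * tau)

taus = np.linspace(0.0, 2.0, 9)
assert np.allclose([k_state_space(t) for t in taus],
                   [k_matern32(t) for t in taus])
```

The same recipe (spectral density → rational transfer function → companion-form \(F\)) covers the whole Matérn family.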
More generally, (quasi-)periodic covariances have zeros and we need to find a full rational function approximation. Särkkä, Solin, and Hartikainen (2013) introduce one such method. Bolin and Lindgren (2011) explore a slightly different class.
Solin and Särkkä (2014) has a fancier method employing resonators a.k.a. filter banks, to address a concern of Steven Reece et al. (2014) that atomic spectral peaks in the Fourier transform are not well approximated by rational functions.
Bolin and Lindgren (2011) consider a general class of realisable systems, given by \[ \mathcal{L}_{1} X(\mathbf{s})=\mathcal{L}_{2} \mathcal{W}(\mathbf{s}) \] for some linear operators \(\mathcal{L}_{1}\) and \(\mathcal{L}_{2} .\)
In the case that \(\mathcal{L}_{1}\) and \(\mathcal{L}_{2}\) commute, this may be put in hierarchical form: \[\begin{aligned} \mathcal{L}_{1} X_{0}(\mathbf{s})&=\mathcal{W}(\mathbf{s})\\ X(\mathbf{s})&=\mathcal{L}_{2} X_{0}(\mathbf{s}). \end{aligned}\]
They explain
\(X(\mathbf{s})\) is simply \(\mathcal{L}_{2}\) applied to the solution one would obtain if \(\mathcal{L}_{2}\) were the identity operator.
They call this a nested PDE, although AFAICT you could also say ARMA. They are particularly interested in equations of this form: \[ \left(\kappa^{2}-\Delta\right)^{\alpha / 2} X(\mathbf{s})=\left(b+\mathbf{B}^{\top} \nabla\right) \mathcal{W}(\mathbf{s}) \]
The SPDE generating this class of models is \[ \left(\prod_{i=1}^{n_{1}}\left(\kappa^{2}-\Delta\right)^{\alpha_{i} / 2}\right) X(\mathbf{s})=\left(\prod_{i=1}^{n_{2}}\left(b_{i}+\mathbf{B}_{i}^{\top} \nabla\right)\right) \mathcal{W}(\mathbf{s}) \]
They show that spectral density for such an \(X(\mathbf{s})\) is given by \[ S(\mathbf{k})=\frac{\phi^{2}}{(2 \pi)^{d}} \frac{\prod_{j=1}^{n_{2}}\left(b_{j}^{2}+\mathbf{k}^{\top} \mathbf{B}_{j} \mathbf{B}_{j}^{\top} \mathbf{k}\right)}{\prod_{j=1}^{n_{1}}\left(\kappa_{j}^{2}+\|\mathbf{k}\|^{2}\right)^{\alpha_{j}}}. \]
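That spectral density is easy to evaluate numerically. A hedged sketch (my own transliteration of their formula; the parameter names are mine):

```python
import numpy as np

def nested_spde_spectrum(k, kappas, alphas, bs, Bs, phi=1.0):
    """Spectral density of the nested SPDE class above.

    k: (d,) wavenumber vector; kappas, alphas parameterise the
    'AR' (denominator) factors; bs, Bs the 'MA' (numerator) factors,
    with each B_i a (d,) vector so kᵀB Bᵀk = (kᵀB)².
    """
    d = len(k)
    num = np.prod([b**2 + (k @ B)**2 for b, B in zip(bs, Bs)])
    den = np.prod([(kap**2 + k @ k)**alpha
                   for kap, alpha in zip(kappas, alphas)])
    return phi**2 / (2 * np.pi)**d * num / den
```

With \(n_2=0\) numerator factors (i.e. `bs=[1], Bs=[0]` in 1-d) this reduces to the familiar Matérn-type spectrum \(\propto (\kappa^2+\|\mathbf{k}\|^2)^{-\alpha}\).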
See stochastic convolution or pragmatically, assume Gaussianity and see Gaussian convolution processes.
Stochastic processes by convolution of white noise with smoothing kernels.
For now this effectively means Gaussian processes.
This is usually in the context of Gaussian processes where everything can work out nicely if you are lucky, but other kernel machines are OK too. The goal for most of these is to maximise the marginal posterior likelihood, a.k.a. model evidence, as is conventional in Bayesian ML.
🏗
Automating kernel design by some composition of simpler atomic kernels. AFAICT this started from summaries like Genton (2001) and went via Duvenaud’s aforementioned notes to become a small industry (Lloyd et al. 2014; Duvenaud, Nickisch, and Rasmussen 2011; Duvenaud et al. 2013; Grosse et al. 2012). A prominent example was the Automated Statistician project by David Duvenaud, James Robert Lloyd, Roger Grosse and colleagues, which works by greedy combinatorial search over possible compositions.
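A toy sketch of the greedy composition idea (the atomic kernels and scoring below are my own minimal stand-ins, not the Automatic Statistician’s actual grammar): kernels are closed under sums and products, and candidate compositions are scored by GP log marginal likelihood:

```python
import numpy as np

# A hypothetical minimal set of atomic kernels, as callables on scalars.
atoms = {
    "RBF": lambda a, b: np.exp(-0.5 * (a - b)**2),
    "PER": lambda a, b: np.exp(-2.0 * np.sin(np.pi * (a - b))**2),
    "LIN": lambda a, b: a * b,
}

def combine(k1, k2, op):
    # Sums and products of kernels are kernels (closure properties).
    if op == "+":
        return lambda a, b: k1(a, b) + k2(a, b)
    return lambda a, b: k1(a, b) * k2(a, b)

def log_marginal(kern, x, y, noise=0.1):
    """GP log marginal likelihood, the usual search objective."""
    K = np.array([[kern(a, b) for b in x] for a in x])
    K += noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * len(x) * np.log(2 * np.pi))

# One greedy step: expand a base kernel by every (atom, op) pair,
# keep whichever expansion scores best on the data.
x = np.linspace(0, 4, 30)
y = np.sin(2 * np.pi * x) + 0.1 * x          # toy data: periodic + trend
candidates = {f"PER{op}{name}": combine(atoms["PER"], k2, op)
              for name, k2 in atoms.items() for op in "+*"}
best = max(candidates, key=lambda n: log_marginal(candidates[n], x, y))
```

The real systems iterate this step, penalise complexity, and optimise the kernel hyperparameters at every node of the search, which is where the expense comes in.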
More fashionable, presumably, are the differentiable search methods. For example, the AutoGP system (Krauth et al. 2016; Bonilla, Krauth, and Dezfouli 2019) incorporates tricks like these to use gradient descent to design kernels for Gaussian processes. Sun et al. (2018) construct deep networks of composed kernels. I imagine the Deep Gaussian Process literature is also of this kind, but I have not read it.
Kernels on kernels, for kernel learning kernels. 🏗 (Ong, Smola, and Williamson 2005, 2002; Ong and Smola 2003; Kondor and Jebara 2006)
Split off from autoML.
The art of choosing the best hyperparameters for your ML model’s algorithms, of which there may be many.
Should you bother getting fancy about this? Ben Recht argues no, that random search is competitive with highly tuned Bayesian methods in hyperparameter tuning. Kevin Jamieson argues you can be cleverer than that though. Let’s inhale some hype.
Loosely, we think of interpolating between observations of a loss surface and guessing where the optimal point is. See Bayesian optimisation. This is generic. It is not as popular in practice as I might have assumed, because it turns out to be fairly greedy with data and does not exploit problem-specific ideas such as early stopping, which saves time and is in any case a useful type of neural net regularisation.
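By way of contrast, random search really is embarrassingly simple; a sketch with a hypothetical toy loss standing in for an expensive training run:

```python
import math
import random

def expensive_loss(lr, depth):
    """Stand-in for a full training run; a hypothetical toy objective
    minimised near lr = 1e-2, depth = 4."""
    return (math.log10(lr) + 2) ** 2 + (depth - 4) ** 2 / 8

random.seed(0)
trials = []
for _ in range(50):
    # Sample hyperparameters from the prior: log-uniform lr, uniform depth.
    lr = 10 ** random.uniform(-5, 0)
    depth = random.randint(1, 10)
    trials.append((expensive_loss(lr, depth), lr, depth))

best_loss, best_lr, best_depth = min(trials)
```

Bergstra and Bengio’s argument for this is that in high dimensions only a few hyperparameters matter, and random search covers each marginal far better than a grid does.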
Maclaurin, Duvenaud, and Adams (2015):
Each meta-iteration runs an entire training run of stochastic gradient descent to optimize elementary parameters (weights 1 and 2). Gradients of the validation loss with respect to hyperparameters are then computed by propagating gradients back through the elementary training iterations. Hyperparameters (in this case, learning rate and momentum schedules) are then updated in the direction of this hypergradient. … The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and meta-learning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the meta-learning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector.
Their implementation, hypergrad, is no longer maintained. Possibly the same idea, drmad by Fu et al. (2016), is also not maintained.
This is a neat trick, but it has at least one clear limitation: it generally requires an estimate of the overfitting penalty, in the style of a degrees-of-freedom penalty. There are also various assumptions on the optimisation and model process that I forget right now, but they resemble the setting of learning ODEs and so are possibly worth examining through that lens.
Just what you would think.
Now it comes in an adaptive flavour that leverages the SGD fitting method: Hyperband (Lisha Li et al. 2017) and ASHA (Liam Li et al. 2020).
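The core of Hyperband/ASHA is successive halving. A toy sketch (my simplification, which ignores the asynchrony and bracket scheduling of the real algorithms): train every config briefly, keep the top \(1/\eta\) fraction, and multiply the surviving configs’ budget by \(\eta\):

```python
import random

def successive_halving(configs, train_step, max_rung=81, eta=3):
    """One bracket of successive halving: evaluate all configs at the
    current budget, keep the best 1/eta (by lowest loss), grow the budget."""
    rung = 1
    while len(configs) > 1 and rung <= max_rung:
        scores = {c: train_step(c, rung) for c in configs}
        keep = max(1, len(configs) // eta)
        configs = sorted(configs, key=scores.get)[:keep]
        rung *= eta
    return configs[0]

# Toy example: "training" a config for longer gives a less noisy loss
# estimate; the true optimum of this made-up loss is near 0.3.
random.seed(1)
def train_step(c, steps):
    return abs(c - 0.3) + random.gauss(0, 0.1 / steps)

best = successive_halving([i / 10 for i in range(10)], train_step)
```

The point is that bad configs are killed after cheap low-budget evaluations, so most of the compute goes to the plausible ones.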
Most of the implementations here use, internally, a surrogate model for parameter tuning, but wrap it with some tools to control and launch experiments in parallel, early termination etc.
Arranged so that the top few are hyped and popular, and after that come less renowned hipster options.
Not yet filed:
determined includes hyperparameter tuning which is not in fact a surrogate surface, but an early-stopping pruning of crappy models in a random search, i.e. fancy random search.
Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Core features:
- Launch a multi-node distributed hyperparameter sweep in less than 10 lines of code.
- Supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.
- Automatically manages checkpoints and logging to TensorBoard.
- Choose among state of the art algorithms such as Population Based Training (PBT), BayesOptSearch, HyperBand/ASHA. (Liam Li et al. 2020)
optuna (Akiba et al. 2019) supports fancy neural net training; similar to hyperopt AFAICT, except that it supports Covariance Matrix Adaptation, whatever that is (see Hansen 2016).
Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to our define-by-run API, the code written with Optuna enjoys high modularity, and the user of Optuna can dynamically construct the search spaces for the hyperparameters.
hyperopt (J. Bergstra, Yamins, and Cox 2013) is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.
Currently two algorithms are implemented in hyperopt:
- Random Search
- Tree of Parzen Estimators (TPE)
Hyperopt has been designed to accommodate Bayesian optimization algorithms based on Gaussian processes and regression trees, but these are not currently implemented.
All algorithms can be run either serially, or in parallel by communicating via MongoDB or Apache Spark
auto-sklearn has recently been upgraded; details TBD (Feurer et al. 2020).
skopt (aka scikit-optimize)
[…] is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.
Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper (Snoek, Larochelle, and Adams 2012).
The code consists of several parts. It is designed to be modular to allow swapping out various ‘driver’ and ‘chooser’ modules. The ‘chooser’ modules are implementations of acquisition functions such as expected improvement, UCB or random. The drivers determine how experiments are distributed and run on the system. As the code is designed to run experiments in parallel (spawning a new experiment as soon a result comes in), this requires some engineering.
Spearmint2 is similar, but more recently updated and fancier; however it has a restrictive license prohibiting wide redistribution without the payment of fees. You may or may not wish to trust the implied level of development and support of 4 Harvard Professors, depending on your application.
Both of the Spearmint options (especially the latter) have opinionated choices of technology stack in order to do their optimizations, which means they can do more work for you, but require more setup, than a simple little thing like skopt. Depending on your computing environment this might be an overall plus or a minus.
SMAC (AGPLv3) (sequential model-based algorithm configuration) is a versatile tool for optimizing algorithm parameters (or the parameters of some other process we can run automatically, or a function we can evaluate, such as a simulation).
SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.
We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.
Python interface through pysmac.
Won the land-grab for the name automl, but is now unmaintained.
A quick overview of buzzwords, this project automates:
- Analytics (pass in data, and auto_ml will tell you the relationship of each variable to what it is you’re trying to predict).
- Feature Engineering (particularly around dates, and soon, NLP).
- Robust Scaling (turning all values into their scaled versions between the range of 0 and 1, in a way that is robust to outliers, and works with sparse matrices).
- Feature Selection (picking only the features that actually prove useful).
- Data formatting (turning a list of dictionaries into a sparse matrix, one-hot encoding categorical variables, taking the natural log of y for regression problems).
- Model Selection (which model works best for your problem).
- Hyperparameter Optimization (what hyperparameters work best for that model).
- Ensembling Subpredictors (automatically training up models to predict smaller problems within the meta problem).
- Ensembling Weak Estimators (automatically training up weak models on the larger problem itself, to inform the meta-estimator’s decision).
Signalling games, subcultures, slang, El Farol Bars, anti-inductive systems, glass bead games, fake martial arts, level of simulacra, genre speciation, as transmitted on the social information graph …
Question: Is this a dig at Yudkowsky?
Robin Hanson argues against irony for being outgroup-exclusionary. I don’t think a blanket discouragement of irony is plausible or desirable, but… the insight is useful. It is important to remember that indicators of in-group membership, such as irony, are shibboleths, not indicators of quality.
This one is interesting even though it is not one of Hanson’s best ideas, IMO. I like it as a trial balloon for some better ideas that do not conflate local culture with hostility to wider culture, or assume the plausibility of operating without in-groups.
David Chapman’s Geeks, MOPs and Sociopaths model explains subcultural dynamics as a business model, looking at how fun things become mass-market things.
The Phatic and the anti-inductive.
Douglas Adams once said there was a theory that if anyone ever understood the Universe, it would disappear and be replaced by something even more incomprehensible. He added that there was another theory that this had already happened.
These sorts of things — things such that if you understand them, they get more complicated until you don’t — are called “anti-inductive”.
Right is the new Left. Metacontrarianism. Nice things:
But what, then, in such a place, is nice? Is nice now an aesthetic, a style, rather than a substance? And hence, is it the aesthetic of having the aesthetic? Has the thing been banished by the symbolic representation of the thing? Does this tyranny of superficial niceness inevitably create a particular ideological cascade? It seems to. Does it banish truth, slowly pressuring all into conformity, via the Scott Alexander quote that cannot be repeated enough times, so here it is again:
Sometimes I can almost feel this happening. First I believe something is true, and say so. Then I realize it’s considered low-status and cringeworthy. Then I make a principled decision to avoid saying it — or say it only in a very careful way — in order to protect my reputation and ability to participate in society. Then when other people say it, I start looking down on them for being bad at public relations. Then I start looking down on them just for being low-status or cringeworthy. Finally the idea of “low-status” and “bad and wrong” have merged so fully in my mind that the idea seems terrible and ridiculous to me, and I only remember it’s true if I force myself to explicitly consider the question. And even then, it’s in a condescending way, where I feel like the people who say it’s true deserve low status for not being smart enough to remember not to say it. This is endemic, and I try to quash it when I notice it, but I don’t know how many times it’s slipped my notice all the way to the point where I can no longer remember the truth of the original statement.
Explaining those odd terms the alt-right use to troll their opponents, and complaints about virtue signallers as virtue signalling.
When I have a moment I would like to muster some thoughts about how in-group and outgroup signals like these are weaponised in practice, and the types of coordinations that are robust for society.
Closer to home, consider how “political correctness” accusations are always explosive at family dinner. 🏗
see memetics.
Fake Martial Arts, A Disorientation
However, even fighters who enjoy making fun of fake martial arts in the “not useful for MMA” sense are still quick to admit that all martial arts are largely fake […]. Ramsey Dewey in particular enjoys demonstrating the fallacy of trying to fight off multiple opponents, or a single opponent with a knife, as opposed to employing wiser (but not useful for MMA!) strategies like diplomacy, going armed, making better life choices, and running the fuck away.…
I have been wondering if there are intellectual equivalents of fake martial arts. The highly ritualized and hierarchical culture of the legal courtroom is one example of how epistemic battles are rendered civilized, and in some ways less real. Yet one can’t deny that legal systems are having a genuine and ongoing encounter with reality, from top to bottom. They are embedded in reality. Their fakenesses serve realities on other levels.
Self-proclaimed premium mediocre intellectual Venkatesh Rao defines premium mediocre:
Premium mediocre is the finest bottle of wine at Olive Garden. Premium mediocre is cupcakes and froyo. Premium mediocre is “truffle” oil on anything […], and extra-leg-room seats in Economy. Premium mediocre is cruise ships, artisan pizza, Game of Thrones, and The Bellagio.
Premium mediocre is food that Instagrams better than it tastes.
Adil Majid thinks it through with a little more precision.
A related concept might be bugman technology. Caveat: that term has snide connotations rooted in romantic struggle against a corporate feminism surveillance complex. This movement (?) does not seem to me to have a compelling diagnosis of the world’s ills.
Placeholder.
I work on public opinion, media and climate politics. My latest paper talks about how reporting on events looks very different depending on the ideological color of the media outlet.
Giulio Rossetti, author of various network analysis libraries such as dynetx and ndlib.
Writ large, Everything is Obvious (once you know the answer) is Duncan Watts’ laundry list of examples of how all our intuitions about how society works are self-justifying guesses divorced from evidence, except for his, because Yahoo let him build his own experimental online social networks. Bastard. Don’t let my jealousy stop you from reading the book, which will indeed reassure you of what the title says in some surprising ways.
That’s mimetic, not memetic, although there are points of contact. Girard apparently wrote about our desires being often about being something rather than having something. Alex Danco summarizes a few choice morsels:
at a deep neurological level, when we watch other people and pattern our desires off theirs, we are not so much acquiring a desire for that object so much as learning to mimic somebody, and striving to become them or become like them. Girard calls this phenomenon mimetic desire. We don’t want; we want to be.
I do not know what neurological level he is attempting to evoke here. Perhaps some of that mirror-neuron business.
Modern status forums like Instagram are designed explicitly to bring out this dual admiration/resentment emotion within us. Instagram’s real product isn’t photos; it’s likes. The photos and the events they depict are just the transient objects that bubble up to the surface; what really matters is the relationship between the people. But the fact that Instagram’s product is built around the objects and not the models isn’t an accident: it’s sneaky. It creates way more space and oxygen for resentment and desperation to grow beneath the surface. It’s not about the photo or what it depicts; it’s always about the other person.
Or see Byrne Hobart:
We’re used to thinking of desire as something that emerges organically: you want something, and you try to get it. Sometimes, it’s easy; sometimes, there’s competition.
To Girard, that’s all wrong: you want something because of competition. Success is just a story you tell yourself about your desire for your rivals to fail.
A classic. See pluralistic ignorance
The self-perpetuation and amplification of some already difficult pathologies through the contemporary mediascape is where we are all collectively really doomed. e.g. Toxoplasma of rage by Scott Alexander
More important, unarmed black people are killed by police or other security officers about twice a week according to official statistics, and probably much more often than that. You’re saying none of these shootings, hundreds each year, made as good a flagship case as Michael Brown? In all this gigantic pile of bodies, you couldn’t find one of them who hadn’t just robbed a convenience store? Not a single one who didn’t have ten eyewitnesses and the forensic evidence all saying he started it?
I propose that the Michael Brown case went viral — rather than the Eric Garner case or any of the hundreds of others — because of the PETA Principle. It was controversial. A bunch of people said it was an outrage. A bunch of other people said Brown totally started it, and the officer involved was a victim of a liberal media that was hungry to paint his desperate self-defence as racist, and so the people calling it an outrage were themselves an outrage. Everyone got a great opportunity to signal allegiance to their own political tribe and discuss how the opposing political tribe were vile racists / evil race-hustlers. There was a steady stream of potentially triggering articles to share on Facebook to provoke your friends and enemies to counter-share articles that would trigger you.
Podcasts are a thing. Since being an academic has destroyed all the joy of reading the written word, audio is my remaining narrative pleasure. (C&C audiobooks).
Podcasts are a vexing artform. Like many artforms delivered serially they tend to start ragged, peak, then become boring parodies of themselves. Presumably this is because it takes a while to build up an audience, and then target and profile that audience, so there is an incentive to produce never-ending content torrents, because that is the easiest format to financialise. As a result, they overflow the banks of their concept, or at least dilute it until the flavour is weak.
Even at the peak of any given show, podcasts have highly variable killer-to-filler ratios. Accordingly I rate podcasts by how frequently they seem to me to be killer, as a hint as to how many episodes you might wish to audition before giving up and deciding my recommendation is not for you. The more a podcast apes the radio and commits to a regular schedule, or commits to advertisers and needs to make up volume, the more likely it is that any given episode is unexciting filler. Alternatively, some of them are not laser-focussed on my interests alone, which is allowed I suppose.
Podcasts set the bar for shallow engagement with science, history and current affairs. “I heard in a podcast that…” denotes that I heard something interesting but failed to treat it as important enough to follow up. This is probably what I should expect from an artform which is designed to keep me exactly engaged enough to do the dishes without breaking the dishes or getting bored.
On the plus side, podcasts for now open the door to some odd and interesting voices who entertain and occasionally inform me cheaply, and in a way that seems somehow less pathological and/or sleep disrupting than prestige TV shows.
So, recommendations. Having mentioned that recommending some of them is fraught because the unschooled might start on a waste-of-time or off-message episode, let us do it. These concerns aside, there is entertaining stuff being produced, and educational stuff, and titillating stuff, and stuff that gives a strong but baseless sense that it is educational by making you sound erudite at parties without requiring substantive effort or transformative understanding. It is hard to search for that last quality, but here, try a podcast search engine, Listen Notes. Open Culture’s podcast list also has some excellent recommendations. It skews a little virtuous and improving for my smutty lowbrow tastes, but I got some good ideas from there.
For now I am dumping some names which I will link to and expound at some hypothetical time in the future when I have leisure, but for now I can at least remember that I need to reference them.
There are some other podcasts that I have auditioned, but none with so high an ROI for my own purposes as these. Ones that failed to make my list did so because
This is not to say that those other ones will not appeal to you, dear reader, just that they did not appeal to me so much that I diverted work time into linking to them.
Further, this is a growth area and there are more that I have not yet given a proper listen. Here are some more podcasts auditioning for a spot on the list.
Honourable mention, DeepMind: The Podcast (feed). Now defunct. In-house promotional podcast from Deepmind that does pretty well at explaining your AI job to your grandparents.
a.k.a. snob gossip.
My tastes run puerile. Everything here should be considered to have a content warning for swears, sexual content and lowbrowness.
When I lived in Southeast Asia the podcast scene was not massive. But now it is.
I would like some more podcasts from Indonesia and about Indonesia, but also the East Asian/Southeast Asian context generally.
Benjamin Walker’s Theory of Everything. Unable to listen without contemplating how much more fun this show would be for me to record than to listen to.
Night vale. Enjoy the merch but did not enjoy the listening required to earn the moral claim to the merch.
Joe Rogan experience. I like the idea, but I do not have the time for this kind of marathon entertainment, for the same reason I do not have time for Netflix binges. I think that part of the attraction is supposed to be entering a fugue state of interviewness. I would possibly listen to Joe Rogan edited highlights?
See podcasting.
A jumble of electronic words upon the electronic jumbling of other words.
I am thinking about form here. You might find more content-related stuff at narrative or rhetoric.
An amulet is a kind of poem that depends on language, code, and luck. To qualify, a poem must satisfy these criteria:
- Its complete Unicode text is 64 bytes or less. [1]
- The hexadecimal SHA-256 hash of the text includes four or more 8s in a row. [2]
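Both criteria are easy to check mechanically with stdlib `hashlib`; a small sketch:

```python
import hashlib

def is_amulet(text: str) -> bool:
    """Check both amulet criteria: the complete UTF-8 text is 64 bytes
    or less, and the hex SHA-256 digest contains at least four
    consecutive '8's."""
    if len(text.encode("utf-8")) > 64:
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return "8888" in digest

# Most short strings are not amulets; the rarity is the point.
```

Brute-forcing candidate texts through this check is one way to mine for amulets, though writing a poem worth keeping is the hard part.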
Jeff Noon, Cobralingus, and his “post-futurist manifesto”, Poemage, rhymedesign.
Machine-learning-based rhyme and portmanteau generation? Rhymebrain. Or perhaps botnik writer:
The web prototype of the Botnik predictive keyboard offers word suggestions based on any text you feed it. Load a text file via the menu in the top left, then write using the grid of options.
Tools to put interactive stories online, in a texty way.
Easy to play: We created a word-on-word interaction mechanic suitable for touchscreen phones and tablets, as well as web browsers.
Easy to create: Our WYSIWYG editor makes composition and design a right-brained, no code affair — right in the browser.
Easy to share: Click a button to publish and publicly share your work on social media. Or download an .html file to host it yourself or share via email.
See Em Short’s inform7 intro
Not sure.
ink claims to be a literate coding language for narrative that is more like writing than coding. Also
The powerful scripting language behind Heaven's Vault, 80 Days and Sorcery!
Presumably those were good games?
On tools for presentation.
The default options:
Both of these are a colossal waste of time, adding little to my research while sucking energy into a black hole of trying to give a shit about niggly presentation and file-type compatibility issues. Both, last time I looked, had a terrible mathematical-equation typesetting workflow, although Keynote went beyond terrible to abysmal in this area. Admittedly, the last time I looked was in 2011.
An alternative strategy might re-use the documentation, code, maths markup and/or graphs from my actual research articles and code.
It turns out that this is not hard as such, merely harder than it should be. Read on for options.
Aside: I was going to try out the cloud-based Microsoft Sway which addresses some of my Powerpoint irritations. However it also supports no mathematics so is once again a non-starter for me.
These are secretly HTML slides, but that is more of an implementation detail. My current default choice, supporting web display, mathematical markup, code, and dynamic interaction. Recommended. My preferred option for RMarkdown is outputting HTML slides, covered below, but it is also worth mentioning that it can output PDF slides, and crucially, Powerpoint.
RMarkdown Powerpoint is simple.
The main trick worth knowing is the speaker-notes syntax, which looks like this.
::: notes
This is a speaker note.
:::
Generate a PDF using your choice of technology, then display the PDF in presentation mode using your PDF viewer. AFAICT this cannot play multimedia stuff.
Scribus is a good open-source desktop publishing system (think InDesign, but free). I discovered it while writing some posters, but it has many features to recommend it for PDFs also, including an in-built LaTeX math renderer and good support for vector graphics, and it will even render arbitrary graphs from weird command-line software automagically via render frames.
Beamer, the LaTeX slide thingy, also works. It is hard to do too exciting a design, but this in itself is perhaps a feature. It lends a certain reassurance to the audience that you are performing the rituals of academia in a Right and Proper fashion.
Note that BibTeX does not work with beamer. BibLaTeX, by contrast, seems to.
knitr does support beamer slides too, although why would you want to do that?
There is no hope of making beamer match my corporate style guide in any sensible timeframe, but I could choose the closest approximation by looking at the example theme matrix. OTOH, who actually meets corporate style guidelines? No academic I know at any academic conference I have been to.
Jupyter supports reveal.js HTML slide output. TBD.
In addition to the basic reveal.js integration in jupyter, there is even a convenient (although slightly restricted) version of reveal available to display and interact with jupyter notebooks as slideshows via RISE.
Customisation of style etc for RISE:
HTML slides are powerful because they can leverage all the powers of web browsers, which are a powerful execution platform these days. Many of the systems mentioned above output to HTML slides, but you can also use them bareback.
reveal.js and remark.js are the best I have tried. Reveal.js seems more popular, although a little over-engineered. There are many additional options listed below.
reveal.js is the poster child for HTML slides. The online editor makes it easier to collaborate with my non-HTML-nerd colleagues.
Creating themes seems to require you to fork reveal’s GitHub repo for full generality, which feels a bit weird. Although in principle one can still just inject CSS stylesheets as made famous by the web, right? Either way, in practice theming can be a rabbit hole of flexbox and responsive media queries.
Do you need to preview slideshows to colleagues who are allergic to html? You can export as a PDF, although it’s not quite as immediate as you’d hope.
See also sundry workflow notes.
remark.js is mostly similar to reveal.js, maybe more minimalist and a little tidier.
It has native R support via xaringan, which is Yihui Xie’s favourite. It has such luxuries as presenter notes and an interactive style generator.
It was not immediately obvious which dialects of markdown are supported by remark.js (not pandoc!). Specifically, it supports GFM and CommonMark; the documentation for the former is superior.
The classic is Eric Meyer’s S5, although it’s showing its age. No longer recommended.
Elegant but no longer maintained, deck.js …
marp is an integrated markdown presentation writer
impress.js does prezi-style fancy slide animations. Documentation is spartan.
DZslides is not my favourite but worth mentioning because it’s used as a knitr example, although in fact you could also use S5 or reveal.js.
an interactive tutorial on making interactive tutorials using d3.js (which would presumably also work fine with reveal.js).
## TODO lists
wunderpresentation turns other things into presentations. For the moment, notably, trello.
Presentations: the quantum of information for all parts of society for which the quantum of information is not a tweet or a Facebook status update. Powerpoint presentations are purported to have various oft-cited defects, but these I will not discuss here. In my trade they are a necessary evil. I’m all about harm minimisation of that evil, by minimising the amount of time I must waste on it.
So how do you actually give a good presentation?
See presentation tools.
On communicating to my peers and students.
Assertion-evidence is one school of presentation aesthetic (Garner and Alley 2013, 2016).
See perhaps Dave Richeson, How to Present a Mathematical Proof or Problem. Popular: Patrick Winston’s How to Speak, a self-demonstrating lecture on best practice.
There is a tension between the need to keep slides text-filled to serve as a document of a lecture for reference, and a minimalist jumping-off point for thought.
Selling your work to non-specialists
Emma Donnelly of Comm-it has been coaching us on this.
Asch conformity experiments, Schelling points, illusory norms via pluralistic ignorance or majority illusions.
Anders Sandberg, on AI versus social norm enforcement:
We are subject to norm enforcement from friends and strangers all the time. What is new is the application of media and automation. They scale up the stakes and add the possibility of automated enforcement […] . Automated enforcement makes the panopticon effect far stronger: instead of suspecting a possibility of being observed it is a near certainty. So the net effect is stronger, more pervasive norm enforcement…
…of norms that can be observed and accurately assessed. Jaywalking is transparent in a way being rude or selfish often isn’t. We may end up in a situation where we carefully obey some norms, not because they are the most important but because they can be monitored.
I think Anders is approaching Goodhart’s law.
A Jewish concept I frequently need: mar'it ayin
When I do something that looks wrong, even if I have a perfectly good and innocent explanation, the damage is done. …If I enter a non-kosher restaurant to use the facilities, while I have not broken any law of keeping kosher, I have bridged the divide between kosher and not kosher, and invite others to do the same. But there’s a deeper reason not to do something that just looks wrong, even if it isn’t wrong, and even if no one is looking. And that is because not only can such activity affect others, it can affect us too. Actors know that when you play a character, you can sometimes become that character. The self we project to others can sometimes be absorbed into our own identity. And so by looking like you are doing something wrong, you may come to actually do it.
Not that this is a precept for my life, as such, but I do need to discuss this concept from time to time. In particular, I think this gets at an element of conspicuous morality that it is best not to dismiss as “mere” virtue signalling.
A computer symbolic algebra system.
I’m all about open-source tools, as a rule.
Mathematica is not that.
But the fact remains that the best table of integrals that exists is Mathematica, that emergent epiphenomenon of the cellular automaton that implements Stephen Wolfram’s mind.
I should probably work out what else it does while I have their seductively cheap student-license edition chugging away, rather than begrudging contested access to a small number of corporate licenses.
What are the most common pitfalls awaiting new users?
The substitution operator is /. which is terrible to search for. The help is easiest to find under the alias ReplaceAll.
{x, x^2, y, z} /. x -> 1
{x, x^2, y, z} /. {x -> 1,y ->2, z->x}
Typing symbols is easy; just use the combination of Esc and autocomplete.
Most tutorials have you executing everything in promiscuous global scope.
Since I am not a heavy Mathematica user, half my time is spent debugging problems with stale definitions and weird scope behaviour.
A large part of what remains is worrying about when code is executed.
You can get a local scope with Block.
Multiple clashing function definitions will hang around, silently conflicting with one another; use ClearAll to remove all the definitions of a term to avoid this.
As David Reiss points out, if I define
g[x_]:=x^2
g[2]:="cheese"
then when I execute g[2] I get "cheese" and not 4.
This is also about evaluation time — for more on that see below.
The easiest way to get a fresh start for some overloaded name, as far as I can see, is:
ClearAll["Global`*"]
Keywords to understand are Hold and Delayed.
The confusing terminology here is pure functions.
dsaad[t_] = Q[t] /. First @ DSolve[{Q''[t] + 40 Q'[t] + 625 Q[t] == 100*Cos[10*t],
Q[0] == 0, Q'[0] == 0}, Q, t]
Here are some worked examples of the nuts and bolts of this: Function output from DSolve, How to define a function based on the output of DSolve?.
Here are some links that I have found useful.
You want a fancy basis for your vector space? Try frames! You might care in this case about restricted isometry properties.
Morgenshtern and Bölcskei (Morgenshtern and Bölcskei 2011):
Hilbert spaces and the associated concept of orthonormal bases are of fundamental importance in signal processing, communications, control, and information theory. However, linear independence and orthonormality of the basis elements impose constraints that often make it difficult to have the basis elements satisfy additional desirable properties. This calls for a theory of signal decompositions that is flexible enough to accommodate decompositions into possibly nonorthogonal and redundant signal sets. The theory of frames provides such a tool. This chapter is an introduction to the theory of frames, which was developed by Duffin and Schaeffer (Duffin and Schaeffer 1952) and popularized mostly through (Ingrid Daubechies 1992; I. Daubechies 1990; Heil and Walnut 1989; YoungIntroduction2001?). Meanwhile frame theory, in particular the aspect of redundancy in signal expansions, has found numerous applications such as, e.g., denoising, code division multiple access (CDMA), orthogonal frequency division multiplexing (OFDM) systems, coding theory, quantum information theory, analog-to-digital (A/D) converters, and compressive sensing (Candès and Tao 2006; David L. Donoho 2006; David L. Donoho and Elad 2003). A more extensive list of relevant references can be found in (Kovačević and Chebira 2008). For a comprehensive treatment of frame theory we refer to the excellent textbook (Christensen 2016).
A compact signal-processing-oriented intro for engineers is Jorgensen and Song (2007).
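To make the tight-frame idea concrete, here is a tiny numpy sketch (my own toy example, not from any of the cited texts) of the Mercedes-Benz frame in \(\mathbb{R}^2\): three unit vectors at 120° spacing, redundant and non-orthogonal, yet admitting an orthonormal-basis-like reconstruction.

```python
import numpy as np

# Mercedes-Benz frame: three unit vectors at 120-degree spacing in R^2.
# Redundant (3 vectors in a 2-dim space) and not orthogonal, yet *tight*:
# the frame operator S = sum_i f_i f_i^T is a multiple of the identity,
# so reconstruction x = (2/3) sum_i <x, f_i> f_i works like an ONB.
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
F = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # rows are frame vectors

S = F.T @ F                        # frame operator, equals (3/2) * I
x = np.array([0.3, -1.2])
x_rec = (2 / 3) * F.T @ (F @ x)    # analysis, then rescaled synthesis
```

Three vectors in the plane cannot be linearly independent, but the frame operator being \(\tfrac{3}{2}I\) is all that reconstruction needs.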
Quantifying difference between probability measures. Measuring the distribution itself, for, e.g. badness of approximation of a statistical fit. The theory of binary experiments. You probably care about these because you want to work with empirical observations of data drawn from a given distribution, to test for independence or do hypothesis testing or model selection, or density estimation, or to model convergence for some random variable, or probability inequalities, or to model the distinguishability of the distributions from some process and a generative model of it, as seen in generative adversarial learning. That kind of thing. Frequently the distance here is between a measure and an empirical estimate thereof, but this is no requirement.
A good choice of probability metric might give you a convenient distribution of a test statistic, an efficient loss function to target, simple convergence behaviour for some class of estimator, or simply a warm fuzzy glow.
“Distance” and “metric” both often imply symmetric functions obeying the triangle inequality, but on this page we have a broader church, and include pre-metrics, metric-like functions which still “go to zero when two things get similar”, without including the other axioms of distances. These are also called divergences. This is still useful for the aforementioned convergence results. I’ll use “true metric” or “true distance” to make it clear when needed. “Contrast” is probably better here, but is less common.
🏗 talk about triangle inequalities.
tl;dr: don’t read my summary, read the summaries. One interesting one, although it pre-dated the renewed mania for Wasserstein metrics, is the Reid and Williamson epic (Reid and Williamson 2011), which, in the quiet solitude of my own skull, I refer to as One regret to rule them all and in divergence bound them.
There is also a nifty omnibus of classic relations in Gibbs and Su:
Relationships among probability metrics. A directed arrow from A to B annotated by a function \(h(x)\) means that \(d_A \leq h(d_B)\). The symbol diam Ω denotes the diameter of the probability space Ω; bounds involving it are only useful if Ω is bounded. For Ω finite, \(d_{\text{min}} = \inf_{x,y\in\Omega} d(x,y).\) The probability metrics take arguments μ,ν; “ν dom μ” indicates that the given bound only holds if ν dominates μ. […]
Yuling Yao gives us an intuition by considering point mass approximations
- The mean of the posterior density minimizes the L2 risk. The mode of the posterior density minimizes the KL divergence to it. … Put another way, the MAP is always the spiky variational inference approximation to the exact posterior density.
- …The posterior median minimizes the Wasserstein metric of order 1 and the posterior mean minimizes the Wasserstein metric of order 2.
Well now, this is my fancy name for it. But this is probably the most familiar to many, as it’s a vanilla functional norm-induced metric applied to probability distributions on the state space of the random variable.
The “usual” norms can be applied to densities. Most famously, \(L_p\) norms (which I will call \(L_k\) norms because I am using \(p\) for a density).
When written like this, the norm is taken between densities, i.e. Radon-Nikodym derivatives, not distributions. (Although see the Kolmogorov metric for an application of the \(k=\infty\) norm to cumulative distributions.)
A little more generally, consider some RV \(X\sim P\) taking values on \(\mathbb{R}\) with a Radon-Nikodym derivative (a.k.a. density) continuous with respect to the Lebesgue measure \(\lambda\), \(p=dP/d\lambda\).
\[\begin{aligned} L_k(P,Q)&:= \left\|\frac{dP-dQ}{d\lambda}\right\|_k\\ &=\left[\int \left|\frac{dP-dQ}{d\lambda}\right|^k d\lambda\right]^{1/k} \end{aligned}\]
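For concreteness, here is a small numerical sketch (my own; the grid, the Gaussian densities, and the Riemann-sum quadrature are all arbitrary choices) of the \(L_2\) distance between two densities:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def lk_density_distance(p, q, grid, k=2):
    """L_k distance between two densities sampled on a uniform grid,
    approximating the integral by a Riemann sum."""
    dx = grid[1] - grid[0]
    return (np.sum(np.abs(p - q) ** k) * dx) ** (1.0 / k)

grid = np.linspace(-10.0, 10.0, 20001)
p = normal_pdf(grid, 0.0, 1.0)
q = normal_pdf(grid, 1.0, 1.0)

d_same = lk_density_distance(p, p, grid)  # identical densities: distance 0
d_diff = lk_density_distance(p, q, grid)  # shifted density: positive distance
```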
\(L_2\) norms are classics for kernel density estimates, because they allow you to use lots of tasty machinery of spectral function approximation.
\(L_k,\, k\geq 1\) norms do observe the triangle inequality, and \(L_2\) norms have lots of additional features, such as Wiener filtering formulations and Parseval’s identity, and you get a convenient Hilbert space for free.
There are the standard facts about \(L_k,\,k\geq 1\) spaces over a probability measure (i.e. norms of arbitrary measurable functions), e.g. norm domination
\[j>k\geq 1 \Rightarrow \|f\|_j\geq\|f\|_k\]
Hölder’s inequality for probabilities
\[1/k + 1/j \leq 1 \Rightarrow \|fg\|_1\leq \|f\|_k\|g\|_j\]
and the Minkowski (i.e. triangle) inequality
\[\|x+y\|_k \leq \|x\|_k+\|y\|_k\]
However, the \(L_k\) norm on densities is an awkward choice of distance on a probability space.
If you transform the random variable by anything other than a linear transform, then your distances transform in an arbitrary way. And we haven’t exploited the non-negativity of probability densities, so it might feel as if we are wasting some information: if our estimated density satisfies \(q(x)<0,\;\forall x\in A\) for some non-empty interval \(A\), then we know it’s plain wrong, since probability is never negative.
Also, such norms are not necessarily convenient. Exercise: Given \(N\) i.i.d samples drawn from \(X\sim P= \text{Norm}(\mu,\sigma)\), find a closed form expression for estimates \((\hat{\mu}_N, \hat{\sigma}_N)\) such that the distance \(E_P\|(p-\hat{p})\|_2\) is minimised.
Doing this directly is hard; but indirectly it can work: if we try to directly minimise a different distance, such as the KL divergence, we can squeeze the \(L_2\) distance. 🏗 come back to this point.
Finally, these feel like they set up an inappropriate problem to solve statistically, since an error is penalised equally everywhere in the state space; why are errors penalised just as much where \(p\simeq 0\) as where \(p\gg 0\)? Surely there are cases where we care more, or less, about such areas? That leads to, for example…
Why not call \(P\) close to \(Q\) if closeness depends on the probability weighting of that place? Specifically, some divergence \(R\) like this, using an increasing scalar function \(\psi\) and pointwise loss \(\ell\)
\[R(P,Q):=\psi(E_Q(\ell(p(x), q(x))))\]
If we are going to measure divergence here, we also want the properties that \(P=Q\Rightarrow R(P,Q)=0\) and \(R(P,Q)> 0 \Rightarrow P\neq Q\). We can get this if we choose some increasing \(\psi\) and \(\ell(s,t)\) such that
\[ \begin{aligned} \begin{array}{rl} \ell(s,t) \geq 0 &\text{ for } s\neq t\\ \ell(s,t)=0 &\text{ for } s=t\\ \end{array} \end{aligned} \]
Let \(\psi\) be the identity function for now, and concentrate on the fiddly bit, \(\ell\). We try a form of function that exploits the non-negativity of densities and penalises the ratio of the densities (i.e. the derivative of one distribution with respect to the other):
\[\ell(s,t) := \phi(s/t)\]
If \(p(x)=q(x)\) then \(p(x)/q(x)=1\). So to get the right sort of penalty, we choose \(\phi\) to have a minimum where the argument is 1, i.e. \(\phi(1)=0\) and \(\phi(t)\geq 0,\ \forall t\).
It turns out that it’s also wise to take \(\phi\) to be convex. (Exercise: why?) And note that for these not to explode we now require \(P\) to be dominated by \(Q\) (i.e. \(Q(A)=0\Rightarrow P(A)=0,\ \forall A \in\text{Borel}(\mathbb{R})\)).
Putting this all together, we have a family of divergences
\[D_\phi(P,Q) := E_Q\phi\left(\frac{dP}{dQ}\right)\]
And BAM! These are the \(\phi\)-divergences. You get a different one for each choice of \(\phi\).
a.k.a. Csiszár divergences, \(f\)-divergences or Ali-Silvey distances, after the people who noticed them (Csiszár 1972; Ali and Silvey 1966).
These are in general mere premetrics. And note they are no longer in general symmetric: we should not necessarily expect
\[D_\phi(Q,P) = E_P\phi\left(\frac{dQ}{dP}\right)\]
to be equal to
\[D_\phi(P,Q) = E_Q\phi\left(\frac{dP}{dQ}\right)\]
Anyway, back to concreteness, and recall our well-behaved continuous random variables; we can write, in this case,
\[D_\phi(P,Q) = \int_\mathbb{R}\phi\left(\frac{p(x)}{q(x)}\right)q(x)dx\]
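Here is a minimal discrete-case sketch in numpy (toy distributions of my own); plugging in different \(\phi\)s recovers the divergences discussed below:

```python
import numpy as np

def phi_divergence(phi, p, q):
    """D_phi(P,Q) = E_Q[ phi(dP/dQ) ] for discrete distributions p, q.
    Assumes P is dominated by Q (q_i > 0 wherever p_i > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    keep = q > 0
    return float(np.sum(q[keep] * phi(p[keep] / q[keep])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = phi_divergence(lambda t: t * np.log(t), p, q)    # phi(t) = t log t
tv = phi_divergence(lambda t: np.abs(t - 1), p, q)    # phi(t) = |t - 1|
chi2 = phi_divergence(lambda t: (t - 1) ** 2, p, q)   # phi(t) = (t - 1)^2
```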
Let’s explore some \(\phi\)s.
We take \(\phi(t)=t \ln t\), and write the corresponding divergence \(D_\text{KL}=\operatorname{KL}\),
\[\begin{aligned} \operatorname{KL}(P,Q) &= E_Q\phi\left(\frac{p(x)}{q(x)}\right) \\ &= \int_\mathbb{R}\phi\left(\frac{p(x)}{q(x)}\right)q(x)dx \\ &= \int_\mathbb{R}\left(\frac{p(x)}{q(x)}\right)\ln \left(\frac{p(x)}{q(x)}\right) q(x)dx \\ &= \int_\mathbb{R} \ln \left(\frac{p(x)}{q(x)}\right) p(x)dx \end{aligned}\]
Indeed, if \(P\) is absolutely continuous wrt \(Q\),
\[\operatorname{KL}(P,Q) = E_P\log \left(\frac{dP}{dQ}\right)\]
This is one of many possible derivations of the Kullback-Leibler divergence a.k.a. KL divergence, or relative entropy; It pops up because of, e.g., information-theoretic significance.
🏗 revisit in maximum likelihood and variational inference settings, where we have good algorithms exploiting its nice properties.
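A minimal discrete sketch (toy numbers of my own) showing that KL is a divergence rather than a distance, i.e. it is not symmetric:

```python
import numpy as np

def kl(p, q):
    """Discrete KL(P,Q) = sum_i p_i log(p_i/q_i), assuming q_i > 0 where p_i > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    keep = p > 0
    return float(np.sum(p[keep] * np.log(p[keep] / q[keep])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

forward = kl(p, q)   # KL(P,Q)
reverse = kl(q, p)   # KL(Q,P): differs in general
```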
Take \(\phi(t)=|t-1|\). We write \(\delta(P,Q)\) for the divergence. I will use the set \(A:=\left\{x:\frac{dP}{dQ}\geq 1\right\}=\{x:dP\geq dQ\}.\)
\[\begin{aligned} \delta(P,Q) &= E_Q\left|\frac{dP}{dQ}-1\right| \\ &= \int_A \left(\frac{dP}{dQ}-1 \right)dQ - \int_{A^C} \left(\frac{dP}{dQ}-1 \right)dQ\\ &= \int_A \frac{dP}{dQ} dQ - \int_A 1 dQ - \int_{A^C} \frac{dP}{dQ}dQ + \int_{A^C} 1 dQ\\ &= \int_A dP - \int_A dQ - \int_{A^C} dP + \int_{A^C} dQ\\ &= P(A) - Q(A) - P(A^C) + Q(A^C)\\ &= 2[P(A) - Q(A)] \\ &= 2[Q(A^C) - P(A^C)] \\ \text{ i.e. } &= 2\left[P(\{dP\geq dQ\})-Q(\{dQ\geq dP\})\right] \end{aligned}\]
Here I have used the standard fact that for any probability measure \(P\) and \(P\)-measurable set \(A\), it holds that \(P(A^C)=1-P(A)\).
Equivalently, over measurable sets \(B\),
\[\delta(P,Q) = 2\sup_{B} \left\{ |P(B) - Q(B)| \right\}\]
(the supremum on its own is the more common normalisation of total variation; the \(\phi\)-divergence form above is twice that).
To see that \(A\) attains that supremum, note that for any set \(B\supseteq A\), writing \(B:=A\cup D\) for some \(D\) disjoint from \(A\), it follows that \(|P(B) - Q(B)|\leq |P(A) - Q(A)|\), since on \(D\) we have \(dP/dQ\leq 1\) by construction.
It should be clear that this is symmetric.
Supposedly, (Khosravifard, Fooladivanda, and Gulliver 2007) show that this is the only possible f-divergence which is also a true distance, but I can’t access that paper to see how.
🏗 Prove that for myself. Is the representation of divergences as “simple” divergences helpful? See (Reid and Williamson 2009) (credited to Österreicher and Vajda).
Interestingly, as Djalil Chafaï points out, total variation has a coupling characterisation (in the \(\sup_B\) normalisation, i.e. half the \(\delta\) above):
\[ \tfrac{1}{2}\delta(P,Q) =\inf_{X \sim P,Y \sim Q}\mathbb{P}(X\neq Y) \]
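A quick numerical sanity check (toy distributions of my own) that the \(\phi\)-divergence form \(E_Q|dP/dQ-1|\) and the set form \(2[P(A)-Q(A)]\) agree:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# phi-divergence form: E_Q |dP/dQ - 1| = sum_i |p_i - q_i|
delta_phi = np.sum(np.abs(p - q))

# set form: 2 [P(A) - Q(A)] with A = {i : p_i >= q_i}
A = p >= q
delta_set = 2 * (p[A].sum() - q[A].sum())
```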
For this one, we write \(H^2(P,Q)\), and take \(\phi(t):=(\sqrt{t}-1)^2\). Step-by-step, that becomes
\[\begin{aligned} H^2(P,Q) &:=E_Q \left(\sqrt{\frac{dP}{dQ}}-1\right)^2 \\ &= \int \left(\sqrt{\frac{dP}{dQ}}-1\right)^2 dQ\\ &= \int \frac{dP}{dQ} dQ -2\int \sqrt{\frac{dP}{dQ}} dQ +\int dQ\\ &= \int dP -2\int \sqrt{\frac{dP}{dQ}} dQ +\int dQ\\ &= \int \sqrt{dP}^2 -2\int \sqrt{dP}\sqrt{dQ} +\int \sqrt{dQ}^2\\ &=\int (\sqrt{dP}-\sqrt{dQ})^2 \end{aligned}\]
It turns out to be another symmetrical \(\phi\)-divergence. The square root of the Hellinger divergence \(H=\sqrt{H^2}\) is the Hellinger distance on the space of probability measures which is a true distance. (Exercise: prove.)
It doesn’t look intuitive, but has convenient properties for proving inequalities (simple relationships with other norms, triangle inequality) and magically good estimation properties (Beran 1977), e.g. in robust statistics.
🏗 make some of these “convenient properties” explicit.
For now, see Djalil who defines both Hellinger distance
\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.\]
and Hellinger affinity
\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).\]
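A discrete sanity check (toy numbers of my own) of symmetry and of the identity \(H^2 = 2 - 2A\) relating Hellinger divergence and Hellinger affinity:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# H^2(P,Q) = sum_i (sqrt(p_i) - sqrt(q_i))^2 -- manifestly symmetric
h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
h2_rev = np.sum((np.sqrt(q) - np.sqrt(p)) ** 2)

# Hellinger affinity A(P,Q) = sum_i sqrt(p_i q_i), with H^2 = 2 - 2A
affinity = np.sum(np.sqrt(p * q))
```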
a.k.a. Rényi divergences, which are a sub-family of the \(f\)-divergences with a particular parameterisation. Includes KL, reverse KL and Hellinger as special cases.
We take \(\phi(t):=\frac{4}{1-\alpha^2} \left(1-t^{(1+\alpha )/2}\right).\)
This gets fiddly to write out in full generality, with various undefined or infinite integrals needing definitions in terms of limits and is supposed to be constructed in terms of “Hellinger integral”…? I will ignore that for now and write out a simple enough version. See (Liese and Vajda 2006; van Erven and Harremoës 2014) for gory details.
\[D_\alpha(P,Q):=\frac{1}{\alpha-1}\log\int \left(\frac{p}{q}\right)^{\alpha-1}dP\]
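A minimal discrete sketch (my own; I use the standard discrete form \(D_\alpha = \frac{1}{\alpha-1}\log \sum_i p_i^\alpha q_i^{1-\alpha}\), which recovers KL as \(\alpha\to 1\) and the Hellinger affinity at \(\alpha=\tfrac12\)):

```python
import numpy as np

def renyi(p, q, alpha):
    """Discrete Renyi divergence, alpha > 0, alpha != 1:
    D_alpha(P,Q) = log( sum_i p_i^alpha q_i^(1-alpha) ) / (alpha - 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.4])

d_half = renyi(p, q, 0.5)        # equals -2 log(Hellinger affinity)
d_near_one = renyi(p, q, 0.999)  # approaches KL(P,Q)
```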
As made famous by count data significance tests.
For this one, we write \(\chi^2\), and take \(\phi(t):=(t-1)^2\). Then, by the same old process…
\[\begin{aligned} \chi^2(P,Q) &:=E_Q \left(\frac{dP}{dQ}-1\right)^2 \\ &= \int \left(\frac{dP}{dQ}-1\right)^2 dQ\\ &= \int \left(\frac{dP}{dQ}\right)^2 dQ - 2 \int \frac{dP}{dQ} dQ + \int dQ\\ &= \int \frac{dP}{dQ} dP - 1 \end{aligned}\]
Normally you see this for discrete data indexed by \(i\), in which case we may write
\[\begin{aligned} \chi^2(P,Q) &= \left(\sum_i \frac{p_i}{q_i} p_i\right) - 1\\ &= \sum_i\left( \frac{p_i^2}{q_i} - q_i\right)\\ &= \sum_i \frac{p_i^2-q_i^2}{q_i}\\ \end{aligned}\]
If you have constructed these discrete probability mass functions from \(N\) samples, say, \(p_i:=\frac{n^P_i}{N}\) and \(q_i:=\frac{n^Q_i}{N}\), this becomes
\[\chi^2(P,Q) = \sum_i \frac{(n^P_i)^2-(n^Q_i)^2}{Nn^Q_i}\]
This is probably familiar from some primordial statistics class.
The main use of this one is its ancient pedigree (used by Pearson in 1900, according to Wikipedia) and its non-controversiality, so you include it in lists wherein you wish to mention that you have a hipper alternative.
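A quick numpy check (toy counts of my own) that the probability form \(\sum_i p_i^2/q_i - 1\) and the count form agree:

```python
import numpy as np

N = 100
n_p = np.array([50, 30, 20])   # counts yielding p_i = n_p_i / N
n_q = np.array([20, 30, 50])   # counts yielding q_i = n_q_i / N

p, q = n_p / N, n_q / N

chi2_prob = np.sum(p ** 2 / q) - 1                       # sum_i p_i^2/q_i - 1
chi2_counts = np.sum((n_p ** 2 - n_q ** 2) / (N * n_q))  # count form
```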
🏗 Reverse Pinsker inequalities (e.g. (Berend, Harremoës, and Kontorovich 2012)), and covering numbers and other such horrors.
Wrt the total variation distance,
\[H^2(P,Q) \leq \delta(P,Q) \leq \sqrt 2 H(P,Q)\,.\]
\[H^2(P,Q) \leq \operatorname{KL}(P,Q)\]
Additionally,
\[0\leq H^2(P,Q) \leq H(P,Q) \leq 1\]
(Berend, Harremoës, and Kontorovich 2012) attribute this to Csiszár (1967 article I could not find) and Kullback (Kullback 1970, 1967) instead of (Pinsker 1980) (which is in any case in Russian and I haven’t read it).
\[\delta(P,Q) \leq \sqrt{\frac{1}{2} D_{K L}(P\|Q)}\]
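A Monte Carlo sanity check of Pinsker’s inequality (note: here I use the \(\sup_B|P(B)-Q(B)|\) normalisation of total variation, i.e. \(\tfrac12\sum_i|p_i-q_i|\) in the discrete case, which is the one matching the \(\sqrt{\operatorname{KL}/2}\) bound):

```python
import numpy as np

rng = np.random.default_rng(42)

def pinsker_gap(p, q):
    """sqrt(KL/2) - delta, with delta = sup_B |P(B)-Q(B)| = (1/2) sum|p-q|.
    Nonnegative whenever Pinsker's inequality holds."""
    delta = 0.5 * np.sum(np.abs(p - q))
    kl = np.sum(p * np.log(p / q))
    return np.sqrt(0.5 * kl) - delta

# random pairs of distributions on 5 points, all strictly positive a.s.
gaps = [pinsker_gap(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
        for _ in range(1000)]
```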
(Reid and Williamson 2009) derive the best-possible generalised Pinsker inequalities, in a certain sense of “best” and “generalised”, i.e. they are tight bounds, but not necessarily convenient.
Here are the most useful 3 of their inequalities: (\(P,Q\) arguments omitted)
\[\begin{aligned} H^2 &\geq 2-\sqrt{4-\delta^2} \\ \chi^2 &\geq \mathbb{I}\{\delta\lt 1\}\delta^2+\mathbb{I}\{\delta\lt 1\}\frac{\delta}{2-\delta}\\ \operatorname{KL} &\geq \min_{\beta\in [\delta-2,2-\delta]}\left(\frac{\delta+2-\beta}{4}\right) \log\left(\frac{\beta-2-\delta}{\beta-2+\delta}\right) + \left(\frac{\beta+2-\delta}{4}\right) \log\left(\frac{\beta+2-\delta}{\beta+2+\delta}\right) \end{aligned}\]
🏗 For now, see Smola et al. (2007). Weaponized in Gretton et al. (2008) as an independence test.
Included:
Total Variation
Kantorovich/Wasserstein/Mass transport. (🏗 make precise)
Fortet-Mourier
Lipschitz (?)
Maximum Mean Discrepancy, esp. RKHS-based, e.g. (Smola et al. 2007). Homework: can you use RKHS methods in all of these?
Analysed as integral probability metrics.
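As a taster, a minimal biased MMD estimator with an RBF kernel (my own sketch, not Smola et al.’s exact code; the bandwidth and sample sizes are arbitrary choices):

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased estimator of squared MMD between 1-D samples, RBF kernel."""
    def gram(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 500)
y_same = rng.normal(0.0, 1.0, 500)   # same distribution: MMD^2 near zero
y_shift = rng.normal(3.0, 1.0, 500)  # shifted distribution: MMD^2 large
```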
This got complicated. See Optimal transport metrics, where I broke out the info into its own page for my ease of reference.
Specifically \((p,\nu)\)-Fisher distances, in the terminology of (J. H. Huggins et al. 2018). They use these distances as a computationally tractable proxy for Wasserstein distance.
For a Borel measure \(\nu\), let \(L_p(\nu)\) denote the space of functions that are \(p\)-integrable with respect to \(\nu\): \(\phi \in L_p(\nu) \Leftrightarrow \|\phi\|_{L_p(\nu)} = \left(\int|\phi(\theta)|^p\,\nu(d\theta)\right)^{1/p} < \infty\). Let \(U = -\log \frac{d\eta}{d\theta}\) and \(\hat{U} = -\log \frac{d\hat{\eta}}{d\theta}\) denote the potential energy functions associated with, respectively, \(\eta\) and \(\hat{\eta}\).
Then the \((p, \nu)\)-Fisher distance is given by \[\begin{aligned} d_{p,\nu}(\eta,\hat{\eta}) &=\left\|\|\nabla U-\nabla \hat{U}\|_2\right\|_{L_p(\nu)}\\ &= \left(\int\|\nabla U(\theta)-\nabla \hat{U}(\theta)\|_2^p\,\nu(d\theta)\right)^{1/p}. \end{aligned}\]
This avoids an inconvenient posterior normalising calculation in Bayes.
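A sketch of how cheap this is to estimate (my own toy example: two unit-variance 1-D Gaussians, \(\nu\) standard normal, plain Monte Carlo; for a Gaussian the potential gradient is available in closed form and normalising constants drop out, which is the computational point):

```python
import numpy as np

rng = np.random.default_rng(3)

# For a 1-D Gaussian N(mu, sigma^2): U(t) = (t-mu)^2/(2 sigma^2) + const,
# so grad U(t) = (t - mu)/sigma^2; the normalising constant vanishes.
def grad_U(theta, mu, sigma):
    return (theta - mu) / sigma ** 2

# (2, nu)-Fisher distance between N(0,1) and N(1,1), nu = N(0,1),
# estimated by Monte Carlo over nu.
theta = rng.normal(0.0, 1.0, 100_000)
diff = np.abs(grad_U(theta, 0.0, 1.0) - grad_U(theta, 1.0, 1.0))
d_2_nu = np.mean(diff ** 2) ** 0.5   # here the gradient gap is 1 everywhere
```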
\[D_L(P,Q) := \inf\{\epsilon >0: P(x-\epsilon)-\epsilon \leq Q(x)\leq P(x+\epsilon)+\epsilon,\ \forall x\}\]
\[D_K(P,Q):= \sup_x \left\{ |P(x) - Q(x)| \right\}\]
Nonetheless it does look similar to Total Variation, doesn’t it?
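An empirical-CDF sketch (my own; `kolmogorov_distance` is a made-up name, and the distributions and sample sizes are arbitrary):

```python
import numpy as np

def kolmogorov_distance(xs, ys):
    """sup_x |F_X(x) - F_Y(x)| between empirical CDFs: the statistic
    behind the two-sample Kolmogorov-Smirnov test."""
    grid = np.sort(np.concatenate([xs, ys]))
    fx = np.searchsorted(np.sort(xs), grid, side="right") / len(xs)
    fy = np.searchsorted(np.sort(ys), grid, side="right") / len(ys)
    return float(np.max(np.abs(fx - fy)))

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, 2000)
b = rng.normal(0.0, 1.0, 2000)
c = rng.normal(2.0, 1.0, 2000)

d_same = kolmogorov_distance(a, b)   # small: same distribution
d_diff = kolmogorov_distance(a, c)   # large: shifted distribution
```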
What even are the Kuiper and Prokhorov metrics?
There is a synthesis of the importance of the topologies induced by each of these metrics, which I read in (Arjovsky, Chintala, and Bottou 2017), and which they credit to (Billingsley 2013; Villani 2009).
In this paper, we direct our attention on the various ways to measure how close the model distribution and the real distribution are, or equivalently, on the various ways to define a distance or divergence \(\rho(P_{\theta},P_{r})\). The most fundamental difference between such distances is their impact on the convergence of sequences of probability distributions. A sequence of distributions \((P_{t}) _{t\in \mathbb{N}}\) converges if and only if there is a distribution \(P_{\infty}\) such that \(\rho(P_{\theta} , P_{r} )\) tends to zero, something that depends on how exactly the distance \(\rho\) is defined. Informally, a distance \(\rho\) induces a weaker topology when it makes it easier for a sequence of distribution to converge. […]
In order to optimize the parameter \({\theta}\), it is of course desirable to define our model distribution \(P_{\theta}\) in a manner that makes the mapping \({\theta} \mapsto P_{\theta}\) continuous. Continuity means that when a sequence of parameters \(\theta_t\) converges to \({\theta}\), the distributions \(P_{\theta_t}\) also converge to \(P_{\theta}.\) […] If \(\rho\) is our notion of distance between two distributions, we would like to have a loss function \(\theta \mapsto\rho(P_{\theta},P_r)\) that is continuous […]
In which I discover for myself whether “multi-task” and “co-regionalized” approaches are different. Álvarez, Rosasco, and Lawrence (2012)
Overview from Invenia: Gaussian Processes: from one to many outputs
[the] community has begun to turn its attention to covariance functions for multiple outputs. One of the paradigms that has been considered (Bonilla, Chai, and Williams 2007; Osborne et al. 2008; Seeger, Teh, and Jordan 2005) is known in the geostatistics literature as the linear model of coregionalization (LMC). In the LMC, the covariance function is expressed as the sum of Kronecker products between coregionalization matrices and a set of underlying covariance functions. The correlations across the outputs are expressed in the coregionalization matrices, while the underlying covariance functions express the correlation between different data points.
Multitask learning has also been approached from the perspective of regularization theory (Evgeniou and Pontil, 2004; Evgeniou et al., 2005). These multitask kernels are obtained as generalizations of the regularization theory to vector-valued functions. They can also be seen as examples of LMCs applied to linear transformations of the input space.
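To make the Kronecker structure of the LMC concrete, here is a minimal numpy sketch (my own; rank-one coregionalization matrices \(B_q = a_q a_q^\top\), RBF component kernels, and all parameter values are arbitrary choices):

```python
import numpy as np

def rbf_gram(x, lengthscale):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

# Linear model of coregionalization: K = sum_q  B_q kron k_q(X, X),
# with B_q = a_q a_q^T rank-one coregionalization matrices (2 outputs here).
x = np.linspace(0.0, 1.0, 20)      # 20 shared inputs
a1 = np.array([[1.0], [0.5]])      # mixing weights for component 1
a2 = np.array([[0.2], [1.0]])      # mixing weights for component 2
K = (np.kron(a1 @ a1.T, rbf_gram(x, 0.3))
     + np.kron(a2 @ a2.T, rbf_gram(x, 1.0)))   # (2*20) x (2*20) covariance
```

Each Kronecker term couples the outputs through \(B_q\) while the component kernel \(k_q\) handles correlation across inputs; the sum remains symmetric and positive semi-definite, hence a valid multi-output covariance.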
This repository provides a toolkit to perform multi-output GP regression with kernels that are designed to utilize correlation information among channels in order to better model signals. The toolkit is mainly targeted to time-series, and includes plotting functions for the case of single input with multiple outputs (time series with several channels).
The main kernel corresponds to Multi Output Spectral Mixture Kernel, which correlates every pair of data points (irrespective of their channel of origin) to model the signals. This kernel is specified in detail in Parra and Tobar (2017)
Video chats for meetings and the like. One-on-one services we have sorted elsewhere; in particular, most chat clients seem to go at least OK-ish.
Also Hangouts, Zoom, Red Dead Redemption, Gather…
Basic level: How to protect your meetings from zoom bombing. How to run a productive meeting.
Smartarse level: Automate your participation in pointless zoom meetings! Style transfer your face to some other face while videoconferencing for fun and amusement! Automatically switch off the video if your underpants are visible! Play space invaders. Get a better webcam.
Benedict Evans Solving online events
A conference, or an ‘event’, is a bundle. There is content from a stage, with people talking or presenting or doing panels and maybe taking questions. Then, everyone talks to each other in the hallways and over coffee and lunch and drinks. […] there are all of the meetings that you schedule because everyone is there. At a really big ‘conference’ many people don’t even go to the actual event itself. […]
The only part of that bundle that obviously works online today is the content. It’s really straightforward to turn a conference presentation or a panel into a video stream, but none of the rest is straightforward at all.
First, we haven’t worked out good online tools for many of the reasons people go to these events. Most obviously, we don’t have any software tool for bumping into people in the same field by random chance and having a great conversation. No-one has ever really managed to take a networking event and put it online.[…] In other words, some conferences are built around creating a network in the hallways. If you take them online, there are no hallways.
Webb on simulating coffee-table encounters mentions an example at “!!Con 2020”. There are many attempts, e.g. Minglr (Song, Riedl, and Malone 2020) is a tool designed to facilitate mingling.
Can a better UI solve this issue? Robert Fabricant We’re in a golden age of UX. Why is video chat still stuck in the ’90s?
Also, who decided that a scrolling, vertical feed was the right way to represent the comments and input from the broader group of participants in Zoomland? What if our questions hung in the air around our heads (like thought bubbles) until we felt that they were sufficiently addressed? What if the other participants could migrate closer to us to show that they too, want to discuss that issue or topic? Perhaps the speech bubble could grow if others “join” and begin to crowd into the video window of the person who is leading the discussion until they respond?
Center for Scientific Collaboration and Community Engagement studies this stuff, producing guides, e.g. Woodley et al. (2020), pitched as Designing engaging online events to counteract Zoom fatigue and get stuff done
Groups of six people or less are randomly assigned to breakout rooms, providing participants the opportunity to network and collaborate with new people. (It can be difficult for all participants to feel included and contribute within larger groups). Participants take turns introducing themselves and something about the work they are doing, the work they would like to do, or issues that are important to them. The group then decides on which idea to take forward, and collectively brainstorms how to get the project off the ground or how to solve the problem. A Collaborative Ideas session works well as a 60-90 minute session, providing enough time for participants to explore, discuss, and transcribe their ideas.
ICLR 2020 seems to be widely acclaimed as a best-practice online conference. Their post about it should be mined for tips. Also because they are big nerds, they did machine learning on themselves
Where it made sense, we wanted to use machine learning in the design of the conference programme and experience itself. We used a latent variable model for review score calibration, a vision model to extract thumbnails from each paper to be used on the web, natural language tools to visualise related papers, and recommender systems to create sets of balanced recommendations for papers and participants
gather.town attempts to simulated an 8-bit-videogame style virtual environment for discussion and so on. Very popular in the ML conference circuit.
Similar: Lozya
Jitsi is open source and free, and AFAICT the only one here which makes an attempt to protect your data from the company which hosts it. It’s browser based and requires no download on desktop.
Go ahead, video chat with the whole team. In fact, invite everyone you know. Jitsi Meet is a fully encrypted, 100% open source video conferencing solution that you can use all day, every day, for free — with no account needed.
I would like to know about the moderation features.
Having used it successfully I recommend it; it handles unlimited-length meetings with high security, unlike commercial competitors. I am told it can in practice handle up to 35-ish participants. It will permit up to 75 to attempt to meet.
Minglr, mentioned above, is a hack of jitsi designed to foment ad hoc, serendipitous discussion. I do not know how good it is, but the fact that jitsi enables this kind of thing is a good argument for it.
Even though many people have found today’s commonly used videoconferencing systems very useful, these systems do not provide support for one of the most important aspects of in-person meetings: the ad hoc, private conversations that happen before, after, and during the breaks of scheduled events—the proverbial hallway conversations. Here we describe our design of a simple system, called Minglr, which supports this kind of interaction by facilitating the efficient matching of conversational partners. We also describe a study of this system’s use at the ACM Collective Intelligence 2020 virtual conference. Analysis of our survey and system log data provides evidence for the usefulness of this capability, showing, for example, that 86% of people who used the system successfully at the conference thought that future virtual conferences should include a tool with similar functionality. We expect similar functionality to be incorporated in other videoconferencing systems and to be useful for many other kinds of business and social meetings, thus increasing the desirability and feasibility of many kinds of remote work and socializing.
mmhmm is a multi-person-presentation-oriented video conference app.
Zoom seems to be the current default. It is famed for its lackadaisical approach to the confidentiality of users and their conversations:
Zoom’s privacy page states:
“Whether you have a Zoom account or not, we may collect Personal Data from or about you when you use or otherwise interact with our Products.”
This includes, but is not limited to, your physical address, phone number, your job title, credit and debit card information, your Facebook account, your IP address, your OS and device details, and more.
It has been caught covertly tracking you for Facebook. (UPDATE: fixed when busted.) It actively monitors content in some ill-explained way. There has been such a copious fountain of Zoom security gaffes that Zoom-bashing feels like nearly as large a growth industry. I call this the Zoom gloom boom. The latest Zoom-gloomer is the company’s own CEO, who admitted they intend to monitor users.
Corporate clients will get access to Zoom’s end-to-end encryption service now being developed, but Yuan said free users won’t enjoy that level of privacy, which makes it impossible for third parties to decipher communications.
“Free users for sure we don’t want to give that because we also want to work together with FBI, with local law enforcement in case some people use Zoom for a bad purpose,” Yuan said on the call.
As with Skype, you can reduce the harm of this bit of spyware by not using the app, instead keeping the video chat sandboxed in your browser. Regardless of such precautions, it, like Skype, is not end-to-end encrypted, so you should expect your calls to be intercepted by nation-states or other actors, in addition to whatever the company does internally with your data.
Skype is elderly video software whose major advantage is inertia. Downsides include the fact that it is honeypot spyware, which reads your passwords and records your messages for unaccountable American surveillance programs. But you have some colleagues who are determined to use it, of course. You could run it in a docker jail, but it is probably simpler to use the web client. This will still let them monitor your calls, but at least it won’t waste your disk space or require you to install their suspect dumpster fire of an app.
This is the message I sent around my house:
Oh my friends, my beloved flatties! My fine and fair fellow travelers! How proud I am of your assiduous hydration! How proud I am of your attention to staunchly high fibre diets! How amply these are each demonstrated by both the marvellous power of your streams of urine, and the magnificent gravity of dambuster turds you are capable of dropping! However, may I recommend if you use the upstairs toilet when I am in a video meeting, you close BOTH doors to mask the sound. Because otherwise, anything between 1 and 100 people across our fair nation will also be acoustically invited to admire your laudably prodigious natural functions. And then: How could I restrain myself from enthusing about said functions at length in the middle of a meeting? Then my meeting will be off the rails and our nation’s science will grind to a halt! What I am saying is: Please help me to restrain myself from needing to gush effusively about your effusive gushing.
Maybe check out someone else’s listing if you want to go deeper on that question.
Three wyrd sisters, poster, presentation and paper attend upon the ritual for conjuring an academic career. Of these, poster is the least regarded, but is sometimes necessary for all that.
Get some hot advice from…
See PDFs for some useful practical mechanics of design and printing.
Pro-tip: if you are plotting using Python, be aware that seaborn has a poster mode (set_context("poster")) for scaling line widths and fonts sensibly.
I do not know if biorender poster is any good, but it was created AFAICT by scientific communication experts, unlike many tools in this domain which were created by scientists or graphic designers, or in the case of Microsoft Powerpoint, a committee containing none of the above.
Scribus is a good open source desktop publisher system (think InDesign, but free, with the good and bad that that entails.)
The in-built LaTeX renderer does not support big font sizes per default but one can force that manually by overriding the supplied preamble.
This still doesn’t get you the correct margins, which matters for long equations. For that you need Chloé-Agathe Azencott’s LaTeX geometry hacks. Combining these we get this kind of thing
\documentclass[$scribus_fontsize$]{extarticle}
\usepackage[left=0cm,top=0cm,right=0cm,bottom=0cm,nohead,nofoot, paperwidth=$scribus_realwidth$pt,paperheight=$scribus_realheight$pt]{geometry}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{xcolor}
\usepackage{fourier}% uses Utopia font for text and math
\begin{document}
{
\fontsize{48pt}{48pt}
\selectfont
\begin{align*}
\mathcal{A}\{c\phi\}(\xi)&= c^2\mathcal{A}\{\phi\}(\xi)\\
\mathcal{A}\{\phi(r t)\}(\xi)&= \frac{1}{r} \mathcal{A}\{\phi\}\left(\frac{\xi}{r}\right)\\
\mathsf{E}\left[ \mathcal{A}\{S_1\phi + S_2\phi'\}(\xi)\right]
&=\mathcal{A}\{\phi\}(\xi)+ \mathcal{A}\{\phi'\}(\xi) \\
\end{align*}
where \(\{S_i\}\) are i.i.d. Rademacher variables.
}
\end{document}
This page mostly exists to collect a good selection of overview statistics introductions that are not terrible. I’m especially interested in modern fusion methods that harmonise what we would call statistics with machine learning methods, and dispel the unnecessary terminological confusion between those traditions.
Here are some recommended courses to get started if you don’t know what you’re doing.
See also the recommended texts below. May I draw your attention especially to Kroese et al. (2019), which I proof-read for my supervisor Zdravko Botev, and enjoyed greatly? It smoothly bridges non-statistics mathematicians into applied statistics, without being excruciating, unlike layperson introductions.
There are also statistics podcasts.
Socialist calculation debate. Computational complexity and command-and-control-economics?
Classic: Cosma Shalizi: In Soviet Union, Optimization Problem Solves You.
Levine (2016) has a coda:
But as a wild extrapolation of the far future of financial capitalism, I submit to you that it is less silly than the "Silent Road to Serfdom" thesis. That thesis is that, in the long run, financial markets will tend toward mindlessness, a sort of central planning — by an index fund — that is worse than 1950s communism because it’s not even trying to make the right decisions.
The alternative view is that, in the long run, financial markets will tend toward perfect knowledge, a sort of central planning — by the Best Capital Allocating Robot — that is better than Marxism because it is perfectly informed and ideally rational.
Also interesting linkage to other thinkpieces on various capitalism ↔︎ communism convergence.
I do not have time for this now and so will say little apart from bookmarking items for future return. I would like to have a better notion of what systems of incentives we might submit ourselves to in order to be able to make a better claim that our journalism is informing us about the world in the manner in which we need it. This is related to the parallel problem in academia of how peer review should work. In the expensive but essential task of verifying and prioritising news and events for public decision making, how can we make it happen the most efficiently? Can the profit centres be shifted, e.g. from raving social media conspiracy theory and pile-ons to investigation and debate?
Here endeth the framing. Now cometh some stuff.
Derek Debus, The Death-Rattle of Journalism has a nice punchy explication which I would like to bookmark for its slightly macho posturing vibe. This tracks well with some people.
Yet it’s not just the rise of bad-faith shills of unreliable information that is to blame for this. It’s no secret newspapers are dying: facts are expensive, opinions are cheap. So media titans started pushing more and more op-eds and making it less and less clear that they were opinions and not news. The Media became less interested in publishing accurate information and more interested in getting that sweet, sweet ad revenue.
Nathan J. Robinson, The Truth Is Paywalled But The Lies Are Free
The (now defunct) Civil tried some exotic options here.
We build tools to help readers discover and support trusted journalists around the world on a decentralized platform.
One of these tools was a blockchain instrument to publish, preserve and… judge(?) journalists. The explanation of how that latter part was supposed to work was never clear to me and I meant to look into it but they folded before I did.
Erasurebay comes from the other direction with respect to blockchainery. They are a blockchain company that wants to move into the information brokerage business. Their explanation is characteristically abstruse nerdview, so I can’t imagine this going far until they hire someone with a communications degree to produce comprehensible prose for them.
Ethan Zuckerman, The Case for Digital Public Infrastructure
Harnessing past successes in public broadcasting to build community-oriented digital tool
Not a complete solution, but there are some interesting ideas in scroll which bundles ad-less subscriptions to online media for readers.
The latest case study in media analysis is the COVID-19 media response, which I am told was especially maladaptive in the USA. Adam Elkus, The Fish rots from the head, gives a good summary/analysis.
It is an exaggeration to say that fringe weirdos on social media often were more well-informed than people that exclusively evaluated mainstream sources, but not as much of an exaggeration as most would think. And that is not accidental. As Ben Thompson noted, the global COVID-19 response depended on an enormous amount of information developed and shared often in defiance of traditional media (which underrated and even mocked concern about the crisis) and even the Center for Disease Control (which attempted to suppress the critical Seattle Flu Study). The response still depends primarily on transnational networks and often must operate around rather than through official channels.
Taken together, all of this is astounding in both its scope and simultaneity. And it makes a mockery out of the cottage industry developed over the last few years to preserve our collective epistemic health.
Analysts obsessed for years and years over the threat of Russian bots and trolls and Macedonian teenagers to democratic institutions and public life, arguing that misinformation and propaganda spread via social networks would perturb the very fabric of reality and destroy the trust and cohesion necessary for liberal democracy to survive. This concern was responsive to the surface elements of deeper psychological and cultural changes, but it often was hindered by its emphasis on top-down control of computational platforms that eluded control at subjectively appropriate cost. Nonetheless, reasonable people could disagree about the response to the problem but not the actual implicit diagnosis. The diagnosis being that the unraveling of legacy institutions and their capacity to enforce at least the fiction of consensus over underlying facts and values about democratic authority was dangerous and no effort should be spared to fight it.
But as we have seen, these institutions are perfectly capable of unraveling themselves without much help from Russian bots and trolls and Macedonian teenagers. And if the fish rots from the head, then the counter-disinformation effort becomes actively harmful. It seeks to gentrify information networks that could offer layers of redundancy in the face of failures from legacy institutions. It is reliant on blunt and context-indifferent collections of bureaucratic and mechanical tools to do so. It leaves us with a situation in which complicated computer programs on enormous systems and overworked and overburdened human moderators censor information if it runs afoul of generalized filters but malicious politicians and malfunctioning institutions can circulate misleading or outright false information unimpeded. And as large content platforms are being instrumentalized by these same political and institutional entities to combat “fraud and misinformation,” this basic contradiction will continue to be heightened.
Barry Schnitt has the easy job of complaining from the outside about Facebook (therefore anything he says is ambiguously concern trolling), but his incentive framing is tidy.
It has been said that a lie gets halfway around the world before the truth has a chance to put its pants on. Now, Facebook’s speed and reach make it more like a lie circles the globe a thousand times before the truth is even awake. This is no accident. Ironically, the one true conspiracy theory appears to be that malevolent nation-states, short-sighted politicians, and misguided interest groups are using conspiracy theories to deliberately misinform the public as a means of accomplishing their long-term strategic goals. The same could be said for those deliberately using incendiary and divisive language, which is similarly allowed to propagate through your system.
Unfortunately, I do not think it is a coincidence that the choices Facebook makes are the ones that allow the most content— the fuel for the Facebook engine— to remain in the system. I do not think it is a coincidence that Facebook’s choices align with the ones that require the least amount of resources, and the choices that outsource important aspects to third parties. I do not think it is a coincidence that Facebook’s choices appease those in power who have made misinformation, blatant racism, and inciting violence part of their platform. Facebook says, and may even believe, that it is on the side of free speech. In fact, it has put itself on the side of profit and cowardice.
Jonathan Stray, What tools do we have to combat disinformation?
Lead Stories uses the Trendolizer™ engine to detect the most trending stories from known fake news, satire and prank websites and tries to debunk them as fast as possible.
(I think this is publicity/loss leader for Trendolizer, a media buzz product.)
Undark is a non-profit, editorially independent digital magazine exploring the intersection of science and society. It is published with generous funding from the John S. and James L. Knight Foundation, through its Knight Science Journalism Fellowship Program in Cambridge, Massachusetts.
A recent scandal involving Undark shows why independent journalism is not enough; accountable journalism is required.
Would you like an interesting core sample of current media dynamics? Try the Slate Star Codex kerfuffle, an interesting case study.
Stunts with bell curves, a.k.a. Gaussian distributions.
Let’s start here with the basic thing. The (univariate) standard Gaussian pdf
\[ \psi:x\mapsto \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right) \]
We define
\[ \Psi:x\mapsto \int_{-\infty}^x\psi(t) dt \]
More generally we define
\[ \phi(x; \mu ,\sigma ^{2})={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}} \]
In the multivariate case, where the covariance \(\Sigma\) is strictly positive definite we can consider density of the general normal distribution over \(\mathbb{R}^k\) as
\[ \psi({x}; \mu, \Sigma) = (2\pi )^{-{\frac {k}{2}}}\det({ {\Sigma }})^{-{\frac {1}{2}}}\,e^{-{\frac {1}{2}}( {x} -{ {\mu }})^{\!{\top}}{ {\Sigma }}^{-1}( {x} -{ {\mu }})} \]
if a random variable \(Y\) has a Gaussian distribution with parameters \(\mu, \Sigma\), we write
\[Y \sim \mathcal{N}(\mu, \Sigma)\]
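A quick numerical sanity check of this parameterisation: drawing from \(\mathcal{N}(\mu,\Sigma)\) by the classic Cholesky trick. This is a sketch only; the particular mu and Sigma are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])

# Sample Y ~ N(mu, Sigma) via the affine transform Y = mu + L Z,
# where L L^T = Sigma (Cholesky factor) and Z ~ N(0, I).
L = np.linalg.cholesky(Sigma)
Z = rng.standard_normal((100_000, 2))
Y = mu + Z @ L.T
```

The empirical mean and covariance of Y then approach mu and Sigma as the sample grows.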
This erf, or error function, is a rebranding and reparameterisation of the standard univariate normal cdf, popular in computer science; it gives a slightly differently ambiguous name to the already ambiguously named normal cdf.
But I can never remember what it is. There are scaling factors tacked on.
\[ \operatorname{erf}(x) = \frac{1}{\sqrt{\pi}} \int_{-x}^x e^{-t^2} \, dt \]
which is to say
\[\begin{aligned} \Phi(x) &={\frac {1}{2}}\left[1+\operatorname {erf} \left({\frac {x}{\sqrt {2}}}\right)\right]\\ \operatorname {erf}(x) &=2\Phi (\sqrt{2}x)-1\\ \end{aligned}\]
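Since Python’s stdlib ships math.erf, the two conversions above can be checked directly. A minimal sketch, where Phi and erf_from_Phi are my own helper names:

```python
import math

def Phi(x):
    # standard normal cdf via the stdlib error function:
    # Phi(x) = (1 + erf(x / sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def erf_from_Phi(x):
    # the reverse direction: erf(x) = 2 Phi(sqrt(2) x) - 1
    return 2.0 * Phi(math.sqrt(2.0) * x) - 1.0
```

Both directions round-trip to machine precision, which is a decent way to remember where the scaling factors go.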
Done.
What have we here?
Hm. First, trivially, \(\psi'(x)=-\frac{x e^{-\frac{x^2}{2}}}{\sqrt{2 \pi }}.\)
For small \(p\), the quantile function has the useful asymptotic expansion \[ \Phi^{-1}(p) = -\sqrt{\ln\frac{1}{p^2} - \ln\ln\frac{1}{p^2} - \ln(2\pi)} + o(1). \]
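This expansion is easy to check against the stdlib quantile. A sketch, where quantile_asymptotic is my own helper name:

```python
import math
from statistics import NormalDist

def quantile_asymptotic(p):
    # leading-order expansion of Phi^{-1}(p) for small p;
    # note ln ln(1/p^2) = ln(L) where L = ln(1/p^2)
    L = math.log(1.0 / p**2)
    return -math.sqrt(L - math.log(L) - math.log(2.0 * math.pi))

# compare against the stdlib quantile in the deep tail
exact = NormalDist().inv_cdf(1e-10)
approx = quantile_asymptotic(1e-10)
```

At p = 1e-10 the two agree to better than 0.01, consistent with the o(1) error term.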
\[\begin{aligned} \sigma ^2 \phi'(x)+\phi(x) (x-\mu )&=0, \text{ i.e.}\\ L(x) &=(\sigma^2 D+x-\mu)\\ \end{aligned}\]
With initial conditions
\[\begin{aligned} \phi(0) &=\frac{e^{-\mu ^2/(2\sigma ^2)}}{\sqrt{2 \sigma^2\pi } }\\ \phi'(0) &=\frac{\mu}{\sigma^2}\phi(0) \end{aligned}\]
🏗 note where I learned this.
From (Steinbrecher and Shaw 2008) via Wikipedia.
Let us write \(w:=\Psi^{-1}\) to keep notation clear.
\[\begin{aligned} {\frac {d^{2}w}{dp^{2}}} &=w\left({\frac {dw}{dp}}\right)^{2}\\ \end{aligned}\]
With initial conditions
\[\begin{aligned} w\left(1/2\right)&=0,\\ w'\left(1/2\right)&={\sqrt {2\pi }}. \end{aligned}\]
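As a sanity check, the ODE above with these initial conditions can be integrated numerically and compared against a library quantile. A sketch using a hand-rolled RK4 integrator; quantile_via_ode is my own name:

```python
import math
from statistics import NormalDist

def quantile_via_ode(p_target, steps=10_000):
    # RK4 integration of w'' = w (w')^2 for w = Psi^{-1},
    # started from w(1/2) = 0, w'(1/2) = sqrt(2 pi),
    # as a first-order system in (w, v) with v = w'.
    def f(w, v):
        return v, w * v * v
    w, v = 0.0, math.sqrt(2.0 * math.pi)
    h = (p_target - 0.5) / steps
    for _ in range(steps):
        k1w, k1v = f(w, v)
        k2w, k2v = f(w + 0.5 * h * k1w, v + 0.5 * h * k1v)
        k3w, k3v = f(w + 0.5 * h * k2w, v + 0.5 * h * k2v)
        k4w, k4v = f(w + h * k3w, v + h * k3v)
        w += h * (k1w + 2 * k2w + 2 * k3w + k4w) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return w
```

Comparing quantile_via_ode(0.9) against NormalDist().inv_cdf(0.9) shows agreement to several decimal places.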
Botev, Grotowski, and Kroese (2010) notes
\[\begin{aligned} \frac{\partial}{\partial t}\phi(x;t) &=\frac{1}{2}\frac{\partial^2}{\partial x^2}\phi(x;t)\\ \phi(x;0)&=\delta(x-\mu) \end{aligned}\]
Look, it’s the diffusion equation of the Wiener process. Surprise. If you think about this for a while you end up discovering the Feynman–Kac formulae.
Univariate:
\[\begin{aligned} \left\| \frac{d}{dx}\phi_\sigma \right\|_2^2 &= \frac{1}{4\sqrt{\pi}\sigma^3}\\ \left\| \left(\frac{d}{dx}\right)^n \phi_\sigma \right\|_2^2 &= \frac{(2n-1)!!}{2^{n+1}\sqrt{\pi}\sigma^{2n+1}} \end{aligned}\]
As made famous by Wiener processes in finance and Gaussian processes in Bayesian nonparametrics.
See, e.g. these lectures, or Michael I Jordan’s backgrounders.
Special case.
\[ Y \sim \mathcal{N}(X\beta, I) \]
implies
\[ W^{1/2}Y \sim \mathcal{N}(W^{1/2}X\beta, W) \]
For more general transforms you could try polynomial chaos.
Since Gaussian approximations pop up a lot in e.g. variational approximation problems, it is nice to know how to approximate them in probability metrics.
Useful: two Gaussians may be related thusly in Wasserstein-2 distance, i.e. \(W_2(\mu,\nu):=\inf\mathbb{E}(\Vert X-Y\Vert_2^2)^{1/2}\), the infimum taken over couplings with \(X\sim\nu\), \(Y\sim\mu\).
\[\begin{aligned} d&:= W_2(\mathcal{N}(\mu_1,\Sigma_1);\mathcal{N}(\mu_2,\Sigma_2))\\ \Rightarrow d^2&= \Vert \mu_1-\mu_2\Vert_2^2 + \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
In the centred case this is simply
\[\begin{aligned} d&:= W_2(\mathcal{N}(0,\Sigma_1);\mathcal{N}(0,\Sigma_2))\\ \Rightarrow d^2&= \operatorname{tr}(\Sigma_1+\Sigma_2-2(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2})^{1/2}). \end{aligned}\]
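The formula transcribes directly into numpy, with the PSD matrix square root done by eigendecomposition. A sketch; psd_sqrt and gaussian_w2 are my own helper names:

```python
import numpy as np

def psd_sqrt(A):
    # symmetric PSD matrix square root via eigendecomposition
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def gaussian_w2(mu1, S1, mu2, S2):
    # Wasserstein-2 distance between N(mu1, S1) and N(mu2, S2)
    r1 = psd_sqrt(S1)
    cross = psd_sqrt(r1 @ S2 @ r1)  # (S1^{1/2} S2 S1^{1/2})^{1/2}
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return np.sqrt(max(d2, 0.0))
```

In one dimension this reduces to \(\sqrt{(\mu_1-\mu_2)^2+(\sigma_1-\sigma_2)^2}\), which makes a handy test case.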
Pulled from wikipedia:
\[ D_{\text{KL}}(\mathcal{N}(\mu_1,\Sigma_1)\parallel \mathcal{N}(\mu_2,\Sigma_2)) ={\frac {1}{2}}\left(\operatorname {tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)+(\mu_{2}-\mu_{1})^{\mathsf {T}}\Sigma _{2}^{-1}(\mu_{2}-\mu_{1})-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
In the centred case this reduces to
\[ D_{\text{KL}}(\mathcal{N}(0,\Sigma_1)\parallel \mathcal{N}(0, \Sigma_2)) ={\frac {1}{2}}\left(\operatorname{tr} \left(\Sigma _{2}^{-1}\Sigma _{1}\right)-k+\ln \left({\frac {\det \Sigma _{2}}{\det \Sigma _{1}}}\right)\right).\]
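The general KL formula above also transcribes into a few lines of numpy. A sketch; gaussian_kl is my own helper name, and no attempt is made at numerical robustness (Cholesky-based solves would be preferable to explicit inverses and determinants):

```python
import numpy as np

def gaussian_kl(mu1, S1, mu2, S2):
    # KL( N(mu1, S1) || N(mu2, S2) ), dimensions k = len(mu1)
    k = len(mu1)
    S2inv = np.linalg.inv(S2)
    dm = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + dm @ S2inv @ dm - k
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
```

For example, in one dimension, \(D_{\text{KL}}(\mathcal{N}(0,1)\parallel\mathcal{N}(0,2))=\tfrac12(\ln 2-\tfrac12)\).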
Djalil defines both Hellinger distance
\[\mathrm{H}(\mu,\nu) ={\Vert\sqrt{f}-\sqrt{g}\Vert}_{\mathrm{L}^2(\lambda)} =\Bigr(\int(\sqrt{f}-\sqrt{g})^2\mathrm{d}\lambda\Bigr)^{1/2}.\]
and Hellinger affinity
\[\mathrm{A}(\mu,\nu) =\int\sqrt{fg}\mathrm{d}\lambda, \quad \mathrm{H}(\mu,\nu)^2 =2-2A(\mu,\nu).\]
For Gaussians we can find this exactly:
\[\mathrm{A}(\mathcal{N}(m_1,\sigma_1^2),\mathcal{N}(m_2,\sigma_2^2)) =\sqrt{2\frac{\sigma_1\sigma_2}{\sigma_1^2+\sigma_2^2}} \exp\Bigr(-\frac{(m_1-m_2)^2}{4(\sigma_1^2+\sigma_2^2)}\Bigr),\]
In multiple dimensions:
\[\mathrm{A}(\mathcal{N}(m_1,\Sigma_1),\mathcal{N}(m_2,\Sigma_2)) =\frac{\det(\Sigma_1\Sigma_2)^{1/4}}{\det(\frac{\Sigma_1+\Sigma_2}{2})^{1/2}} \exp\Bigl(-\frac{\langle\Delta m,(\Sigma_1+\Sigma_2)^{-1}\Delta m\rangle}{4}\Bigr).\]
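The univariate affinity is a one-liner, and \(\mathrm{H}^2=2-2\mathrm{A}\) gives the Hellinger distance for free. A sketch; hellinger_affinity is my own name and takes standard deviations, not variances:

```python
import math

def hellinger_affinity(m1, s1, m2, s2):
    # A(N(m1, s1^2), N(m2, s2^2)) for univariate Gaussians
    denom = s1**2 + s2**2
    return math.sqrt(2.0 * s1 * s2 / denom) * math.exp(-(m1 - m2)**2 / (4.0 * denom))

def hellinger_distance(m1, s1, m2, s2):
    # H^2 = 2 - 2A
    return math.sqrt(max(2.0 - 2.0 * hellinger_affinity(m1, s1, m2, s2), 0.0))
```

Identical Gaussians have affinity 1 and distance 0, which is the obvious smoke test.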
Things that I think should be noted and filed in an orderly fashion but which I have no time to address right now. Content here will change incessantly; I would advise against linking to it.
netlify/gotell: Netlify Comments is an API and build tool for handling large amounts of comments for JAMstack products.

Byrne Hobart, Sin, Secret, Series A. Every startup needs to know something…:
The brutal math of network effects is that while they’re strong once they get going, they’re hard to get going. Metcalfe’s Law overstates things by assuming that every node of a network is equally valuable, and the actual math is that the value of the network is proportionate to n*log(n). The first few users are the most valuable.…
A social media site might turn out to be the reductio ad absurdum of the brand-as-lie/lie-as-Schelling-Point phenomenon, since the entire point of user interaction on the site is to make the lie true. If a site markets itself as the place where a certain kind of cool person hangs out, and says it boldly enough to the right audience, it becomes exactly that.
A corollary to this is that for you, every social media site peaks in utility right after you join. When I was barely cool enough to qualify for Quora, Quora was pretty cool to me — but to anyone who’d been on the site for six months, Quora was a formerly cool site now populated by lamers.
William Buckner at the Human Systems and Behavior Lab based in the Department of Anthropology at Pennsylvania State University has a blog on conflict in cross-cultural perspective and other fun stuff.
Thomas Lumley visualises data pooling simply and well.
This tool lets you simulate keyboard input and mouse activity, move and resize windows, etc. It does this using X11’s XTEST extension and other Xlib functions. Additionally, you can search for windows and move, resize, hide, and modify window properties like the title. If your window manager supports it, you can use xdotool to switch desktops, move windows between desktops, and change the number of desktops.
A Wayland substitute for xdotool seems to be ydotool.
The Catalogue of Bias:
To obtain the least biased information, researchers must acknowledge the potential presence of biases and take steps to avoid and minimise their effects. Equally, in assessing the results of studies, we must be aware of the different types of biases, their potential impact and how this affects interpretation and use of evidence in healthcare decision making.
Joan Donovan, Research Director of Harvard Kennedy School’s Shorenstein Center on Media, Politics and Public Policy, How Civil Society Can Combat Misinformation and Hate Speech Without Making It Worse
Taiwan’s digital minister sounds interesting.
Anne Applebaum, Performative Authoritarianism.
COVID psychological first aid course.
China requires malware for people doing business there.
The Gumbel trick is ingenious for sampling from things that look like categorical distributions and simplices.
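A minimal sketch of the Gumbel-max version of the trick: perturbing each logit with independent Gumbel(0,1) noise and taking the argmax yields an exact categorical sample. gumbel_max_sample is my own name for the helper:

```python
import math
import random

def gumbel_max_sample(logits, rng):
    # Gumbel-max trick: add -log(-log(U)) noise, U ~ Uniform(0,1),
    # to each logit; the argmax is a draw from Categorical(softmax(logits)).
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    return max(range(len(logits)), key=noisy.__getitem__)
```

With logits [0, 1], index 1 should win with probability e/(1+e), roughly 0.73.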
UTS has a podcast: 'The New Social Contract' is a podcast that examines how the relationship between universities, the state and the public might be reshaped as we live through this global pandemic.
Clearview’s Controversial Facial Recognition AI Automates Mass Surveillance.
calndr.link creates calendar links.
I built calndr.link after I had multiple clients request the exact same thing — a simple and easy way to generate calendar links, for adding to their website or in email newsletters.
There’s a few existing providers out there, but they’re extremely pricey for what they do — take some basic data (title/date/etc) and reformat it into a url for a calendar provider (be it google or apple).
Jan Turowski on Schrebergärten.
Meaningnesss, on wonder (highbrow Insane Clown Posse).
A collection of links advocating learning a new language.
Danny Dorling, Slowdown:The End of the Great Acceleration—and Why It’s Good for the Planet, the Economy, and Our Lives..
Philip Moriarty, Will Fermi and Dirac save us? Probably not.
A paper recently appeared on the arXiv with the ever-so-intriguing title of “Attacking Covid-19 with the Ising-model and the Fermi-Dirac Distribution Function”. […]I’m a big fan of the Ising model — not least because we have extensively used a variant to simulate pattern formation in nanoparticle assemblies for many* years […] Having now read the paper, it’s a little, um, underwhelming, given the rather overstated premise of the title. That’s not to say that it’s not worth reading as an example of how modelling and simulation strategies from condensed matter physics can be translated to social and epidemiological settings.
rufus is a boot-drive maker which is free, open source and maintained, and thus possibly less suspect than some competitors.
Terry Tao’s course in Modern real-variable harmonic analysis.
I keep meaning to bookmark entertaining contrarian Cory Clark.
Parsing text tags into Boolean feature vectors for tensorflow
I would like to read the Kernelized Stein Discrepancy tutorial.
China launches national blockchain.
bitcoin during financial crisis.
This guide to pruning multihead attention NNs should probably go somewhere useful if I actually end up doing NLP like all the recruiters seem to want.
Vlad Alex (Merzmensch), 12 Colab Notebooks that matter: StyleGAN, GPT-2, StyleTransfer, DeOldify, Magenta etc. to try out.
Glitter nail polish for laptop security.
I should lobby citation styles for URL support to prevent the entire project being quaintly mired in woodpulp. Here is where to do it.
Francis Bach calls the continuous approximation to a discrete time optimization a gradient flow. Which of the other uses I know is this the same as?
Statistical Inference via Convex Optimization.
Conjugate functions illustrated.
Francis Bach on the use of geometric sums and a different take by Julyan Arbel.
Tutorial to approximating differentiable control problems. An extension of this is universal differential equations.
https://djtechtools.com/2020/07/14/best-ai-platforms-to-help-you-make-music/
https://www.patreon.com/loudlystudio
OpenAI Jukebox is the latest hot generative music thing that I should be across. I would personally take a rather different approach to theirs to solve this problem, but they are the current benchmark.
I’ve a weakness for ideas that give me plausible deniability for making generative art while doing my maths homework.
NB : this is a fast moving field and I am not staying up to speed with it.
This page is more chaotic than the already-chaotic median, sorry. Good luck making sense of it.
See also analysis/resynthesis.
See gesture recognition. Oh and also google’s AMI channel, and ml4artists, which has some sweet machine learning for artists topic guides.
Many neural networks are generative in the sense that even if you train ’em to classify things, they can also predict new members of the class: e.g. run the model forwards, it recognizes melodies; run it “backwards”, it composes melodies. Or rather, you maybe trained them to generate examples in the course of training them to detect examples.
There are many definitional and practical wrinkles here, and this quality is not unique to artificial neural networks, but it is a great convenience, and the gods of machine learning have blessed us with much infrastructure to exploit this feature, because it is close to actual profitable algorithms. Upshot: There is now a lot of computation and grad student labour directed at producing neural networks which as a byproduct can produce faces, chairs, film dialogue, symphonies and so on.
There are NeurIPS streams about this now.
Some as-yet-unfiled neural-artwork links I should think about.
See those classic images from google’s tripped-out image recognition systems, or Gatys, Ecker and Bethge’s deep art: neural networks do a passable undergraduate Monet.
https://deepdreamgenerator.com/
https://www.artbreeder.com/start
Here’s Frank Liu’s implementation of style transfer in pycaffe.
Alex Graves, Generating Sequences With Recurrent Neural Networks, generates handwriting. Relatedly, sketch-rnn is reaaaally cute.
Deep dreaming approaches are entertaining. (NSFW) Here’s a more pedestrian and slightly more informative version of that.
Distill.pub has some lovely visual explanations of visual and other neural networks.
hardmaru presents an amazing introduction to running sophisticated neural networks in the browser, targeted at artists, which goes over the handwriting post in a non-technical way.
progressive_growing_of_gans is neat, generating infinite celebrities at high resolution. (Karras et al. 2017)
a platform for creators of all kinds to use machine learning tools in intuitive ways without any coding experience. Find resources here to start creating with RunwayML quickly.
In particular it plugs into Blender and Photoshop and allows you to use those programs as a UI for ML-backed algorithms. Nice.
jax (Python) is a successor to the classic python/numpy autograd. It includes various code optimisations: jit-compilation, differentiation and vectorization. So, a numerical library with certain high-performance machine-learning affordances. Note that it is not a deep learning framework per se, but rather the producer species at the lowest trophic level of a deep learning ecosystem. For frameworks built upon it, read on to later sections.
The official pitch:
JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.
What’s new is that JAX uses XLA to compile and run your NumPy programs on GPUs and TPUs. Compilation happens under the hood by default, with library calls getting just-in-time compiled and executed. But JAX also lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API, jit. Compilation and automatic differentiation can be composed arbitrarily, so you can express sophisticated algorithms and get maximal performance without leaving Python.

Dig a little deeper, and you’ll see that JAX is really an extensible system for composable function transformations. Both grad and jit are instances of such transformations. Another is vmap for automatic vectorization, with more to come.

This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying it out, reporting bugs, and letting us know what you think!
AFAICT the conda installation command is
conda install -c conda-forge jaxlib
You don’t know jax is a popular intro.
It has idioms that are not obvious. For me it was not clear how to use batch vectorizing and functional-style application of structures.
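For the batch-vectorizing idiom specifically, the canonical pattern is jax.vmap: write the function for a single example, then lift it over a batch axis. A minimal sketch, assuming jax is installed; square is a toy function of mine:

```python
import jax
import jax.numpy as jnp

def square(x):
    # written for a single scalar input
    return x * x

# vmap lifts it to operate elementwise over a leading batch axis,
# without rewriting the function itself
batched_square = jax.vmap(square)
xs = jnp.arange(4.0)
ys = batched_square(xs)  # [0., 1., 4., 9.]
```

The same composes with jit and grad, which is the whole point of the transformation system.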
One thing I see often in examples is
from jax.config import config
config.enable_omnistaging()
Do I need to care about it? tl;dr: omnistaging is good and necessary, and also switched on by default in recent jax, so that line is simply being careful and is likely unneeded.
Over at Deepmind there is Haiku, which looks nifty. Its documentation is much more complete than flax’s, so I will be auditioning it for my next project.
Related, at least organisationally, is rlax the jax reinforcement-learning library from the same company.
Flax is, I think, the de facto standard jax deep learning library. Documentation is sparse, but a recent design doc seems canonical. The documentation is not especially coherent (e.g. why do some modules assume batching and others not? No hints) but it can more or less be cargo-culted and you can ignore the quirks.
See also the following WIP documentation notebooks
Those answered some questions, but I still have questions left over due to various annoying rough edges and non-obvious gotchas.
For example, if you omit a parameter needed for a given model, the error is `FilteredStackTrace: AssertionError: Need PRNG for "params"`.
There are some good examples in the repository.
With those caveats about documentation, Flax is still not bad, because the underlying JAX debugging experience is transparent and easy.
Numpyro seems to be the dominant probabilistic programming system. It is a jax port/implementation/something of the pytorch classic, Pyro.
More fringe but possibly interesting, jax-md does molecular dynamics. ladax “LADAX: Layers of distributions using FLAX/JAX” does some kind of latent RV something.
The creators of Stheno seem to be Invenia, some of whose staff I am connected to in various indirect ways. It targets jax as one of several backends via a generic backend library, wesselb/lab: A generic interface for linear algebra backends.
Placeholder; details TBD.
Causality on continuous index spaces and, relatedly as it turns out, equilibrium dynamics. Placeholder.
Placeholder.
Terry Tao on Conversions between standard polynomial bases.
Xiu and Karniadakis (2002) mention the following “Well known facts”:
All orthogonal polynomials \(\left\{Q_{n}(x)\right\}\) satisfy a three-term recurrence relation \[ -x Q_{n}(x)=A_{n} Q_{n+1}(x)-\left(A_{n}+C_{n}\right) Q_{n}(x)+C_{n} Q_{n-1}(x), \quad n \geq 1 \] where \(A_{n}, C_{n} \neq 0\) and \(C_{n} / A_{n-1}>0 .\) Together with \(Q_{-1}(x)=0\) and \(Q_{0}(x)=1,\) all \(Q_{n}(x)\) can be determined by the recurrence relation.
It is well known that continuous orthogonal polynomials satisfy the second-order differential equation \[ s(x) y^{\prime \prime}+\tau(x) y^{\prime}+\lambda y=0 \] where \(s(x)\) and \(\tau(x)\) are polynomials of at most second and first degree, respectively, and \[ \lambda=\lambda_{n}=-n \tau^{\prime}-\frac{1}{2} n(n-1) s^{\prime \prime} \] are the eigenvalues of the differential equation; the orthogonal polynomials \(y(x)=\) \(y_{n}(x)\) are the eigenfunctions.
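As a sanity check (my own example, not from the paper), we can verify this numerically for the probabilists’ Hermite polynomials, for which \(s(x)=1\), \(\tau(x)=-x\) and hence \(\lambda_n=-n\tau'=n\):

```python
import numpy as np
from numpy.polynomial import hermite_e as He

# Probabilists' Hermite polynomial He_n solves  y'' - x y' + n y = 0.
n = 5
c = np.zeros(n + 1)
c[n] = 1.0                               # coefficient vector of He_5
x = np.linspace(-2.0, 2.0, 7)
y = He.hermeval(x, c)
dy = He.hermeval(x, He.hermeder(c))      # y'
d2y = He.hermeval(x, He.hermeder(c, 2))  # y''
assert np.allclose(d2y - x * dy + n * y, 0.0)
```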
This list is extracted from a few places including Xiu and Karniadakis (2002).
Family | Orthogonal w.r.t. measure
---|---
Monomial | n/a
Bernstein | n/a
Legendre | \(\operatorname{Unif}([-1,1])\)
Hermite | \(\mathcal{N}(0,1)\)
Laguerre | \(x^{\alpha}\exp(-x), \, x>0\)
Jacobi | \((1-x)^{\alpha }(1+x)^{\beta }\) on \([-1,1]\)
Charlier | Poisson distribution
Meixner | negative binomial distribution
Krawtchouk | binomial distribution
Hahn | hypergeometric distribution
??? | Unit ball
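For instance (my own sketch), the Hermite row of the table can be checked numerically: the probabilists’ Hermite polynomials satisfy \(\mathbb{E}[\operatorname{He}_m(X)\operatorname{He}_n(X)]=n!\,\delta_{mn}\) for \(X\sim\mathcal{N}(0,1)\):

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

# Gauss-HermiteE quadrature: nodes/weights for the weight exp(-x^2/2);
# dividing by sqrt(2*pi) turns it into an expectation under N(0, 1).
x, w = hermegauss(20)

def inner(m, n):
    cm = np.zeros(m + 1); cm[m] = 1.0
    cn = np.zeros(n + 1); cn[n] = 1.0
    return (w * hermeval(x, cm) * hermeval(x, cn)).sum() / sqrt(2 * pi)

assert abs(inner(3, 3) - factorial(3)) < 1e-8   # E[He_3^2] = 3! = 6
assert abs(inner(2, 5)) < 1e-8                  # distinct orders are orthogonal
```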
In which I think about parameterisations and implementations of finite-dimensional energy-preserving operators, a.k.a. orthonormal matrices. A particular nook in the linear feedback process library, closely related to stability in linear dynamical systems, since every orthonormal matrix is the forward operator of a stable (in a certain sense) system.
Uses include maintaining stable gradients in recurrent neural networks (Arjovsky, Shah, and Bengio 2016; Jing et al. 2017; Mhammedi et al. 2017) and efficient normalising flows (Berg et al. 2018; Hasenclever, Tomczak, and Welling 2017).
Also, parameterising stable Multi-Input-Multi-Output (MIMO) delay networks in signal processing.
There is some terminological variation. Some writers say orthogonal matrices (though I prefer to reserve that term for matrices whose columns are mutually orthogonal but not necessarily of unit length), and some say unitary matrices, which implies the matrix is over the complex field rather than the reals, but is otherwise basically the same from my perspective.
Finding an orthonormal matrix is equivalent to choosing a finite orthonormal basis, so any way we can parameterise such a basis gives us an orthonormal matrix.
Citing MATLAB, Nick Higham gives the following two parametric families of orthonormal matrices. These are clearly far from covering the whole space of orthonormal matrices.
\[ q_{ij} = \displaystyle\frac{2}{\sqrt{2n+1}}\sin \left(\displaystyle\frac{2ij\pi}{2n+1}\right) \]
\[ q_{ij} = \sqrt{\displaystyle\frac{2}{n}}\cos \left(\displaystyle\frac{(i-1/2)(j-1/2)\pi}{n} \right) \]
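Both families are easy to check numerically (my own sketch; the indexing \(i,j=1,\dots,n\) is assumed):

```python
import numpy as np

n = 6
i, j = np.mgrid[1:n + 1, 1:n + 1]   # 1-based index grids
# Higham's first family
Q1 = 2 / np.sqrt(2 * n + 1) * np.sin(2 * i * j * np.pi / (2 * n + 1))
# Higham's second family (a DCT-like matrix)
Q2 = np.sqrt(2 / n) * np.cos((i - 0.5) * (j - 0.5) * np.pi / n)
assert np.allclose(Q1.T @ Q1, np.eye(n))
assert np.allclose(Q2.T @ Q2, np.eye(n))
```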
Another one: the matrix exponential of a skew-symmetric matrix is orthogonal. If \(A=-A^{T}\) then \[ \left(e^{A}\right)^{-1}=\mathrm{e}^{-A}=\mathrm{e}^{A^{T}}=\left(\mathrm{e}^{A}\right)^{T} \]
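A quick numerical check of this fact (a sketch using `scipy.linalg.expm`):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
S = A - A.T                 # skew-symmetric: S.T == -S
Q = expm(S)                 # exponential of a skew-symmetric matrix...
assert np.allclose(Q.T @ Q, np.eye(4))  # ...is orthogonal
```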
Have a nearly orthonormal matrix? Berg et al. (2018) give a contraction which moves it closer to an orthonormal matrix:
\[ \mathbf{Q}^{(k+1)}=\mathbf{Q}^{(k)}\left(\mathbf{I}+\frac{1}{2}\left(\mathbf{I}-\mathbf{Q}^{(k) \top} \mathbf{Q}^{(k)}\right)\right) \] which reputedly converges if
\(\left\|\mathbf{Q}^{(0) \top} \mathbf{Q}^{(0)}-\mathbf{I}\right\|_{2}<1\)
(attributed to Björck and Bowie 1971 and Kovarik 1970).
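A sketch implementation of that iteration (the function name and step count are my own choices):

```python
import numpy as np

def orthonormalise(Q, steps=20):
    """Iteratively push Q toward orthonormality; reputedly converges
    when ||Q.T @ Q - I||_2 < 1 (Bjorck & Bowie / Kovarik iteration)."""
    I = np.eye(Q.shape[1])
    for _ in range(steps):
        Q = Q @ (I + 0.5 * (I - Q.T @ Q))
    return Q

rng = np.random.default_rng(1)
Q0, _ = np.linalg.qr(rng.standard_normal((5, 5)))            # exactly orthonormal
Q = orthonormalise(Q0 + 0.01 * rng.standard_normal((5, 5)))  # perturb, then repair
assert np.allclose(Q.T @ Q, np.eye(5), atol=1e-8)
```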
We can apply successive reflections about hyperplanes, the so-called Householder reflections, to the identity matrix to construct a new orthonormal matrix.
\[ H(\mathbf{z})=\mathbf{z}-2\frac{\mathbf{v} \mathbf{v}^{T}}{\|\mathbf{v}\|^{2}} \mathbf{z} \] 🏗
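A sketch: compose a few random Householder reflections \(H = I - 2\mathbf{v}\mathbf{v}^{T}/\|\mathbf{v}\|^{2}\); each is orthogonal, so their product is too:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
Q = np.eye(n)
for _ in range(n):
    v = rng.standard_normal((n, 1))
    H = np.eye(n) - 2 * (v @ v.T) / (v.T @ v)  # reflection about v's hyperplane
    Q = H @ Q
assert np.allclose(Q.T @ Q, np.eye(n))
```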
One obvious method for constructing unitary matrices is composing Givens rotations. 🏗
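A sketch of that construction (the helper below is my own, rotating in each coordinate plane by a random angle):

```python
import numpy as np

def givens(n, i, j, theta):
    """Rotation by theta in the (i, j) coordinate plane of R^n."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = c; G[j, j] = c
    G[i, j] = -s; G[j, i] = s
    return G

rng = np.random.default_rng(3)
n = 4
Q = np.eye(n)
for i in range(n):
    for j in range(i + 1, n):
        Q = Q @ givens(n, i, j, rng.uniform(0, 2 * np.pi))
assert np.allclose(Q.T @ Q, np.eye(n))
```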
Nick Higham has a compact introduction to random orthonormal matrices and especially the Haar measure, which is a distribution over such matrices with natural invariances.
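The standard recipe discussed in that post, sketched here under my own naming, is to take the QR decomposition of a Gaussian matrix and fix the signs of \(R\)’s diagonal so that the resulting distribution is exactly Haar:

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Haar-distributed orthogonal matrix: QR of a Gaussian matrix,
    with column signs fixed by the diagonal of R."""
    Z = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))  # scale column j by sign(R[j, j])

Q = haar_orthogonal(5, np.random.default_rng(4))
assert np.allclose(Q.T @ Q, np.eye(5))
```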
My own git notes. See also the more universally acclaimed classic git tips.
See the fastai masterclass for many more helpful tips/links/scripts/recommendations.
During a merge, git checkout --theirs filename
(or --ours
) will checkout respective their, or my version.
The following sweet hack will resolve all conflicted files accordingly:
grep -lr '<<<<<<<' . | xargs git checkout --theirs
TODO: I can find conflicted files using git natively without grep; `git diff --name-only --diff-filter=U` should do it.
Easy, except for the abstruse naming: it is called “pickaxe” and spelled `-S`.
git log -Sword
git clone --single-branch --branch <branchname> <remote-repo>
git rm --cached blah.tmp
Ignoring macOS `.DS_Store` files:
echo .DS_Store >> $HOME/.gitignore_global
git config --global core.excludesfile $HOME/.gitignore_global
git push <remote_name> --delete <branch_name>
git push origin HEAD:refs/heads/backdoor
This is almost obvious, except that the git naming of things seems arbitrary.
Why refs/heads/SOMETHING
? Well…
By which I mean the things formally known as git references.
git references is the canonical description of the mechanics here.
tl;dr the most common names are `refs/heads/SOMETHING` for branch SOMETHING, `refs/tags/SOMETHING` for tag SOMETHING, and `remotes/SOMEREMOTE/SOMETHING` for the (last known) state of a remote branch.
As alexwlchan explains, these references are friendly names for commits.
The uses are (at least partly) convention and other references can be used too.
For example gerrit
uses refs/for/
for code review purposes.
Commands applied to your files on the way in and out of the repository.
Keywords: `smudge`, `clean`, `.gitattributes`.
These are a long story, but not so complicated in practice.
A useful one is stripping crap from jupyter notebooks.
For doing stuff before you put it in cold storage. For me this means, e.g., asking DID YOU REALLY WANT TO INCLUDE THAT GIANT FILE?
Here is a commit hook that does exactly that. I made a slightly modernized version:
curl -L https://gist.github.com/danmackinlay/6e4a0e5c38a43972a0de2938e6ddadba/raw/install.sh | bash
After that installation you can retrofit the hook to an existing repository thusly
cp -R ~/.git_template/hooks .git/
There are various frameworks for managing hooks, if you have lots.
For example,
pre-commit is a mini-system for managing git hooks, based on python.
Husky is a node.js
-based one.
I am not sure whether hook management systems actually save time overall for a solo developer, since the kind of person who remembers to install a pre-commit hook is also the kind of person who is relatively less likely to need one. Also, it is remarkably labour-intensive to install the dependencies for all these systems, so if you are working across heterogeneous systems this becomes tedious.
Sub-projects inside other projects? External projects? The simplest way of integrating external projects is as subtrees. Once this is set up you can mostly ignore them. Alternatively there are submodules, which have various complications.
Alternatively there is the subtrac
system, which I have not yet used.
Creatin’:
git fetch remote branch
git subtree add --prefix=subdir remote branch --squash
Updatin’:
git fetch remote branch
git subtree pull --prefix=subdir remote branch --squash
git subtree push --prefix=subdir remote branch --squash
Con: Rebasin’ with a subtree in your repo is slow and involved.
Use subtree split
to prise out one chunk. It
has some wrinkles
but is fast and easy.
pushd superproject
git subtree split -P project_subdir -b project_branch
popd
mkdir project
pushd project
git init
git pull ../superproject project_branch
Alternatively, to comprehensively rewrite history to exclude everything outside a subdir:
pushd superproject
cd ..
git clone superproject subproject
pushd subproject
git filter-branch \
--subdirectory-filter project_subdir \
--prune-empty -- \
--all
Including external projects as separate repositories within a repository is also possible, but I won’t document it here, since it’s well documented elsewhere, and I use it less. NB: much discipline is required to make it go.
Have not yet tried.
subtrac
is a helper tool that makes it easier to keep track of your git submodule contents. It collects the entire contents of the entire history of all your submodules (recursively) into a separate git branch, which can be pushed, pulled, forked, and merged however you want.
This works for GitHub at least; I think it works for anything running `git-svn`?
svn co https://github.com/buckyroberts/Source-Code-from-Tutorials/trunk/Python
gerrit
Gerrit is a code review system for git.
legit
`legit` simplifies feature branch workflows.
rerere
Not repeating yourself during merges? git rerere automates this:
git config --global rerere.enabled true
git config --global rerere.autoupdate true
git checkout my_branch -- my_file/
In brief, this will purge a lot of stuff from a constipated repo in emergencies:
git reflog expire --expire=now --all && git gc --prune=now
bfg
does that:
git clone --mirror git://example.com/some-big-repo.git
java -jar bfg.jar --strip-blobs-bigger-than 10M some-big-repo.git
cd some-big-repo.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push
I think bfg
also does this. There is also native support:
git filter-branch -f \
--index-filter
'git rm -r --cached --ignore-unmatch unwanted_files'
e.g. you are editing a git repo on NTFS via Linux and things are silly.
git config core.filemode false
if output=$(git status --porcelain) && [ -z "$output" ]; then
# Working directory clean
else
# Uncommitted changes
fi
Oh crap I’m leaving the office in a hurry and I just need to get my work into git ASAP for continuing on another computer. I don’t care about sensible commit messages because I am on my own private branch and no-one else will see them when I squash the pull request.
I put this little script in a file called `gitbang` to automate this case.
#!/usr/bin/env bash
# I’m leaving the office. Capture all changes in my private branch and push to server.
if output=$(git status --porcelain) && [ -z "$output" ]; then
echo "nothing to commit"
else
git add --all && git commit -m bang
fi
git pull && git submodule update --init --recursive && git push
Tools such as git-latexdiff provide custom diffing for, in this case, LaTeX
code.
These need to be found on a case-by-case basis.
Managing SSH credentials in git is non-obvious. See SSH.
For sanity in git+jupyter, see jupyter
.
See Git GUIs.
For fish
and bash
shell, see
bash-git-prompt.
See data versioning.
The intersection of linear dynamical systems and stability of dynamic systems.
There is not much content here because I spent 2 years working on it and am too traumatised to revisit it.
Informally, I am admitting as “stable” any dynamical system which does not explode super-polynomially fast; we can think of these as systems where, if the system is not stationary, then at least the rate of change might be.
Energy-preserving systems are a special case of this.
There are many problems I am interested in that touch upon this.
In the univariate, discrete-time case, in discrete-time linear systems terms, these are systems that have no poles outside the unit circle, but which might have poles on the unit circle. In continuous time, these are systems with no poles having positive real part. For finitely realizable systems this boils down to tracking trigonometric polynomial roots, as in Megretski (2003).
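For a concrete discrete-time example (my own sketch), an AR(\(p\)) recurrence is stable in this sense iff the roots of its characteristic polynomial lie in the closed unit disc:

```python
import numpy as np

def is_stable(coeffs, tol=1e-12):
    """Check x_t = a1 x_{t-1} + ... + ap x_{t-p} for (marginal) stability:
    all poles, i.e. roots of z^p - a1 z^{p-1} - ... - ap,
    must lie in the closed unit disc."""
    poly = np.concatenate(([1.0], -np.asarray(coeffs, dtype=float)))
    return bool(np.all(np.abs(np.roots(poly)) <= 1 + tol))

assert is_stable([0.5, 0.3])        # poles near 0.85 and -0.35: decays
assert not is_stable([1.8, -0.5])   # a pole near 1.46: explodes
```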
A more general trick, e.g. for multivariate functions, is reparameterisation. This Betancourt podcast on Sarah Heaps’ paper [HeapsEnforcing2020] on parameterising stationarity in vector autoregressions is deep and IMO points the way to some other neat tricks in neural nets. She constructs interesting priors for this case, using some reparametrisations by Ansley and Kohn (1986).
TBC.
What if we are incrementally learning a system and wish the gradient descent steps not to push it away from stability? In such a case, we can possibly side-step the problem by using a topology which maximises system stability (Laroche 2007).
Are you an inveterate remixer? Here I note repositories of legally available content for remixing and mashing up. (Check for remix rights in your local jurisdiction.)
This is what I actually do with my life, so I know too much of it to fit here. See sample libraries, musical corpora.
In descending order of addictiveness:
One of the many wonderful features of the Internet Public Library is not its browsing page, which is a mess. I laboriously opted in to all the old (hopefully uncopyrighted) books by clicking checkboxes.
Comments
Comment syntax:
This is not obvious because they have a literate-programming thing going on, where some cells can be text cells; which is fine, but sometimes I need inline comments.