(Kernelized) Stein variational gradient descent

KSD, SVGD, other computational Stein discrepancy methods

2022-11-02 — 2024-05-30

approximation

Bayes

functional analysis

Markov processes

measure

metrics

Monte Carlo

optimization

particle

probabilistic algorithms

probability

score function

statistics


Figure 2: Q. Liu (2016a)’s diagram of the relationships between this method, kernel methods, Stein’s method, variational inference, maximum mean discrepancy and Fisher information. Bonus points for coining *Steinalization*.

Stein’s method meets variational inference via kernels and probability measures. The result is a method of inference that maintains an ensemble of particles which notionally collectively sample from some target distribution. I should learn about this, as one of the methods I might use for low-assumption Bayes inference. This seems to have been invented in Q. Liu, Lee, and Jordan (2016) and Chwialkowski, Strathmann, and Gretton (2016), weaponized in Q. Liu (2016b).

There seems to be a standard way of introducing the tools, which I find very confusing. Here I work through that standard with laborious worked examples, so that I can internalize the necessary intuitions for all this.

For a more comprehensive introduction (albeit brusquer), see Anastasiou et al. (2023), which combines a whole bunch of recent developments with consistent notation.

For what it is worth, I found Chwialkowski, Strathmann, and Gretton (2016) to be the easiest read, although none of them was pedagogically ideal, which is why I wrote this note.

Let us introduce the bits we need.

1 Stein operators

We start with the classic Stein’s identity which turns out to be a useful trick for quantifying how well we have approximated some density.

Spoiler: later on it turns out that we can even use this as a target loss in order to improve how well we have approximated some density.

We care about a target density $p$ and another density $q$ (which will end up approximating it). $p$ needs to be differentiable for this to work. Both are densities over, by assumption, $\mathcal{X} \subseteq \mathbb{R}^d$. We also introduce a family $\mathcal{F}$ of test functions from $\mathbb{R}^d$ to $\mathbb{R}^d$. We require that $\lim_{x \to \pm\infty} p(xb) f(xb) = 0$ for $\|b\| = 1$, and some other stuff which we get to in a moment.

Should say more about the generic class $\mathcal{F}$. Q. Liu, Lee, and Jordan (2016) does.

Next, we choose a Stein operator $\mathcal{A}_x : \mathcal{F} \to \mathcal{G}$. $\mathcal{F}$ and $\mathcal{G}$ are spaces of functions from $\mathbb{R}^d$ to $\mathbb{R}$. I gave them different names because it is not clear to me that they are necessarily the same space, but we can probably ignore that detail for now. We do not go into the details of the requirements of the spaces, but they should be smooth functions that are square-integrable (with respect to the target distribution $p$? or Lebesgue measure? something else?) whose derivatives also go to zero at infinity.

For an operator $\mathcal{A}_x$ to be a Stein operator for a target distribution $p(x)$, it must satisfy the following key property:

Stein operators: $\mathcal{A}_x$ is a Stein operator with respect to a suitable class of test functions $\mathcal{F}$ and target distribution $p(x)$ if the expectation of $\mathcal{A}_x f(x)$ under $p$ is zero, i.e. for all $f$ in $\mathcal{F}$,

$$\mathbb{E}_{X \sim p}[\mathcal{A}_X f(X)] = 0. \tag{1}$$

For $\mathcal{F}$ which include a non-trivial linear subspace, we can see that $\mathcal{A}_x$ must be linear, because expectation is linear, and otherwise we could make linear changes to an $f$ and end up violating the equality.

A popular choice is to set the Stein operator to

$$\mathcal{A}_x f(x) := f(x) \cdot \nabla_x \log p(x) + \nabla_x \cdot f(x). \tag{2}$$

Anastasiou et al. (2023) call this the Langevin Stein Operator, and it seems that if you do not otherwise specify, this is the one you get. The Langevin Stein operator makes this into a score-based method — See C. Liu et al. (2019) for some deep theory about that.
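To make the definition concrete, here is a minimal JAX sketch of the Langevin Stein operator for a vector-valued test function, with a Monte Carlo check that its expectation under the target is near zero. The helper name `langevin_stein`, the standard normal target and the `tanh` test function are illustrative assumptions, not anything prescribed by the references.

```python
import jax
import jax.numpy as jnp

def langevin_stein(f, log_p):
    """Return x -> f(x) . grad log p(x) + div f(x) for a vector field f."""
    score = jax.grad(log_p)
    jac_f = jax.jacfwd(f)

    def A_f(x):
        # The divergence of f is the trace of its Jacobian
        return f(x) @ score(x) + jnp.trace(jac_f(x))

    return A_f

# Hypothetical target: standard normal on R^2 (log density up to a constant)
def log_p(x):
    return -0.5 * jnp.sum(x ** 2)

# Hypothetical smooth, bounded test function R^2 -> R^2
def f(x):
    return jnp.tanh(x)

key = jax.random.PRNGKey(0)
samples = jax.random.normal(key, (20_000, 2))  # X ~ p

estimate = float(jnp.mean(jax.vmap(langevin_stein(f, log_p))(samples)))
print(estimate)  # should be close to 0, up to Monte Carlo error
```

Only the gradient of `log_p` enters, so an unnormalised log density is enough here.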

Example time! Let us make this more concrete, by choosing a specific $p$ which is not trivial but not baffling either. I reckon a 2d Gaussian with standard deviation 1 and mean 0 will do the trick. Let us give it a correlation $\rho$, which we leave unspecified, to keep things spicy. This implies mean $\mu = (0, 0)$ and covariance $\Sigma = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$, and thence inverse covariance $\Sigma^{-1} = \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix}$. The pdf for this distribution is

$$\begin{aligned}
p(x) &= \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu)\right) \\
&= \frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \frac{1}{1 - \rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) \\
&= \frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2(1 - \rho^2)} \left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\right).
\end{aligned}$$

We can simplify the Langevin Stein operator for this choice of $p$, since

$$\begin{aligned}
\nabla_x \log p(x) &= \nabla_x \log\left(\frac{1}{2\pi \sqrt{1 - \rho^2}} \exp\left(-\frac{1}{2(1 - \rho^2)} \left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\right)\right) \\
&= \nabla_x \left(-\frac{1}{2(1 - \rho^2)} \left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\right) \\
&= -\frac{1}{2(1 - \rho^2)} \nabla_x \left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right) \\
&= -\frac{1}{1 - \rho^2} \begin{bmatrix} x_1 - \rho x_2 \\ x_2 - \rho x_1 \end{bmatrix}.
\end{aligned}$$

Note the minus sign: the score of a zero-mean Gaussian is $-\Sigma^{-1} x$, which points back towards the mode.
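As a sanity check on the sign and scaling, the closed-form score of a zero-mean Gaussian, $-\Sigma^{-1} x$, can be compared against automatic differentiation. This is a minimal sketch; the values of `rho` and `x` are arbitrary choices for illustration.

```python
import jax
import jax.numpy as jnp

rho = 0.3  # arbitrary illustrative correlation

def log_p(x):
    # Log density of the bivariate Gaussian with unit variances and correlation rho
    quad = x[0] ** 2 - 2 * rho * x[0] * x[1] + x[1] ** 2
    return -0.5 * quad / (1 - rho ** 2) - jnp.log(2 * jnp.pi * jnp.sqrt(1 - rho ** 2))

def score_closed_form(x):
    # -Sigma^{-1} x written out component-wise
    return -jnp.array([x[0] - rho * x[1], x[1] - rho * x[0]]) / (1 - rho ** 2)

x = jnp.array([0.7, -1.2])
score_autodiff = jax.grad(log_p)(x)
print(score_autodiff, score_closed_form(x))
```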

Equation 1 then comes out as follows. Our $f$ will be scalar-valued, so we read the dot product and the divergence in Equation 2 as sums over coordinates, which amounts to applying the operator to the vector field $(f, f)$:

$$\begin{aligned}
\mathbb{E}_{X \sim p}[\mathcal{A}_x f(X)] &= \mathbb{E}_{X \sim p}\left[f(X) \cdot \nabla_x \log p(X) + \nabla_x \cdot f(X)\right] \\
&= \mathbb{E}_{X \sim p}\left[-\frac{f(X)}{1 - \rho^2} (X_1 - \rho X_2 + X_2 - \rho X_1) + \partial_{X_1} f(X) + \partial_{X_2} f(X)\right] \\
&= \mathbb{E}_{X \sim p}\left[-\frac{f(X)(1 - \rho)}{1 - \rho^2} (X_1 + X_2) + \partial_{X_1} f(X) + \partial_{X_2} f(X)\right] \\
&= \mathbb{E}_{X \sim p}\left[-f(X) \frac{X_1 + X_2}{1 + \rho} + \partial_{X_1} f(X) + \partial_{X_2} f(X)\right].
\end{aligned}$$

We can choose some simple $\mathcal{F}$ for the purposes of intuition building, e.g. the linear set $\mathcal{F} := \{f(x) = a x_1 + b x_2 + c;\ a, b, c \in \mathbb{R}\}$ (we would normally use something a bit more interesting). The expectation of the Stein operator for our bivariate Gaussian for this function class is then

$$\begin{aligned}
\mathbb{E}_{X \sim p}[\mathcal{A}_x f(X)] &= \mathbb{E}_{X \sim p}\left[-f(X) \frac{X_1 + X_2}{1 + \rho} + \partial_{X_1} f(X) + \partial_{X_2} f(X)\right] \\
&= \mathbb{E}_{X \sim p}\left[-(a X_1 + b X_2 + c) \frac{X_1 + X_2}{1 + \rho} + a + b\right] \\
&= -\frac{a(1 + \rho) + b(1 + \rho)}{1 + \rho} + a + b \\
&= 0,
\end{aligned}$$

where the third line uses $\mathbb{E}[X_i] = 0$, $\mathbb{E}[X_i^2] = 1$ and $\mathbb{E}[X_1 X_2] = \rho$.

Phew! OK that worked. I itch to plot these functions; I think there are two quantities of interest: the first is the function $f$ itself, and the second is the Stein operator applied to $f$, weighted by the density $p$.

Code

import jax.numpy as jnp
import numpy as np
from jax import grad, vmap
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "none"

# Plot params
n = 61

# Define the scalar function f
def f(x, a, b, c):
    return a * x[..., 0] + b * x[..., 1] + c

# Define the log-density of the generic probability density function p
def log_p(x, rho):
    return -0.5 * (
        x[..., 0]**2 + x[..., 1]**2
         - 2 * rho * x[..., 0] * x[..., 1]
    ) / (
        1 - rho**2
    ) - jnp.log(
        2 * jnp.pi * jnp.sqrt(1 - rho**2)
    )

# Define the Stein operator applied to some f and p
def A_x_f(x, f, log_p):
    grad_log_p = vmap(grad(log_p))(x)
    grad_f = vmap(grad(f))(x)
    return grad_log_p.sum(axis=-1) * f(x) + grad_f.sum(axis=-1)

# Fix specific values for rho and the parameters of f
rho = 0.3
a, b, c = 0.4, 0.25, -0.5
x1min, x1max = -3, 3
x2min, x2max = -3, 3

f_specific = lambda x: f(x, a, b, c)
log_p_specific = lambda x: log_p(x, rho)

# Create a grid of points
x1, x2 = np.meshgrid(
    np.linspace(x1min, x1max , n, endpoint=True),
    np.linspace(x2min, x2max, n, endpoint=True)
)
x = np.stack([x1, x2], axis=-1).reshape(-1, 2)

# Compute the function f at each point in x
f_x = f_specific(x).reshape(x1.shape)
p_x = np.exp(log_p_specific(x)).reshape(x1.shape)

# Compute the Stein operator for f at each point in x
A_x_f_x = A_x_f(x, f_specific, log_p_specific).reshape(x1.shape)
p_A_x_f_x = A_x_f_x * p_x

# Determine the z range with a margin
z_min = np.min(p_A_x_f_x) - 0.1
z_max = np.max(p_A_x_f_x) + 0.1

# Create the 3D surface plot
fig = go.Figure()

# Add the surface plot for the Stein operator coloured by the density p_x
fig.add_trace(
    go.Surface(
        z=p_A_x_f_x,
        x=x1,
        y=x2,
        surfacecolor=p_x,
        colorscale='Viridis',
        showscale=False,  # Remove the colour bar
        opacity=0.9,  # slightly transparent
        name='<i>p A<sub>x</sub> f</i>'  # Add name for legend
    )
)

# Add the contour plot for f on the same axes, with a different colour scheme and semi-transparent
fig.add_trace(
    go.Surface(
        z=f_x,
        x=x1,
        y=x2,
        colorscale='Cividis',
        showscale=False,
        opacity=0.5,  # make this semi-transparent
        name='f',  # Add name for legend
        contours={
            "z": {
                "show": True,
                "start": np.min(f_x),
                "end": np.max(f_x),
                "size": (np.max(f_x) - np.min(f_x)) / 10,
                "color":"white",
            }
        }
    )
)

# Set the layout with an initial camera view closer to the z=0 plane
fig.update_layout(
    title='<i>p A<sub>x</sub> f</i>&nbsp;and&nbsp;<i>f</i>',
    scene=dict(
        xaxis=dict(title='x<sub>1</sub>'),
        yaxis=dict(title='x<sub>2</sub>'),
        zaxis=dict(
            # title='p A<sub>x</sub> f, f',
            range=[z_min, z_max]),
        camera=dict(
            eye=dict(x=1.25, y=-1.25, z=0.5)  # Lower down closer to the z=0 plane
        )
    ),
    width=800,
    height=800,
    font=dict(family="Alegreya, serif"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    ## legends don’t work on 3d contours
    # showlegend=True,  # Show legend
    # legend=dict(
    #     x=0.02,  # Position the legend on the left
    #     y=0.98,
    #     bgcolor='rgba(255,255,255,0.7)',  # Semi-transparent background for better visibility
    #     bordercolor='Black',
    #     borderwidth=1
    # )
)
# Show the plot
fig.show()

Figure 3: A sorta-kinda visualization of the Stein operator for a 2d Gaussian and a linear test function. The translucent, grey surface is the function $f(x) = 0.4 x_1 + 0.25 x_2 - 0.5$, and the blue-green one is $p(x) \mathcal{A}_x f(x)$.

Did that help us? Well, kinda. It is not really clear to me that I should trust that the second figure should actually integrate to 0. Did it?

p_A_x_f_x.sum().item()*(x1max-x1min)*(x2max-x2min)/(p_A_x_f_x.size)

0.01856179405239057

Hm, not convincingly exactly 0, but not so far off that we cannot persuade ourselves that it is simply a truncation problem.
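One way to probe the truncation explanation is to repeat the Riemann sum on progressively wider grids and watch whether the result shrinks towards zero. The following is a minimal sketch using the same $\rho$, $a$, $b$, $c$ as the plotting code; the helper names `p_times_stein` and `truncated_integral` and the grid resolution are illustrative choices, not part of the original code.

```python
import jax.numpy as jnp
from jax import grad, vmap

rho = 0.3
a, b, c = 0.4, 0.25, -0.5

def f(x):
    return a * x[0] + b * x[1] + c

def log_p(x):
    quad = x[0] ** 2 - 2 * rho * x[0] * x[1] + x[1] ** 2
    return -0.5 * quad / (1 - rho ** 2) - jnp.log(2 * jnp.pi * jnp.sqrt(1 - rho ** 2))

def p_times_stein(x):
    # p(x) * (f(x) * sum_i d_i log p(x) + sum_i d_i f(x)), as in the plotted surface
    stein = f(x) * grad(log_p)(x).sum() + grad(f)(x).sum()
    return jnp.exp(log_p(x)) * stein

def truncated_integral(half_width, n=241):
    # Plain Riemann sum over the square [-half_width, half_width]^2
    grid = jnp.linspace(-half_width, half_width, n)
    x1, x2 = jnp.meshgrid(grid, grid)
    pts = jnp.stack([x1.ravel(), x2.ravel()], axis=-1)
    cell_area = (grid[1] - grid[0]) ** 2
    return float(vmap(p_times_stein)(pts).sum() * cell_area)

for half_width in (3.0, 4.0, 6.0):
    print(half_width, truncated_integral(half_width))
```

If truncation is the culprit, the printed values should decay rapidly as the half-width grows, since the neglected boundary terms scale with the Gaussian tail.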

2 Stein discrepancy

We make Equation 1 into a quantity that depends on two, potentially-different densities by taking the expectation over a different density $q$ than the one that generated the operator $\mathcal{A}_x$, and seeing if that does something useful:

$$\mathbb{E}_{X \sim q}[\mathcal{A}_x f(X)] = 0 \tag{3}$$

Spoiler: it turns out that this does do something useful.

$$\begin{aligned}
\mathbb{E}_{X \sim q}[\mathcal{A}_x f(X)] &= \mathbb{E}_{X \sim q}[\mathcal{A}_x f(X)] - \overbrace{\mathbb{E}_{X \sim q}[\mathcal{A}_q f(X)]}^{= 0} \\
&= \mathbb{E}_{X \sim q}\left[f(X) \cdot \nabla_x \log p(X) + \nabla_x \cdot f(X) - f(X) \cdot \nabla_x \log q(X) - \nabla_x \cdot f(X)\right] \\
&= \mathbb{E}_{X \sim q}\left[f(X) \cdot \nabla_x \log p(X) - f(X) \cdot \nabla_x \log q(X)\right] \\
&= \mathbb{E}_{X \sim q}\left[f(X) \cdot \delta_{p,q}(X)\right]
\end{aligned}$$

where $\mathcal{A}_q$ denotes the Langevin Stein operator built from $q$ rather than $p$ (so its expectation under $q$ vanishes by Equation 1), and $\delta_{p,q}(x) := \nabla_x \log p(x) - \nabla_x \log q(x)$ is the difference in score function between $p$ and $q$.

By choosing an $f$ from some sufficiently rich $\mathcal{F}$ we can make this non-zero unless $p = q$ a.e., so this equation tells us something about how distinct are two densities $p$ and $q$, in this slightly weird but credible-seeming sense where we care about the difference in their score functions, i.e. this is some kind of score matching method.
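The identity can be checked numerically in a toy case. The sketch below assumes univariate $p = N(0, 1)$ and $q = N(m, 1)$ with the test function $f(x) = x$, all of which are illustrative choices; in that case both sides should equal $-m^2$.

```python
import jax
import jax.numpy as jnp

m = 1.0  # hypothetical mean shift between q and p

score_p = lambda x: -x        # score of p = N(0, 1)
score_q = lambda x: -(x - m)  # score of q = N(m, 1)
f = lambda x: x               # illustrative scalar test function
df = jax.vmap(jax.grad(f))    # derivative of f, applied to each sample

key = jax.random.PRNGKey(0)
x = m + jax.random.normal(key, (100_000,))  # samples from q

# Left-hand side: E_q[A_x f], with the operator built from p
lhs = float(jnp.mean(f(x) * score_p(x) + df(x)))
# Right-hand side: E_q[f * (score_p - score_q)]
rhs = float(jnp.mean(f(x) * (score_p(x) - score_q(x))))
print(lhs, rhs)  # both should be near -m**2, up to Monte Carlo error
```

Setting `m = 0.0` makes $p = q$, and both estimates should then hover around zero.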

This looks neat. How can we calculate it in practice? Obstacle: we have not specified $f$. We could fix some $f$ and use it to measure how different are $p$ and $q$ in some sense. Or we could choose some stochastic process which generates some random $f$s and estimate it over many $f$s, I guess? I assume that has been done.

The Stein discrepancy takes a strong approach to controlling those $f$s: we control the supremum of that difference over all $f$ in some function class $\mathcal{F}$, so that we know that the difference between $p$ and $q$ is not too bad for any $f$, since if we have found this Stein discrepancy, we have found how bad it is over the worst $f$:

$$\sqrt{S(q, p)} = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim q}\left[\operatorname{trace}(\mathcal{A}_x f(X))\right] \right|$$

Notice we snuck in a trace there as well to make it a scalar? This ended up being the most confusing thing for me; how many dimensions even is anything in this equation?

Let us consider how we might find this ‘worst’ $f$ which gives us this most powerful guarantee of the difference between $p$ and $q$. There are a few steps.

First, we use the linearity of that Stein operator $\mathcal{A}_x$, mentioned earlier. Suppose that $f$ can be represented as a finite linear combination $f(x) = \sum_i w_i f_i(x)$ of a set of basis functions $f_i(x)$ for some coefficients $w_i$ s.t. $\|w\| \leq 1$. Then we can define the ‘violation of Stein-ness’ by

$$\mathbb{E}_q[\mathcal{A}_x f] = \mathbb{E}_q\left[\mathcal{A}_x \sum_i w_i f_i(x)\right] = \sum_i w_i \beta_i,$$

where

$$\beta_i = \mathbb{E}_{x \sim q}\left[\mathcal{A}_x f_i(x)\right].$$

This only works for univariate densities, so far. To make the discrepancy a scalar even for multivariate problems (in the sense of densities over multidimensional spaces) we define the violation as

$$\mathbb{E}_{X \sim q}\left[\operatorname{trace}(\mathcal{A}_x f(X))\right] = \mathbb{E}_{X \sim q}\left[(s_p(X) - s_q(X))^\top f(X)\right],$$

where $s_p := \nabla_x \log p$ and $s_q := \nabla_x \log q$ are the score functions, so that $s_p - s_q$ is the score difference $\delta_{p,q}$ from before.

The optimal (i.e. greatest _mis_match) coefficients $w_i$ would then be the solution to

$$\max_w \sum_i w_i \beta_i, \quad \text{s.t. } \|w\| \leq 1$$

OK, so this is notionally an optimisation problem we can solve, choosing the $w_i$ values to be as terrible as possible, and then seeing how bad the most-terrible values are.
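If the constraint is read as a Euclidean norm bound on $w$ (an assumption; the norm is not pinned down above), then for a fixed finite basis the maximisation has a closed form by the Cauchy-Schwarz inequality: $w^\star = \beta / \|\beta\|$, with optimal value $\|\beta\|$. Here is a minimal sketch, using a hypothetical three-element basis and Monte Carlo estimates of the $\beta_i$ for univariate $p = N(0, 1)$ and $q = N(1, 1)$.

```python
import jax
import jax.numpy as jnp

score_p = lambda x: -x                # score of p = N(0, 1)
basis = [jnp.sin, jnp.cos, jnp.tanh]  # hypothetical basis functions f_i

def stein_1d(f):
    # Univariate Langevin Stein operator: f(x) * score_p(x) + f'(x)
    return lambda x: f(x) * score_p(x) + jax.grad(f)(x)

key = jax.random.PRNGKey(0)
x_q = 1.0 + jax.random.normal(key, (50_000,))  # samples from q = N(1, 1)

# beta_i = E_q[A_x f_i], estimated by Monte Carlo
beta = jnp.array([jnp.mean(jax.vmap(stein_1d(f))(x_q)) for f in basis])

w_star = beta / jnp.linalg.norm(beta)  # maximiser under ||w||_2 <= 1
worst_case = float(w_star @ beta)      # equals ||beta||_2
print(beta, w_star, worst_case)
```

The awkward part is therefore not the optimisation over $w$ for a given basis, but choosing a basis rich enough to detect any difference between $p$ and $q$.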

The distances arising from these are apparently integral probability metrics, according to Anastasiou et al. (2023).

However, it looks like a nested optimisation problem, which can be tedious. Can we do better?

3 Kernelized Stein Discrepancy

When we see a challenge of this kind, where we wish we could use ‘more tricks’ in our function space, it typically suggests that the trick we are looking for might be the kernel trick. This entails choosing the tricky function class $\mathcal{F}$ to be a reproducing kernel Hilbert space (“RKHS”) and seeing what that does to the problem, which we call kernelising. Frequently that makes life easier. Spoiler: it helps here too.

So, how kernelized Stein discrepancy works is as follows: $\mathcal{H}$ is the RKHS with associated kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. We require that $k(x, x') : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a positive definite kernel.

The RKHS $\mathcal{H}$ with kernel $k$ includes functions of the form $f(x) = \sum_i w_i k(x, x_i)$, equipped with RKHS inner product $\langle f, g \rangle_{\mathcal{H}} = \sum_{ij} w_i v_j k(x_i, x_j)$ for $g = \sum_j v_j k(x, x_j)$ and RKHS norm $\|f\|_{\mathcal{H}}^2 = \sum_{ij} w_i w_j k(x_i, x_j)$.

Now we have some extra structure:

$$\begin{aligned} f(x) &= \langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{H}} && \text{(reproducing property)} \\ \nabla_x f(x) &= \langle f(\cdot), \nabla_x k(x, \cdot) \rangle_{\mathcal{H}} && \text{(gradient property)} \end{aligned}$$
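To make these two properties concrete, here is a minimal JAX sketch for a finite kernel expansion. The Gaussian RBF kernel, the random centres and weights, and the helper names `rbf`, `gram`, `f_eval` and `rkhs_inner` are all my own illustrative assumptions, not anything fixed by the references. The inner product is written for two expansions with possibly different centres, which reduces to the formula above when the centres coincide.

```python
import jax
import jax.numpy as jnp

def rbf(x, y, h=1.0):
    # Gaussian RBF kernel on R^d (an assumed choice of k)
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2 * h ** 2))

def gram(xs, ys):
    # Matrix of kernel evaluations K[i, j] = k(xs[i], ys[j])
    return jax.vmap(lambda x: jax.vmap(lambda y: rbf(x, y))(ys))(xs)

def f_eval(x, w, centres):
    # f(x) = sum_i w_i k(x, x_i)
    return jnp.dot(w, jax.vmap(lambda c: rbf(x, c))(centres))

def rkhs_inner(w, xs, v, ys):
    # <f, g>_H for f = sum_i w_i k(., xs[i]) and g = sum_j v_j k(., ys[j])
    return w @ gram(xs, ys) @ v

key_c, key_w, key_x = jax.random.split(jax.random.PRNGKey(0), 3)
centres = jax.random.normal(key_c, (5, 2))
w = jax.random.normal(key_w, (5,))
x = jax.random.normal(key_x, (2,))

# Reproducing property: k(x, .) is the expansion with a single centre x and weight 1
reproduced = rkhs_inner(w, centres, jnp.ones(1), x[None, :])
direct = f_eval(x, w, centres)

# Gradient property: pairing f with grad_x k(x, .) gives sum_i w_i grad_x k(x, x_i)
grad_via_inner = w @ jax.vmap(lambda c: jax.grad(rbf, argnums=0)(x, c))(centres)
grad_direct = jax.grad(f_eval)(x, w, centres)

# Squared RKHS norm as a quadratic form in the Gram matrix
norm_sq = rkhs_inner(w, centres, w, centres)
```

For a finite expansion both properties are just linearity of the inner product, which is why the two routes agree; the content of the RKHS theory is that this carries over to limits of such expansions.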

This is one of two kernels that we invoke; in fact this one gets “Steinalized” using our operator $\mathcal{A}_x$ to give us another kernel:

$$k_{\mathcal{A}}(x, x') := \operatorname{trace}\left[\mathcal{A}_x \mathcal{A}_{x'} k(x, x')\right].$$

If we plug in Equation 2, we get a special form,

$$\begin{aligned} k_{\mathcal{A}}(x, x') ={}& \nabla_x \cdot \nabla_{x'} k(x, x') \\ &+ \nabla_x k(x, x') \cdot \nabla_{x'} \log p(x') \\ &+ \nabla_{x'} k(x, x') \cdot \nabla_x \log p(x) \\ &+ k(x, x') \left(\nabla_x \log p(x)\right) \cdot \left(\nabla_{x'} \log p(x')\right) \end{aligned}$$

Woof! Look at those score functions everywhere!
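As a sanity check on the four-term form, here is a hedged JAX sketch that builds the Steinalized kernel by automatic differentiation and compares it with a hand-derived closed form. The choices are assumptions for illustration only: an RBF kernel with bandwidth `h`, a standard normal target (so the score is $-x$), and the function names `stein_kernel` and `stein_kernel_closed_form`. Note that the first term is a divergence, i.e. the trace of the matrix of mixed second derivatives, not a dot product of two gradients.

```python
import jax
import jax.numpy as jnp

def rbf(x, y, h):
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2 * h ** 2))

def log_p(x):
    # Assumed target: standard normal on R^d, up to an additive constant
    return -0.5 * jnp.sum(x ** 2)

def stein_kernel(x, y, h):
    score_x = jax.grad(log_p)(x)
    score_y = jax.grad(log_p)(y)
    grad_k_x = jax.grad(rbf, argnums=0)(x, y, h)
    grad_k_y = jax.grad(rbf, argnums=1)(x, y, h)
    # Matrix of mixed second derivatives d^2 k / (dy_a dx_b); its trace is div_x grad_y k
    mixed = jax.jacfwd(jax.grad(rbf, argnums=1), argnums=0)(x, y, h)
    return (
        jnp.trace(mixed)
        + jnp.dot(grad_k_x, score_y)
        + jnp.dot(grad_k_y, score_x)
        + rbf(x, y, h) * jnp.dot(score_x, score_y)
    )

def stein_kernel_closed_form(x, y, h):
    # Hand-derived for this RBF kernel and standard normal target only
    d = x.shape[0]
    r2 = jnp.sum((x - y) ** 2)
    return rbf(x, y, h) * (d / h**2 - r2 / h**4 - r2 / h**2 + jnp.dot(x, y))

x = jnp.array([0.3, -1.2])
y = jnp.array([1.0, 0.5])
h = 0.7
auto = stein_kernel(x, y, h)
closed = stein_kernel_closed_form(x, y, h)
```

The closed form is specific to this kernel and target; for other targets only `log_p` needs to change in the autodiff version.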

Moreover,

$$\begin{aligned} \mathbb{E}_{x \sim q}\left[\operatorname{trace} \mathcal{A}_x f(x)\right] &= \sum_{i=1}^{d} \left\langle f_i(\cdot), \mathbb{E}_{x \sim q}\left[\mathcal{A}_x k_i(\cdot, x)\right] \right\rangle_{\mathcal{H}} \\ &= \left\langle f(\cdot), \mathbb{E}_{x \sim q}\left[\mathcal{A}_x k(\cdot, x)\right] \right\rangle_{\mathcal{H}^d} \end{aligned}$$

We take $\mathcal{F}$ to be the unit ball in that RKHS, i.e. $\mathcal{F} := \{\boldsymbol{f}; \|\boldsymbol{f}\|_{\mathcal{H}^d} \leq 1\}$.

Then we can write

$$\sqrt{S(q, p)} = \sup_{f \in \mathcal{H}^d,\ \|f\|_{\mathcal{H}^d} \leq 1} \left\{ \mathbb{E}_{x \sim q}\left[\operatorname{trace} \mathcal{A}_x f(x)\right] \right\},$$

i.e. it is just the same as before, but we have restricted the function class to be an RKHS.

Define

$$\beta_{q, p}(\cdot) = \mathbb{E}_{x' \sim q}\left[\mathcal{A}_{x'} k(\cdot, x')\right].$$

Finding that supremum is then equivalent to solving

$$\sup_{f} \left\langle f, \beta_{q, p} \right\rangle_{\mathcal{H}^d}, \quad \text{s.t. } \|f\|_{\mathcal{H}^d} \leq 1.$$

From this we get $\phi(x) = \phi^*_{q, p}(x) / \|\phi^*_{q, p}\|_{\mathcal{H}^d}$, where

$$\phi^*_{q, p}(\cdot) = \mathbb{E}_{x \sim q}\left[\mathcal{A}_x k(x, \cdot)\right], \quad \text{for which we have} \quad S(q, p) = \|\phi^*_{q, p}\|_{\mathcal{H}^d}^2.$$
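The constrained supremum is an instance of Cauchy-Schwarz: a linear functional $\langle f, \beta \rangle$ over the unit ball is maximized by $\beta / \|\beta\|$, and any candidate in the interior can be improved by scaling it up, since the objective is linear in $f$. Here is a small numerical sketch of that fact for finite kernel expansions. The coefficient vector `b` is a purely illustrative stand-in for $\beta_{q,p}$, and the RBF kernel and random centres are assumptions.

```python
import jax
import jax.numpy as jnp

def rbf(x, y, h=0.5):
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2 * h ** 2))

key_c, key_b, key_f = jax.random.split(jax.random.PRNGKey(1), 3)
centres = jax.random.normal(key_c, (6, 2))
# Gram matrix K[i, j] = k(x_i, x_j), which defines the inner product on expansions
K = jax.vmap(lambda a: jax.vmap(lambda c: rbf(a, c))(centres))(centres)

# Coefficients of a stand-in beta = sum_i b_i k(., x_i)
b = jax.random.normal(key_b, (6,))
beta_norm = jnp.sqrt(b @ K @ b)

# Candidate maximizer: beta rescaled to unit RKHS norm
w_star = b / beta_norm
best = w_star @ K @ b  # <beta / ||beta||, beta> = ||beta||

# Random competitors, each rescaled to unit RKHS norm
W = jax.random.normal(key_f, (1000, 6))
W = W / jnp.sqrt(jnp.einsum("ni,ij,nj->n", W, K, W))[:, None]
competitors = W @ K @ b  # <f, beta> for each competitor f
```

None of the unit-norm competitors should beat `best`, up to floating point error, and `best` should equal `beta_norm`, which is why the squared discrepancy comes out as a squared norm.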

We maximize this inner product, I assert, if we set $f = \beta_{q, p} / \|\beta_{q, p}\|_{\mathcal{H}^d}$, normalising it to lie on the boundary of the unit ball (question: why can it not be on the interior?) at the point that maximises the expectation. Thus

$$\begin{aligned} S(q, p) &= \|\beta_{q, p}\|_{\mathcal{H}^d}^2 \\ &= \mathbb{E}_{x, x' \sim q}\left[\kappa_p(x, x')\right] \end{aligned}$$

where

$$\kappa_p(x, x') := \mathcal{A}_x \mathcal{A}_{x'} k(x, x').$$

Here $\mathcal{A}_x$ and $\mathcal{A}_{x'}$ denote the Stein operator acting with respect to the variables $x$ and $x'$, respectively. $\kappa_p(x, x')$ is the “Steinalized” kernel obtained by applying the Stein operator to $k(x, x')$ twice.
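Since the discrepancy is an expectation of the Steinalized kernel over pairs of draws from $q$, it can be estimated from samples alone. The following is a sketch of a U-statistic estimator under assumed choices: an RBF kernel with unit bandwidth, a standard normal target, and the hypothetical helper name `ksd_u_statistic`. It compares draws from the target itself with the same draws shifted to the wrong mean.

```python
import jax
import jax.numpy as jnp

def rbf(x, y, h=1.0):
    return jnp.exp(-jnp.sum((x - y) ** 2) / (2 * h ** 2))

def log_p(x):
    # Assumed target: standard normal, up to an additive constant
    return -0.5 * jnp.sum(x ** 2)

def stein_kernel(x, y):
    sx, sy = jax.grad(log_p)(x), jax.grad(log_p)(y)
    gx = jax.grad(rbf, argnums=0)(x, y)
    gy = jax.grad(rbf, argnums=1)(x, y)
    # Trace of the mixed second derivatives gives the divergence term
    mixed = jax.jacfwd(jax.grad(rbf, argnums=1), argnums=0)(x, y)
    return jnp.trace(mixed) + gx @ sy + gy @ sx + rbf(x, y) * (sx @ sy)

@jax.jit
def ksd_u_statistic(samples):
    # Average of kappa_p(x_i, x_j) over all pairs with i != j
    n = samples.shape[0]
    kappa = jax.vmap(lambda x: jax.vmap(lambda y: stein_kernel(x, y))(samples))(samples)
    return (jnp.sum(kappa) - jnp.trace(kappa)) / (n * (n - 1))

key = jax.random.PRNGKey(0)
matched = jax.random.normal(key, (300, 2))   # draws from the target
shifted = matched + jnp.array([2.0, 0.0])    # same draws with the wrong mean

ksd_matched = ksd_u_statistic(matched)
ksd_shifted = ksd_u_statistic(shifted)
```

The estimate for the matched sample should hover near zero (it can be slightly negative, being an unbiased estimate), while the shifted sample should give a clearly positive value. Only the score of the target is used, never its normalising constant.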

Under suitable regularity conditions, the same quantity can also be written in terms of scores:

$$S(q, p) = \mathbb{E}_{x, x' \sim q}\left[\boldsymbol{\delta}_{q, p}(x)^{\top} k(x, x') \boldsymbol{\delta}_{q, p}(x')\right],$$

where $\boldsymbol{\delta}_{q, p}(x) = s_q(x) - s_p(x)$ is the score difference between $q$ and $p$ (with $s_p = \nabla_x \log p$), and $x, x'$ are i.i.d. draws from $q(x)$.

It is a mess to write out in full though.

4 Stein Variational Gradient Descent

The next bit comes from Q. Liu and Wang (2019). It turns out that we can use this Stein trick to sample from some interesting distributions, by using the Stein discrepancy as a loss function. Interestingly, this works on posterior distributions in particular, since only the score $\nabla_x \log p$ is needed and the unknown normalising constant drops out.

We manufacture an empirical $q$ by using a set of particles $\{x_i\}_{i=1}^{n}$.

The gradient descent here is not SGD, where we assimilate gradient steps by looking at examples; rather, it is a gradient descent on the particle positions in parameter space, which converges towards a good approximation of the posterior.
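For reference, here is the particle update as I understand it from Q. Liu and Wang. It is the empirical version of $\phi^*_{q,p}$ from the previous section, assuming the Stein operator of Equation 2 is the usual Langevin one; the step size $\epsilon$ is a symbol I introduce here.

$$\begin{aligned} x_i &\leftarrow x_i + \epsilon\, \hat{\phi}^*(x_i), \\ \hat{\phi}^*(x) &= \frac{1}{n} \sum_{j=1}^{n} \left[ k(x_j, x)\, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x) \right]. \end{aligned}$$

The first term is a kernel-weighted average of scores, which drags particles towards regions of high density; the second is a repulsion between particles, which stops them all collapsing onto the mode.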

A worked example will sort this out.

import jax
import jax.numpy as jnp
import numpy as np
from jax import grad, jit, vmap
from jax.scipy.stats import norm
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "none"

# Define the target distribution (zero-mean bivariate normal with unit variances and correlation rho)
def log_p(x, rho):
    return -0.5 * (
        x[..., 0]**2 + x[..., 1]**2
         - 2 * rho * x[..., 0] * x[..., 1]
    ) / (
        1 - rho**2
    ) - jnp.log(
        2 * jnp.pi * jnp.sqrt(1 - rho**2)
    )
# Define the RBF kernel
def rbf_kernel(x, y, h):
    return jnp.exp(-jnp.sum((x - y)**2) / (2 * h**2))

# Compute the Stein kernel k_A(x, x') for the RBF kernel and the target log_p
@jit
def stein_kernel(x, x_prime, h, rho):
    k = rbf_kernel(x, x_prime, h)
    grad_k_x = grad(rbf_kernel, argnums=0)(x, x_prime, h)
    grad_k_x_prime = grad(rbf_kernel, argnums=1)(x, x_prime, h)
    # Matrix of mixed second derivatives d^2 k / (dx'_a dx_b)
    mixed_hessian = jax.jacfwd(grad(rbf_kernel, argnums=1), argnums=0)(x, x_prime, h)

    grad_log_p_x = grad(log_p)(x, rho)
    grad_log_p_x_prime = grad(log_p)(x_prime, rho)

    # The first term is a divergence (a trace), not a product of two gradients
    term1 = jnp.trace(mixed_hessian)
    term2 = jnp.dot(grad_k_x, grad_log_p_x_prime)
    term3 = jnp.dot(grad_k_x_prime, grad_log_p_x)
    term4 = k * jnp.dot(grad_log_p_x, grad_log_p_x_prime)

    return term1 + term2 + term3 + term4

# Define the SVGD update. Each particle x_i moves along
#   phi(x_i) = mean_j [ k(x_j, x_i) * grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
# Note this uses the base kernel and the scores, not the scalar Stein kernel.
@jit
def svgd_update(particles, h, rho, lr=0.001):
    # Target scores at every particle, shape (n_particles, 2)
    scores = vmap(grad(log_p), in_axes=(0, None))(particles, rho)

    def phi(x_i):
        # k(x_j, x_i) for every j, shape (n_particles,)
        k_vals = vmap(rbf_kernel, in_axes=(0, None, None))(particles, x_i, h)
        # Gradient of k(x_j, x_i) with respect to x_j, shape (n_particles, 2)
        grad_k = vmap(grad(rbf_kernel, argnums=0), in_axes=(0, None, None))(particles, x_i, h)
        # Kernel-weighted attraction to high density, plus repulsion between particles
        return jnp.mean(k_vals[:, None] * scores + grad_k, axis=0)

    return particles + lr * vmap(phi)(particles)

# Initialize particles
key = jax.random.PRNGKey(0)
n_particles = 20
particles = jax.random.normal(key, (n_particles, 2))  # 2D particles

# Set kernel bandwidth (comparable to the particle spacing, so that particles interact)
h = 1.0
rho = 0.8  # Correlation parameter

# Set the number of iterations
n_iterations = 1000

# Store particle locations at different stages
initial_particles = particles.copy()
mid_particles = None
final_particles = None

# Run SVGD, with a step size large enough for the particles to move visibly
for i in range(n_iterations):
    particles = svgd_update(particles, h, rho, lr=0.05)

    # Detect NaN values
    if jnp.isnan(particles).any():
        num_nan_particles = jnp.isnan(particles).any(axis=1).sum()
        raise ValueError(f"Detected {num_nan_particles} NaN values at iteration {i + 1}. Diagnostic Info: particles shape {particles.shape}, step number {i + 1}")

    if i == n_iterations // 2:
        mid_particles = particles.copy()

final_particles = particles.copy()

# Create a mesh grid for plotting the density
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
X, Y = np.meshgrid(x, y)
XY = np.stack([X.ravel(), Y.ravel()], axis=-1)

# Compute the density on the mesh grid, using the same rho as the SVGD target
Z = np.exp(jax.vmap(lambda xy: log_p(xy, rho))(XY).reshape(X.shape))

fig = go.Figure()

# Add the density heatmap
fig.add_trace(go.Contour(
    x=x,
    y=y,
    z=Z,
    colorscale='Viridis',
    opacity=0.5
))

# Add the scatter plot of initial particles
fig.add_trace(go.Scatter(
    x=initial_particles[:, 0],
    y=initial_particles[:, 1],
    mode='markers',
    marker=dict(size=5, color='red', opacity=0.8),
    name='Initial Particles'
))

# Add the scatter plot of middle particles
fig.add_trace(go.Scatter(
    x=mid_particles[:, 0],
    y=mid_particles[:, 1],
    mode='markers',
    marker=dict(size=5, color='green', opacity=0.8),
    name='Middle Particles'
))

# Add the scatter plot of final particles
fig.add_trace(go.Scatter(
    x=final_particles[:, 0],
    y=final_particles[:, 1],
    mode='markers',
    marker=dict(size=5, color='blue', opacity=0.8),
    name='Final Particles'
))

fig.update_layout(
    title='SVGD Particles at Different Stages with Target Density',
    xaxis_title='x1',
    yaxis_title='x2',
    font=dict(family="Alegreya, serif"),
    template=pio.templates.default,
    legend=dict(
        x=0,
        y=1,
        traceorder='normal',
        font=dict(size=12),
        bgcolor='rgba(255, 255, 255, 0.5)',
        bordercolor='Black',
        borderwidth=1
    )
)

fig.show()

Figure 4: Let us check out some generic Stein VI

5 For mixtures

Mixtures in general are helpful in variational inference (Ranganath, Tran, and Blei 2016). See ELBO-within-Stein (Rønning et al. 2021), Nonlinear Stein (D. Wang and Liu 2019), Stein Mixtures (Nalisnick and Smyth 2017)…

6 Stochastic variants

(Li et al. 2020; Zhang et al. 2020)

7 As moment matching

See Q. Liu and Wang (2018).

8 By message passing

Define a kernel over factors and now the Stein updates become local messages. Discovered simultaneously in 2018 by D. Wang, Zeng, and Liu (2018) and Zhuo et al. (2018), and elaborated, expanded and varied in subsequent works (Pavlasek et al. 2024; Zhou and Qiu 2023).

This turns Stein VGD into a particle message passing algorithm, so read more there.

9 Incoming

Qiang Liu did a lot of the groundwork and produced some helpful resources
The Stein Gradient | Sanyam Kapoor

10 References

Abbasi-Yadkori, Pacchiano, and Phan. 2020. “Regret Balancing for Bandit and RL Model Selection.” arXiv.org.

Alsup, Venturi, and Peherstorfer. 2022. “Multilevel Stein Variational Gradient Descent with Applications to Bayesian Inverse Problems.” In Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference.

Ambrogioni, Güçlü, Güçlütürk, et al. 2018. “Wasserstein Variational Inference.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.

Anastasiou, Barp, Briol, et al. 2023. “Stein’s Method Meets Computational Statistics: A Review of Some Recent Developments.” Statistical Science.

Chakraborty, Bedi, Koppel, et al. 2023. “STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning.” In Proceedings of the 40th International Conference on Machine Learning.

Chen, and Ghattas. 2020. “Projected Stein Variational Gradient Descent.” In Proceedings of the 34th International Conference on Neural Information Processing Systems. NIPS ’20.

Chen, Wu, Chen, et al. 2020. “Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions.” In.

Chu, Minami, and Fukumizu. 2022. “The Equivalence Between Stein Variational Gradient Descent and Black-Box Variational Inference.” In.

Chwialkowski, Strathmann, and Gretton. 2016. “A Kernel Test of Goodness of Fit.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.

Detommaso, Cui, Spantini, et al. 2018. “A Stein Variational Newton Method.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.

Detommaso, Hoitzing, Cui, et al. 2019. “Stein Variational Online Changepoint Detection with Applications to Hawkes Processes and Neural Networks.” arXiv:1901.07987 [Cs, Stat].

Feng, Wang, and Liu. 2017. “Learning to Draw Samples with Amortized Stein Variational Gradient Descent.” In UAI 2017.

Gong, Peng, and Liu. 2019. “Quantile Stein Variational Gradient Descent for Batch Bayesian Optimization.” In Proceedings of the 36th International Conference on Machine Learning.

Gorham, and Mackey. 2015. “Measuring Sample Quality with Stein’s Method.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. NIPS’15.

———. 2017. “Measuring Sample Quality with Kernels.” In Proceedings of the 34th International Conference on Machine Learning.

Gorham, Raj, and Mackey. 2020. “Stochastic Stein Discrepancies.” arXiv:2007.02857 [Cs, Math, Stat].

Han, Ding, Liu, et al. 2020. “Stein Variational Inference for Discrete Distributions.” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics.

Han, and Liu. 2018. “Stein Variational Gradient Descent Without Gradient.” In Proceedings of the 35th International Conference on Machine Learning.

Huggins, Campbell, Kasprzak, et al. 2018. “Scalable Gaussian Process Inference with Finite-Data Mean and Variance Guarantees.” arXiv:1806.10234 [Cs, Stat].

Ley, Reinert, and Swan. 2017. “Stein’s Method for Comparison of Univariate Distributions.” Probability Surveys.

Li, Li, Liu, et al. 2020. “A Stochastic Version of Stein Variational Gradient Descent for Efficient Sampling.” Communications in Applied Mathematics and Computational Science.

Liu, Qiang. 2016a. “A Short Introduction to Kernelized Stein Discrepancy.”

———. 2016b. “Stein Variational Gradient Descent: Theory and Applications.”

———. 2017. “Stein Variational Gradient Descent as Gradient Flow.”

Liu, Qiang, Lee, and Jordan. 2016. “A Kernelized Stein Discrepancy for Goodness-of-Fit Tests.” In Proceedings of The 33rd International Conference on Machine Learning.

Liu, Qiang, and Wang. 2018. “Stein Variational Gradient Descent as Moment Matching.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.

———. 2019. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In Advances In Neural Information Processing Systems.

Liu, Chang, and Zhu. 2018. “Riemannian Stein Variational Gradient Descent for Bayesian Inference.” Proceedings of the AAAI Conference on Artificial Intelligence.

Liu, Chang, Zhuo, Cheng, et al. 2019. “Understanding and Accelerating Particle-Based Variational Inference.” In Proceedings of the 36th International Conference on Machine Learning.

Liu, Xing, Zhu, Ton, et al. 2022. “Grassmann Stein Variational Gradient Descent.” In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.

Markatou, Karlis, and Ding. 2021. “Distance-Based Statistical Inference.” Annual Review of Statistics and Its Application.

Matsubara, Knoblauch, Briol, et al. 2022. “Robust Generalised Bayesian Inference for Intractable Likelihoods.” Journal of the Royal Statistical Society Series B: Statistical Methodology.

Nalisnick, and Smyth. 2017. “Variational Inference with Stein Mixtures.” In NIPS2017 (Workshop).

Oates, Girolami, and Chopin. 2017. “Control Functionals for Monte Carlo Integration.” Journal of the Royal Statistical Society Series B: Statistical Methodology.

Pavlasek, Mah, Xu, et al. 2024. “Stein Variational Belief Propagation for Multi-Robot Coordination.”

Pielok, Bischl, and Rügamer. 2023. “Approximate Bayesian Inference with Stein Functional Variational Gradient Descent.” In.

Pulido, and van Leeuwen. 2019. “Sequential Monte Carlo with Kernel Embedded Mappings: The Mapping Particle Filter.” Journal of Computational Physics.

Pulido, Van Leeuwen, and Posselt. 2019. “Kernel Embedded Nonlinear Observational Mappings in the Variational Mapping Particle Filter.” In Computational Science – ICCS 2019. ICCS 2019. Lecture Notes in Computer Science.

Ranganath, Tran, and Blei. 2016. “Hierarchical Variational Models.” In PMLR.

Rønning. 2023. “A Probabilistic Approach to the Protein Folding Problem Using Stein-Based Variational Inference.”

Rønning, Al-Sibahi, Ley, et al. 2021. “EinSteinVI: General and Integrated Stein Variational Inference.”

Stordal, Moraes, Raanes, et al. 2021. “P-Kernel Stein Variational Gradient Descent for Data Assimilation and History Matching.” Mathematical Geosciences.

Tamang, Ebtehaj, van Leeuwen, et al. 2021. “Ensemble Riemannian Data Assimilation over the Wasserstein Space.” Nonlinear Processes in Geophysics.

Wang, Dilin, and Liu. 2019. “Nonlinear Stein Variational Gradient Descent for Learning Diversified Mixture Models.” In Proceedings of the 36th International Conference on Machine Learning.

Wang, Ziyu, Ren, Zhu, et al. 2018. “Function Space Particle Optimization for Bayesian Neural Networks.” In.

Wang, Dilin, Tang, Bajaj, et al. 2019. “Stein Variational Gradient Descent with Matrix-Valued Kernels.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems.

Wang, Dilin, Zeng, and Liu. 2018. “Stein Variational Message Passing for Continuous Graphical Models.”

Wen, and Li. 2022. “Affine-Mapping Based Variational Ensemble Kalman Filter.” Statistics and Computing.

Xu, and Matsuda. 2021. “Interpretable Stein Goodness-of-Fit Tests on Riemannian Manifolds.” arXiv:2103.00895 [Stat].

Yang, Liu, Rao, et al. 2018. “Goodness-of-Fit Testing for Discrete Distributions via Stein Discrepancy.” In Proceedings of the 35th International Conference on Machine Learning.

Zhang, Zhang, Carin, et al. 2020. “Stochastic Particle-Optimization Sampling and the Non-Asymptotic Convergence Theory.” In International Conference on Artificial Intelligence and Statistics.

Zhao, Wang, Zhu, et al. 2023. “Stein Variational Gradient Descent with Learned Direction.” Information Sciences.

Zhou, and Qiu. 2023. “Augmented Message Passing Stein Variational Gradient Descent.”

Zhuo, Liu, Shi, et al. 2018. “Message Passing Stein Variational Gradient Descent.” In Proceedings of the 35th International Conference on Machine Learning.