(Kernelized) Stein variational gradient descent and other computational Stein discrepancy methods

KSVD, SVGD

November 2, 2022 — March 12, 2024

approximation
Bayes
functional analysis
Markov processes
measure
metrics
Monte Carlo
optimization
probabilistic algorithms
probability
statistics
Figure 1: Unit balls in the RKHS.
Figure 2: Q. Liu (2016a)’s diagram of the relationships between this method, kernel methods, Stein’s method, variational inference, maximum mean discrepancy and Fisher information. Bonus points for coining Steinalization.

Stein’s method meets variational inference via kernels and probability measures. The result is a method of inference which maintains an ensemble of particles that notionally, collectively, sample from some target distribution. I should learn about this, as one of the methods I might use for low-assumption Bayes inference.

There seems to be a standard narrative for introducing this material, which I find very confusing, so here I walk through it with laborious worked examples.

1 Stein operators

This seems to have been invented in Q. Liu, Lee, and Jordan (2016) and Chwialkowski, Strathmann, and Gretton (2016), weaponized in Q. Liu (2016b).

Let us introduce the bits we need. We start with the classic Stein’s identity, which with a small amount of work gives us an easy trick to approximate densities.

We care about a target density \(p\) and another density \(q\). \(p\) should be differentiable. Both are, by assumption, defined over \(\mathcal{X}\subseteq\mathbb{R}^d\). We also introduce a family \(\mathcal{F}\) of test functions from \(\mathbb{R}^d\) to \(\mathbb{R}^d\). We require that \(\lim_{x \to \pm \infty} p(xb)f(xb) = 0\) for all unit vectors \(\|b\|=1\), and some other stuff which we get to in a moment.

I should say more about the generic class \(\mathcal{F}\); Q. Liu, Lee, and Jordan (2016), for example, does.

Next, we choose a Stein operator \(\mathcal{A}_p: \mathcal{F}\to \mathcal{G}\). Here \(\mathcal{F}\) and \(\mathcal{G}\) are spaces of functions from \(\mathbb{R}^d\) to \(\mathbb{R}^d\); I gave them different names because it is not clear to me that they are necessarily the same space, but we can probably ignore that detail for now. We do not go into the details of the requirements on these spaces, but they should contain smooth functions that are square-integrable (with respect to the target distribution \(p\)? or Lebesgue measure? something else?) and whose derivatives also go to zero at infinity. For an operator \(\mathcal{A}_p\) to be a Stein operator for a target distribution \(p(x)\), it must satisfy the following key property:

Stein’s Operators: \(\mathcal{A}_p\) is a Stein operator with respect to a suitable class of test functions \(\mathcal{F}\) if the expectation of \(\mathcal{A}_p f(X)\) under the target distribution \(p(x)\) is zero for every \(f\) in \(\mathcal{F}\), i.e.

\[ \mathbb{E}_{p}[\mathcal{A}_p f(X)] = 0. \tag{1}\]

Note that \(p\) restricts our choice of \(\mathcal{A}_p\) but AFAICT does not uniquely determine it. I do not know if that is useful. I wonder if we can choose \(\mathcal{F}\) cunningly to admit some spicy alternate \(\mathcal{A}_p\).

For \(\mathcal{F}\) which include a non-trivial linear subspace, we can see that \(\mathcal{A}_p\) must be linear, because expectation is linear, and otherwise we could make linear changes to an \(f\) and end up violating the equality.

The classic choice is to set the Stein Operator to \[ \mathcal{A}_p f(x):=f(x) \cdot \nabla_x \log p(x)+\nabla_x \cdot f(x). \tag{2}\]

AFAICT we are essentially assuming an exponential family density here, or something?
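For my own peace of mind, here is the usual integration-by-parts argument for why Equation 2 satisfies Equation 1. As far as I can tell it needs only differentiability of \(\log p\) and the decay condition above, not any particular parametric family:

\[
\begin{aligned}
\mathbb{E}_{p}[\mathcal{A}_p f(X)]
&= \int \big(f(x)\cdot\nabla_x \log p(x) + \nabla_x \cdot f(x)\big)\, p(x)\,\mathrm{d}x\\
&= \int \big(f(x)\cdot\nabla_x p(x) + p(x)\,\nabla_x \cdot f(x)\big)\,\mathrm{d}x\\
&= \int \nabla_x \cdot \big(p(x) f(x)\big)\,\mathrm{d}x\\
&= 0,
\end{aligned}
\]

where the last step uses the divergence theorem and the requirement that \(p(xb)f(xb)\to 0\) as \(x\to\pm\infty\).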

Anyway, let us make this more concrete by choosing a specific \(p\) which is not trivial but not baffling either. I reckon a 2d Gaussian with standard deviation 1 and mean 0 will do the trick. Let us give it a correlation \(\rho\), which we leave unspecified, to keep things spicy. This implies mean \(\mu = (0, 0)\) and covariance \(\Sigma = \left[\begin{smallmatrix} 1 & \rho \\ \rho & 1 \end{smallmatrix}\right]\), and thence inverse covariance \(\Sigma^{-1} = \tfrac{1}{1-\rho^2}\left[\begin{smallmatrix} 1 & -\rho \\ -\rho & 1 \end{smallmatrix}\right]\). The pdf for this distribution is \[ \begin{aligned} p(x) &= \frac{1}{2\pi \sqrt{|\Sigma|}} \exp\left(-\frac{1}{2} (x-\mu)^{\top} \Sigma^{-1} (x-\mu)\right)\\ &= \frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(-\frac{1}{2} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \frac{1}{1-\rho^2} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)\\ &= \frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\right). \end{aligned} \]

We can simplify the Stein operator for this choice of \(p\), since \[ \begin{aligned} \nabla_x \log p(x) &= \nabla_x \log \left(\frac{1}{2\pi \sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\right)\right)\\ &= \nabla_x \left( -\frac{1}{2(1-\rho^2)} \left( x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\right)\\ &= -\frac{1}{2(1-\rho^2)}\nabla_x\left(x_1^2 - 2\rho x_1 x_2 + x_2^2\right)\\ &= -\frac{1}{1-\rho^2}\begin{bmatrix} x_1 - \rho x_2\\ x_2 - \rho x_1 \end{bmatrix}. \end{aligned} \]
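As a paranoid aside, we can check that this hand-derived score agrees with autodiff. A minimal sketch, picking an arbitrary \(\rho=0.5\) just for the check (the variable and function names here are mine):

Code
import jax.numpy as jnp
from jax import grad, vmap

rho = 0.5  # arbitrary correlation, just for the check

# log-density of the zero-mean, unit-variance bivariate Gaussian with correlation rho
def log_p(x):
    return -0.5 * (x[0]**2 + x[1]**2 - 2 * rho * x[0] * x[1]) / (1 - rho**2) \
        - jnp.log(2 * jnp.pi * jnp.sqrt(1 - rho**2))

# the score we derived by hand above
def score_by_hand(x):
    return -jnp.array([x[0] - rho * x[1], x[1] - rho * x[0]]) / (1 - rho**2)

xs = jnp.array([[0.3, -1.2], [2.0, 0.5], [-0.7, -0.1]])
print(jnp.allclose(vmap(grad(log_p))(xs), vmap(score_by_hand)(xs)))  # expect True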

Equation 1 comes out to

\[ \begin{aligned} \mathbb{E}_p[\mathcal{A}_p f(X)] &= \mathbb{E}_p\Bigg[ \begin{bmatrix} -\frac{X_1 - \rho X_2}{1-\rho^2} \\ -\frac{X_2 - \rho X_1}{1-\rho^2} \end{bmatrix} \cdot \begin{bmatrix} f_1(X) \\ f_2(X) \end{bmatrix} \\ & \qquad + \begin{bmatrix} \partial_{X_1} \\ \partial_{X_2} \end{bmatrix} \cdot \begin{bmatrix} f_1(X) \\ f_2(X) \end{bmatrix} \Bigg]\\ &= \mathbb{E}_p\left[ -\frac{f_1(X)(X_1 - \rho X_2) + f_2(X)(X_2 - \rho X_1)}{1-\rho^2} + \partial_{X_1} f_1(X) + \partial_{X_2} f_2(X) \right]. \end{aligned} \]

We can choose some stupid \(f\). A linear one is \(f(x) = \begin{bmatrix} a_1 x_1 + b_1 x_2 \\ a_2 x_1 + b_2 x_2 \end{bmatrix}\). The expectation of the Stein operator for our bivariate Gaussian is then

\[ \begin{aligned} \mathbb{E}_p[\mathcal{A}_p f(X)] &= \mathbb{E}_p\left[ \begin{bmatrix} -\frac{X_1 - \rho X_2}{1-\rho^2} \\ -\frac{X_2 - \rho X_1}{1-\rho^2} \end{bmatrix} \cdot \begin{bmatrix} a_1 X_1 + b_1 X_2 \\ a_2 X_1 + b_2 X_2 \end{bmatrix} + a_1 + b_2 \right]\\ &=\frac{1}{1-\rho^2} \mathbb{E}_p\big[(-a_1 +a_2 \rho) X_1^2 \\ & \qquad + ( a_1 \rho -a_2 -b_1 +b_2 \rho)X_1 X_2 \\ & \qquad +(b_1 \rho -b_2) X_2^2\big] + a_1 + b_2 \\ &=\frac{1}{1-\rho^2} \big[ (-a_1 +a_2 \rho) \operatorname{var} X_1 \\ &\qquad+( a_1 \rho -a_2 -b_1 +b_2 \rho)\operatorname{cov}(X_1, X_2)\\ &\qquad +(b_1 \rho -b_2) \operatorname{var}X_2\big] + a_1 + b_2 \\ &=\frac{-a_1 +a_1 \rho ^2+a_2 \rho - a_2 \rho-b_1 \rho+b_1 \rho +b_2 \rho^2 -b_2}{1-\rho^2} + a_1 + b_2 \\ &=\frac{-a_1(1- \rho ^2) -b_2 (1 -\rho^2)}{1-\rho^2} + a_1 + b_2 =0. \end{aligned} \]

Phew! OK that worked. I itch to plot this, but note that even in 2 dimensions it is tricky to plot, because our functions are 2d functions of 2d inputs, so we need 4 dimensions to plot them.

I could do a 1d example, but you can find loads of those on the internet, and I did not find them helpful because it is too easy to lose sight of how this method handles high-dimensional spaces.

We can juuuuust about do it in 2 dimensions by using a quiver plot and treating it as a weird vector field; here the arrows show the output space for an arbitrary stoopid linear \(f\) and the contours show the density of the target distribution \(p\).

Code
import jax.numpy as jnp
import numpy as np
from jax import grad, vmap, jacfwd, jacrev
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
from _plotly_styles import textbook

pio.templates.default = "none"
# Define shared parameters for scale
A_scale_factor = 1.5
f_scale_factor = 0.1
arrow_scale_factor = 0.3
n = 25

# Define the generic function f
def f(x, a1, b1, a2, b2):
    return jnp.stack([
        a1 * x[..., 0] + b1 * x[..., 1],
        a2 * x[..., 0] + b2 * x[..., 1]],
        axis=-1)

# Define the log-density of the generic probability density function p
def log_p(x, rho):
    return -0.5 * (
        x[..., 0]**2 + x[..., 1]**2
         - 2 * rho * x[..., 0] * x[..., 1]
    ) / (
        1 - rho**2
    ) - jnp.log(
        2 * jnp.pi * jnp.sqrt(1 - rho**2)
    )

# Fix specific values for rho and the parameters of f
rho = 0.5
a1, b1, a2, b2 = 1, 0.25, -0.75, -1
f_specific = lambda x: f(x, a1, b1, a2, b2)
log_p_specific = lambda x: log_p(x, rho)

# Apply the matrix-valued Stein operator to f under log_p:
# [A_p f(x)]_{ij} = f_i(x) * d(log p)/dx_j + d(f_i)/dx_j;
# its trace recovers the scalar operator of Equation 2.
def A_p_f(x, f, log_p):
    grad_log_p = vmap(grad(log_p))(x)   # scores at each point, shape (N, 2)
    jac_f = vmap(jacfwd(f))(x)          # Jacobians of f, shape (N, 2, 2)
    return grad_log_p[:, None, :] * f(x)[:, :, None] + jac_f

# Create a grid of points
x1, x2 = np.meshgrid(
    np.linspace(-3, 3, n, endpoint=True),
    np.linspace(-3, 3, n, endpoint=True)
)
x = np.stack([x1, x2], axis=-1).reshape(-1, 2)

# Compute the function f at each point in x
f_x = f_specific(x)
p_x = np.exp(log_p_specific(x))

# Compute the Stein operator for f at each point in x
A_f_p_x = A_p_f(x, f_specific, log_p_specific)
p_A_f_p_x = A_f_p_x * p_x[:, None, None]

# Create a subplot with 2 rows and 1 column
fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=(
        '<i>f</i>',
        '<i>p A<sub>p</sub> f</i>'),
    vertical_spacing=0.1,
    row_heights=[0.5, 0.5])

textbook(fig)
# Add the quiver plot for the function f to the first subplot
fig.add_trace(
    ff.create_quiver(
        x1, x2,
        f_x[:, 0].reshape(x1.shape), f_x[:, 1].reshape(x2.shape),
        scale=f_scale_factor,
        arrow_scale=arrow_scale_factor,
        name='f', line_width=1).data[0],
    row=1, col=1
)

# Add the quiver plot for the first component of the Stein operator to the second subplot
fig.add_trace(
    ff.create_quiver(
        x1, x2,
        p_A_f_p_x[:, 0, 0].reshape(x1.shape),
        p_A_f_p_x[:, 1, 0].reshape(x2.shape),
        scale=A_scale_factor,
        arrow_scale=arrow_scale_factor,
        name='Component 1', line_width=1,
        line_color='blue').data[0],
    row=2, col=1
)

# Add the quiver plot for the second component of the Stein operator to the second subplot
fig.add_trace(
    ff.create_quiver(
        x1, x2,
        p_A_f_p_x[:, 0, 1].reshape(x1.shape),
        p_A_f_p_x[:, 1, 1].reshape(x2.shape),
        scale=A_scale_factor,
        arrow_scale=arrow_scale_factor,
        name='Component 2', line_width=1,
        line_color='red').data[0],
    row=2, col=1
)

# Set the layout
fig.update_layout(
    width=400,
    height=800,
    font=dict(family="Alegreya, serif"),
    paper_bgcolor='rgba(0,0,0,0)',
    plot_bgcolor='rgba(0,0,0,0)',
    showlegend=False
)

# Update axis ranges and aspect ratio for both subplots
axis_range = [-3, 3]
for i in range(1, 3):
    fig.update_xaxes(
        range=axis_range, row=i, col=1)
    fig.update_yaxes(
        range=axis_range, row=i, col=1,
        scaleanchor=f'x{i}', scaleratio=1)

# Show the plot
fig.show()
Figure 3: A sorta-kinda visualization of the Stein operator for a 2d Gaussian

Did that help us? Well, kinda. It is not really clear to me that I should trust that the second figure should actually integrate to 0. Did it?

p_A_f_p_x.sum().item()
-0.21152114868164062

Hm, not exactly convincingly 0. Numerical approximation problem, truncation error, or actual bug? TBD. For now I’m out of time; we really need to be moving along.
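One quick check before moving on, though. My guess is that this is discretisation plus truncation rather than a bug: the sum above is a Riemann sum missing its cell-area factor (\(\Delta^2=0.0625\) on this grid), and the box is truncated at \(\pm 3\), so the boundary terms in the integration by parts are not quite zero. A cleaner test of Equation 1 is to average \(\mathcal{A}_p f\) over exact samples from \(p\). A rough sketch, reusing rho, f_specific, log_p_specific and A_p_f from the listing above (the seed and sample size are arbitrary choices of mine):

Code
# Monte Carlo check of Equation 1: average the (matrix-valued) Stein operator
# over exact samples from p; every entry should be near 0 up to Monte Carlo error.
rng = np.random.default_rng(42)
Sigma = np.array([[1.0, rho], [rho, 1.0]])
samples = rng.multivariate_normal(mean=np.zeros(2), cov=Sigma, size=100_000)
print(A_p_f(samples, f_specific, log_p_specific).mean(axis=0))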

2 Stein discrepancy

We make Equation 1 into a quantity that depends on two potentially-different densities by taking the expectation over a different density \(q\) than the one that generated the operator \(\mathcal{A}_p\), and seeing whether that does something useful, i.e. asking whether

\[ \mathbb{E}_{x\sim q}[\mathcal{A}_p f(x)]\overset{?}{=}0. \tag{3}\]

Spoiler, it does, because if \(p\neq q\) a.e. then it turns out that

\[ \begin{aligned} \mathbb{E}_{x\sim q}[\mathcal{A}_p f(x)] &=\mathbb{E}_{x\sim q}[\mathcal{A}_p f(x)] - \overbrace{\mathbb{E}_{x\sim q}[\mathcal{A}_q f(x)]}^{=0}\\ &=\mathbb{E}_{x\sim q}\big[f(x) \cdot \nabla_x \log p(x)+\nabla_x \cdot f(x)\\ &\qquad-f(x) \cdot \nabla_x \log q(x)-\nabla_x \cdot f(x)\big]\\ &=\mathbb{E}_{x\sim q}\left[f(x) \cdot \nabla_x \log p(x) -f(x) \cdot \nabla_x \log q(x)\right]\\ &=\mathbb{E}_{x\sim q}\left[f(x) \cdot \delta_{p,q}(x) \right] \end{aligned} \]

where \(\delta_{p,q}(x):= \nabla_x \log p(x) -\nabla_x \log q(x)\) is the difference between the score functions of \(p\) and \(q\). By choosing a suitable \(f\) from some sufficiently rich \(\mathcal{F}\) we can make this non-zero unless \(p=q\) a.e., so this quantity tells us something about how distinct two densities \(p\) and \(q\) are, in the slightly weird but credible-seeming sense that what we care about is the difference in their score functions.

This is neat. How can we calculate it in practice? Because, note, we have not specified \(f\). We could fix some \(f\) and use it to measure how different \(p\) and \(q\) are, in some sense. Or we could choose some stochastic process which generates random \(f\)s and estimate it over many \(f\)s, I guess? I assume that has been done.

The Stein discrepancy takes a stronger approach to controlling those \(f\)s. We take the supremum over all \(f\) in some function class \(\mathcal{F}\), so that we know that all the \(f\)s of interest have been controlled, since we are controlling the badness of the worst one in that class:

\[ \sqrt{S(q, p)} = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{x \sim q}[\operatorname{trace}(\mathcal{A}_p f(x))] \right| \]

Notice we snuck in a trace there as well to make it a scalar?

I said the Stein discrepancy, but it looks like each choice of \(\mathcal{F}\) and operator \(\mathcal{A}_p\) gives us a different Stein discrepancy, no?

Let us consider how we might find this ‘worst’ \(f\), which gives us the most powerful guarantee of the difference between \(p\) and \(q\). We use the linearity of the Stein operator \(\mathcal{A}_p\), mentioned earlier. Suppose further that \(f\) can be represented as a finite linear combination \(f(x)=\sum_i w_i f_i(x)\) of a set of basis functions \(f_i(x)\), for some coefficients \(w_i\) s.t. \(\|w\| \leq 1\). Then we can write \[ \mathbb{E}_q\left[\mathcal{A}_p f\right]=\mathbb{E}_q\left[\mathcal{A}_p \sum_i w_i f_i(x)\right]=\sum_i w_i \beta_i, \] where \[ \beta_i=\mathbb{E}_{x \sim q}\left[\mathcal{A}_p f_i(x)\right] . \] The optimal coefficients solve \[ \max _w \sum_i w_i \beta_i, \quad \text { s.t. } \quad\|w\| \leq 1, \] which, for the Euclidean norm, is attained at \(w=\beta/\|\beta\|_2\) with value \(\|\beta\|_2\).
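To make that concrete with the running example, here is a sketch of a finite-basis Stein discrepancy between the correlated Gaussian \(p\) from above and a mean-shifted \(q\), using the four coordinate-linear basis functions and a Euclidean ball on \(w\). It reuses rho, A_p_f and log_p_specific from the earlier listing and takes the trace to get the scalar form of the operator; the particular \(q\), the sample size, and the names are my own arbitrary choices rather than anything from the papers:

Code
# beta_i = E_{x~q}[trace(A_p f_i(x))], estimated by Monte Carlo over samples from q;
# the supremum of sum_i w_i beta_i over ||w||_2 <= 1 is then ||beta||_2.
basis = [
    lambda x: jnp.stack([x[..., 0], 0.0 * x[..., 0]], axis=-1),   # (x_1, 0)
    lambda x: jnp.stack([x[..., 1], 0.0 * x[..., 1]], axis=-1),   # (x_2, 0)
    lambda x: jnp.stack([0.0 * x[..., 0], x[..., 0]], axis=-1),   # (0, x_1)
    lambda x: jnp.stack([0.0 * x[..., 1], x[..., 1]], axis=-1),   # (0, x_2)
]
rng = np.random.default_rng(0)
q_samples = rng.multivariate_normal(         # q: same covariance as p, shifted mean
    mean=np.array([1.0, -0.5]),
    cov=np.array([[1.0, rho], [rho, 1.0]]),
    size=50_000)
beta = jnp.array([
    jnp.trace(A_p_f(q_samples, f_i, log_p_specific), axis1=1, axis2=2).mean()
    for f_i in basis])
print(jnp.linalg.norm(beta))  # finite-basis Stein discrepancy; ~0 (up to MC error) when q = p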

Cool. If we are suspicious of this finite-basis assumption we might want something more general; that is next.

3 Kernelized Stein discrepancy

As usual, when we want to take a trick that works on a finite basis over to an infinite basis, we are likely to wonder whether the kernel trick helps, i.e. choosing the function class \(\mathcal{F}\) to be a reproducing kernel Hilbert space. We could choose other tractable function classes apart from RKHSs (e.g. Gorham and Mackey (2015) construct an upper bound using some set of smooth functions evaluated at a finite set of points).

But actually, everyone does the kernelization thing.

A Kernelized Stein discrepancy uses a special choice of \(\mathcal{F}\), a reproducing kernel Hilbert space (RKHS) \(\mathcal{H}^d\), where \(d\) is the dimension of the input space. \(\mathcal{H}\) is the RKHS with kernel \(k:\mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\), and \(\mathcal{H}^d\) is the product space consisting of elements \(f = (f_1, f_2, \ldots, f_d)\), where each \(f_i \in \mathcal{H}\), with standard inner product \(\langle f, g\rangle_{\mathcal{H}^d}=\sum_{i=1}^d \langle f_i, g_i\rangle_{\mathcal{H}}.\) For some reason, many papers suppress the \(d\) and the product construction, which is utterly baffling for a n00b like me when suddenly the dimensions do not match up. They mention it in Chwialkowski, Strathmann, and Gretton (2016) though.

OK, so let us do the naive thing and hope that we can just build each \(f_i\) from kernel evaluations \(k(x, \cdot)\), where \(x\) is a point in the input space. Then we can write

\[ \sqrt{S(q, p)}=\sup _{f \in \mathcal{H}^d,\; \|f\|_{\mathcal{H}^d} \leq 1}\mathbb{E}_{x \sim q}\left[\mathcal{A}_p f(x)\right] . \] i.e. it is just the same, but we have restricted the function class to be (the unit ball of) an RKHS. It turns out that we can calculate efficiently in this RKHS, because it gives us some extra structure: \[ \begin{aligned} f(x)&=\langle f(\cdot), k(x, \cdot)\rangle_{\mathcal{H}} && \text{ reproducing property}\\ \nabla_x f(x)&=\left\langle f(\cdot), \nabla_x k(x, \cdot)\right\rangle_{\mathcal{H}} \end{aligned} \] Therefore, we have \[ \mathbb{E}_{x \sim q}\left[\mathcal{A}_p f(x)\right]=\left\langle f(\cdot), \mathbb{E}_{x \sim q}\left[\mathcal{A}_p k(\cdot, x)\right]\right\rangle_{\mathcal{H}}, \] where we have shifted both the expectation and the Stein operator onto the kernel function. Then we are working with the kernelized Stein discrepancy. Except… that is not yet something we can evaluate; to get there we need to be a little more careful about the RKHS structure.

Let \(k(x, x^{\prime}): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}\) be a positive definite kernel. The RKHS \(\mathcal{H}\) with kernel \(k\) includes functions of the form \(f(x)=\sum_i w_i k\left(x, x_i\right)\), equipped with inner product \(\langle f, g\rangle_{\mathcal{H}}=\sum_{i j} w_i v_j k\left(x_i, x_j\right)\) for \(g=\sum_j v_j k\left(x, x_j\right)\), and RKHS norm \(\|f\|_{\mathcal{H}}^2=\sum_{i j} w_i w_j k\left(x_i, x_j\right)\). We take \(\mathcal{F}\) to be the unit ball in this RKHS (more precisely, in the product space \(\mathcal{H}^d\)).

Define \[ \beta_{q, p}(\cdot)=\mathbb{E}_{x' \sim q} \mathcal{A}_p k(\cdot, x') . \]

Finding that crucial supremum is then equivalent to solving \[ \sup _f\left\langle f, \beta_{q, p}\right\rangle_{\mathcal{H}}, \quad \text { s.t. }\|f\|_{\mathcal{H}} \leq 1 . \]

We maximise this if we set \(f=\beta_{q, p} /\left\|\beta_{q, p}\right\|_{\mathcal{H}}\), normalising it to be the point on the unit ball that maximises the expectation. Thus \[\begin{aligned} S(q, p) &=\left\|\beta_{q, p}\right\|_{\mathcal{H}}^2\\ &=\mathbb{E}_{x, x^{\prime} \sim q}\left[\kappa_p\left(x, x^{\prime}\right)\right] \end{aligned}\] where \[ \kappa_p\left(x, x^{\prime}\right):=\mathcal{A}_p^x \mathcal{A}_p^{x^{\prime}} k\left(x, x^{\prime}\right) . \] Here \(\mathcal{A}_p^x\) and \(\mathcal{A}_p^{x^{\prime}}\) denote the Stein operator applied with respect to the variables \(x\) and \(x^{\prime}\), respectively. \(\kappa_p\left(x, x^{\prime}\right)\) is the “Steinalized” kernel obtained by applying the Stein operator to \(k\left(x, x^{\prime}\right)\) twice.

It is a mess to write out in full though.
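Concretely, expanding it out (as in Q. Liu, Lee, and Jordan (2016)) gives

\[
\begin{aligned}
\kappa_p(x, x') &= \nabla_x \log p(x) \cdot \nabla_{x'} \log p(x')\, k(x, x')
+ \nabla_x \log p(x) \cdot \nabla_{x'} k(x, x')\\
&\qquad + \nabla_{x'} \log p(x') \cdot \nabla_x k(x, x')
+ \sum_{i=1}^d \frac{\partial^2 k(x, x')}{\partial x_i \partial x'_i},
\end{aligned}
\]

but autodiff will happily assemble it for us. Here is a minimal sketch of a V-statistic estimate of \(S(q,p)\) with an RBF kernel, reusing rho, log_p_specific and the jax imports from the earlier listing; the kernel bandwidth, sample size and function names are my own arbitrary choices:

Code
def rbf(x, y, lengthscale=1.0):
    return jnp.exp(-0.5 * jnp.sum((x - y) ** 2) / lengthscale**2)

def steinalized_kernel(x, y, log_p, k=rbf):
    # kappa_p(x, y) = A_p^x A_p^y k(x, y), assembled from the four terms above
    s_x, s_y = grad(log_p)(x), grad(log_p)(y)
    dk_dx = grad(k, argnums=0)(x, y)
    dk_dy = grad(k, argnums=1)(x, y)
    cross = jnp.trace(jacfwd(grad(k, argnums=0), argnums=1)(x, y))
    return s_x @ s_y * k(x, y) + s_x @ dk_dy + s_y @ dk_dx + cross

def ksd_vstat(xs, log_p, k=rbf):
    # V-statistic estimate of S(q, p) = E_{x, x' ~ q}[kappa_p(x, x')]
    kappa = lambda x, y: steinalized_kernel(x, y, log_p, k)
    return vmap(vmap(kappa, (None, 0)), (0, None))(xs, xs).mean()

# Samples from a shifted Gaussian, i.e. not from p: the estimate should be clearly > 0.
# Samples drawn from p itself should give something near 0.
rng = np.random.default_rng(1)
xs_q = jnp.asarray(rng.multivariate_normal(
    [1.0, -0.5], [[1.0, rho], [rho, 1.0]], size=500))
print(ksd_vstat(xs_q, log_p_specific))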

4 Stein Variational Gradient Descent

We manufacture an empirical \(q\) by using a set of particles \(\{x_i\}_{i=1}^n\).

First confusion: the first time I heard about this, as a person who had been staring at neural nets too hard for too long, I assumed the gradient descent here was SGD, where we assimilate gradient steps by looking at data examples. It is not; it is rather a gradient descent on the particle locations in parameter space, which converges in a batched fashion towards a good approximation of the posterior.

A worked example will sort this out.
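Pending that, here is a minimal sketch of the vanilla SVGD update (Q. Liu and Wang 2019) applied to the 2d Gaussian target from above. Each particle moves along a kernel-weighted average of the other particles’ scores (drift toward high density) plus a repulsive term from the kernel gradient (which keeps the ensemble from collapsing onto the mode). The step size and fixed bandwidth are arbitrary choices of mine; the papers use a median-heuristic bandwidth and fancier step size schedules. It reuses log_p_specific and the imports from the earlier listing.

Code
def svgd_step(particles, log_p, step_size=0.05, lengthscale=0.5):
    # phi*(x_i) = (1/n) sum_j [ k(x_j, x_i) grad log p(x_j) + grad_{x_j} k(x_j, x_i) ]
    scores = vmap(grad(log_p))(particles)                   # (n, d)
    diffs = particles[:, None, :] - particles[None, :, :]   # x_j - x_i, shape (n, n, d)
    sq_dists = jnp.sum(diffs**2, axis=-1)
    K = jnp.exp(-0.5 * sq_dists / lengthscale**2)            # k(x_j, x_i), shape (n, n)
    grad_K = -diffs * K[..., None] / lengthscale**2          # grad wrt x_j of k(x_j, x_i)
    phi = (K[..., None] * scores[:, None, :] + grad_K).mean(axis=0)  # (n, d)
    return particles + step_size * phi

# Start the ensemble well away from the target and let it flow back.
rng = np.random.default_rng(2)
particles = jnp.asarray(rng.normal(loc=3.0, scale=0.5, size=(100, 2)))
for _ in range(500):
    particles = svgd_step(particles, log_p_specific)
# The particle mean and covariance should end up roughly (0, 0) and Sigma.
print(particles.mean(axis=0))
print(jnp.cov(particles.T))

Note that the “descent” really is over the particle locations themselves: the whole ensemble is the variational approximation, and \(\phi^*\) is the steepest-descent direction of the KL divergence to \(p\) within the unit ball of \(\mathcal{H}^d\).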

5 Stochastic variants

(Li et al. 2020; Zhang et al. 2020)

6 Moment matching interpretation

See Q. Liu and Wang (2018).

7 Incoming

8 References

Alsup, Venturi, and Peherstorfer. 2022. Multilevel Stein Variational Gradient Descent with Applications to Bayesian Inverse Problems.” In Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference.
Ambrogioni, Güçlü, Güçlütürk, et al. 2018. Wasserstein Variational Inference.” In Proceedings of the 32Nd International Conference on Neural Information Processing Systems. NIPS’18.
Anastasiou, Barp, Briol, et al. 2022. Stein’s Method Meets Computational Statistics: A Review of Some Recent Developments.”
Chen, Wu, Chen, et al. 2020. Projected Stein Variational Newton: A Fast and Scalable Bayesian Inference Method in High Dimensions.”
Chu, Minami, and Fukumizu. 2022. The Equivalence Between Stein Variational Gradient Descent and Black-Box Variational Inference.” In.
Chwialkowski, Strathmann, and Gretton. 2016. A Kernel Test of Goodness of Fit.” In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. ICML’16.
Detommaso, Cui, Spantini, et al. 2018. A Stein Variational Newton Method.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
Detommaso, Hoitzing, Cui, et al. 2019. Stein Variational Online Changepoint Detection with Applications to Hawkes Processes and Neural Networks.” arXiv:1901.07987 [Cs, Stat].
Feng, Wang, and Liu. 2017. Learning to Draw Samples with Amortized Stein Variational Gradient Descent.” In UAI 2017.
Gong, Peng, and Liu. 2019. Quantile Stein Variational Gradient Descent for Batch Bayesian Optimization.” In Proceedings of the 36th International Conference on Machine Learning.
Gorham, and Mackey. 2015. Measuring Sample Quality with Stein’s Method.” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. NIPS’15.
Gorham, Raj, and Mackey. 2020. Stochastic Stein Discrepancies.” arXiv:2007.02857 [Cs, Math, Stat].
Han, and Liu. 2018. Stein Variational Gradient Descent Without Gradient.” In Proceedings of the 35th International Conference on Machine Learning.
Huggins, Campbell, Kasprzak, et al. 2018. Scalable Gaussian Process Inference with Finite-Data Mean and Variance Guarantees.” arXiv:1806.10234 [Cs, Stat].
Ley, Reinert, and Swan. 2017. Stein’s Method for Comparison of Univariate Distributions.” Probability Surveys.
Li, Li, Liu, et al. 2020. A Stochastic Version of Stein Variational Gradient Descent for Efficient Sampling.” Communications in Applied Mathematics and Computational Science.
Liu, Qiang. 2016a. A Short Introduction to Kernelized Stein Discrepancy.”
———. 2016b. Stein Variational Gradient Descent: Theory and Applications.”
———. 2017. Stein Variational Gradient Descent as Gradient Flow.”
Liu, Qiang, Lee, and Jordan. 2016. A Kernelized Stein Discrepancy for Goodness-of-Fit Tests.” In Proceedings of The 33rd International Conference on Machine Learning.
Liu, Qiang, and Wang. 2018. Stein Variational Gradient Descent as Moment Matching.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.
———. 2019. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” In Advances In Neural Information Processing Systems.
Liu, Chang, and Zhu. 2018. Riemannian Stein Variational Gradient Descent for Bayesian Inference.” Proceedings of the AAAI Conference on Artificial Intelligence.
Liu, Xing, Zhu, Ton, et al. 2022. Grassmann Stein Variational Gradient Descent.” In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics.
Pielok, Bischl, and Rügamer. 2023. Approximate Bayesian Inference with Stein Functional Variational Gradient Descent.” In.
Pulido, and van Leeuwen. 2019. Sequential Monte Carlo with Kernel Embedded Mappings: The Mapping Particle Filter.” Journal of Computational Physics.
Pulido, Van Leeuwen, and Posselt. 2019. Kernel Embedded Nonlinear Observational Mappings in the Variational Mapping Particle Filter.” In Computational Science – ICCS 2019. ICCS 2019. Lecture Notes in Computer Science.
Stordal, Moraes, Raanes, et al. 2021. P-Kernel Stein Variational Gradient Descent for Data Assimilation and History Matching.” Mathematical Geosciences.
Tamang, Ebtehaj, van Leeuwen, et al. 2021. Ensemble Riemannian Data Assimilation over the Wasserstein Space.” Nonlinear Processes in Geophysics.
Wang, Dilin, and Liu. 2019. Nonlinear Stein Variational Gradient Descent for Learning Diversified Mixture Models.” In Proceedings of the 36th International Conference on Machine Learning.
Wang, Ziyu, Ren, Zhu, et al. 2018. Function Space Particle Optimization for Bayesian Neural Networks.” In.
Wang, Dilin, Tang, Bajaj, et al. 2019. Stein Variational Gradient Descent with Matrix-Valued Kernels.” In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
Wang, Dilin, Zeng, and Liu. 2018. Stein Variational Message Passing for Continuous Graphical Models.”
Wen, and Li. 2022. Affine-Mapping Based Variational Ensemble Kalman Filter.” Statistics and Computing.
Xu, and Matsuda. 2021. Interpretable Stein Goodness-of-Fit Tests on Riemannian Manifolds.” arXiv:2103.00895 [Stat].
Zhang, Zhang, Carin, et al. 2020. Stochastic Particle-Optimization Sampling and the Non-Asymptotic Convergence Theory.” In International Conference on Artificial Intelligence and Statistics.
Zhao, Wang, Zhu, et al. 2023. Stein Variational Gradient Descent with Learned Direction.” Information Sciences.
Zhou, and Qiu. 2023. Augmented Message Passing Stein Variational Gradient Descent.”
Zhuo, Liu, Shi, et al. 2018. Message Passing Stein Variational Gradient Descent.” In Proceedings of the 35th International Conference on Machine Learning.