Nearly sufficient statistics and information bottlenecks
Also, rate distortion by other means, and the measure-valued sufficient statistic
2018-03-13 — 2026-05-26
In Which the Posterior Distribution Is Recognised as the Canonical Measure-Valued Sufficient Statistic, Found to Recur Under Many Names Across Disparate Fields, and Subjected to Compression via the Information Bottleneck.
🏗
I’m working through a small realization, for my own interest, which has helped my understanding of variational Bayes; and specifically in relating it to non-Bayes variational inference. Also sequential Monte Carlo.
Sufficient statistics are the Garden of Edenference. Formally, a statistic \(T\) — any function of the data — is sufficient for \(\theta\) iff the conditional law of the data given \(T\) doesn’t depend on \(\theta\),
\[ p(x\mid T(x),\theta)=p(x\mid T(x)), \]
equivalently the Fisher–Neyman factorisation \(p(x;\theta)=h(x)\,g(T(x);\theta)\). Everything the data say about \(\theta\) is funnelled through \(T\); conditional on \(T\), the rest of \(x\) is noise as far as \(\theta\) is concerned.
Sufficiency is a property of a function of the data; it says nothing about where that function takes values, i.e the codomain of \(T\). Nothing in the definition forces \(T\) to be small, finite-dimensional, or even a vector. The dream, though, is that it is: a fixed-size summary we can carry around in place of the full dataset. In the easy case the codomain is \(\mathbb{R}^k\) with \(k\) fixed — a vector summary, our data is infinitely compressible.
Turns out, Edenference is small. By Pitman–Koopman–Darmois (Pitman 1936; Koopman 1936; Darmois 1935), a sufficient statistic whose dimension stays fixed as the sample size \(n\) grows essentially characterises exponential families. Under even mild generalizations, things are not so nice. In mixtures, hierarchical models, anything modern including and especially neural networks — there is no fixed-size \(T\) that is exactly sufficient and which characterizes the posterior.
1 The sufficient statistic is a distribution
In general, the natural codomain of \(T\) is the space of probability measures on \(\Theta\), and the canonical sufficient statistic is the likelihood function, or equivalently in Bayes, the posterior distribution over \(\theta\) given \(x\). This still is a sufficient statistic, if we let \(T\) live somewhere bigger than \(\mathbb{R}^k\). The posterior \(\theta \mapsto p(\theta\mid x)\) is itself a function of the data — a statistic in the original sense — it just takes values in the space of probability measures on \(\Theta\) rather than in \(\mathbb{R}^k\). Bayesian sufficiency (Halmos and Savage 1949; Bahadur 1954) in fact says precisely this: \(T\) is sufficient iff, for every prior, the posterior depends on \(x\) only through \(T\) — so the posterior measure is itself sufficient, tautologically.
For prediction the analogue is the conditional law of the future given the past; Knight’s prediction process (Knight 1975) is the minimal Markov sufficient statistic for any process.
Now, it turns out measure-valued sufficient statistics pop up a lot; I hadn’t realised how many places they occur, despite seeing many objects that were remarkably similar. Most recently, I was writing up POMDP belief states and I was all like, damn that looks like a sufficient statistic — why does it have a funny name? A quick search via my friendly neighbourhood LLM revealed quite a few other places where the same object appears, under different names:
| Field | Name it goes by | The object |
|---|---|---|
| Bayesian statistics | posterior | \(p(\theta\mid x)\); sufficient for every prior (Halmos and Savage 1949; Bahadur 1954) |
| Sequential analysis, Bayesian bandits | belief, Bayes-adaptive state | posterior over the unknown parameter; the state of the Gittins / BAMDP problem |
| Nonlinear filtering, HMMs | the filter, conditional law | \(\pi_t=\mathbb{P}(X_t\in\cdot\mid\mathcal{Y}_t)\) (Kushner 1964; Stratonovich 1960; Zakai 1969); collapses to mean + covariance in the Kalman / exponential-family case |
| Stochastic control | information state, hyperstate | (Åström 1965; Striebel 1965; Kumar and Varaiya 1986); the separation principle acts on it |
| POMDPs, RL | belief state | \(b_t\in\Delta(\mathcal{S})\) — see the belief-MDP reduction |
| Stochastic processes | prediction process | (Knight 1975); the canonical measure-valued Markov sufficient statistic |
| Computational mechanics | causal state, ε-machine | (Shalizi and Crutchfield 2000); the minimal predictively-sufficient partition of histories |
| Graphical models | message, belief, mean parameter | a belief-propagation message is a sufficient statistic for its subtree’s evidence; EP projects it onto an exponential family (Minka 2001; Wainwright and Jordan 2008) |
| Predictive state representations | predictive state | (LittmanSutton2001Predictive?); sufficiency stated over future observable tests rather than latent state |
| Exchangeability | de Finetti mixing measure | the latent “parameter” is itself a random distribution |
To reprise PKD: call it what you want, outside an exponential family that measure has no fixed finite parameterization, so an exactly sufficient finite statistic is off the table.
It seems we manage to act in the world without carrying around a copy of the world in our heads; so if we are doing something like statistical analysis of the world, we must be making some approximations. The question is how cheaply we can approximate that measure while losing as little as we can afford to (whatever that, in context, means).
2 The information bottleneck
Information bottlenecks are one way of formalising this finite-budget summary of a measure-valued sufficient statistic. I ransacked S. Hu et al. (2024) for a recent summary of developments in information bottleneck theory. Unless otherwise stated, the notation and results come from there.
We pick a “relevance” variable \(Y\) that encodes what matters—future observations, labels, a function of \(\theta\), a decision, or the next state in a filter. Then we find a (hopefully compact) representation \(T\) of \(X\) that trades off compressing \(X\) against preserving information about \(Y\):
\[ \min_{p(t\mid x)}\; I(T;X)\;-\;\beta\, I(T;Y). \]
This is a specialization of rate–distortion, where the distortion is chosen to be the Kullback–Leibler mismatch in predicting \(Y\) from \(T\), and \(\beta\!\!>\!0\) is our trade‑off knob, i.e. rate–distortion with an informational metric.
Payoffs:
- If we set \(Y=\theta\) (or a function of it), the optimal \(T\) is an approximately minimal sufficient statistic for inference about \(\theta\).
- If we set \(Y=X_{t+1}\) (or a future payoff) in a filtering/control problem, we get a predictive state: a compressed memory \(T_t\) of the past that’s maximally useful for predicting what we need next.
2.1 “Nearly sufficient” IB
There are at least two ways to be “nearly sufficient” in the IB context, depending on what we want \(Y\) to do.
Information sufficiency (task‑conditional independence). Define \(T\) to be \(Y\)-sufficient if \(I(Y;X\mid T)=0\). Then “\(\varepsilon\)-sufficient” means \(I(Y;X\mid T)\le \varepsilon\). IB is the convex relaxation that searches for \(T\) with small \(I(T;X)\) subject to small \(I(Y;X\mid T)\), via a Lagrange multiplier \(\beta\).
Decision sufficiency. For a loss \(\ell(a,y)\), call \(T\) \(\varepsilon\)-sufficient if the Bayes risk gap using \(T\) instead of \(X\) is \(\le \varepsilon\). For log‑loss this is equivalent to #1; for other losses it’s stricter or looser. (This is where rate–distortion with a bespoke distortion can be a better fit than vanilla IB if the task isn’t log‑loss‑like.)
Pick the notion that matches our downstream risk. If we’re doing Bayesian filtering or SMC, #1 with \(Y\!=\!X_{t+1}\) or \(Y\!=\!\) future rewards is what we want. On the other hand, if we’re doing parameter estimation, #1 with \(Y\!=\!\theta\) (or a function like \(g(\theta)\)) is usually the cleanest.
2.2 Connection to Variational Bayes
The “deep variational IB” (DVIB) parametrises \(p(t\mid x)\) (the encoder) and works with a variational decoder \(q(y\mid t)\) plus a prior \(r(t)\). The training objective
\[ \max\; \mathbb{E}_{p(t\mid x)}[\log q(y\mid t)]\;-\;\beta\,\mathrm{KL}\big(p(t\mid x)\,\Vert\, r(t)\big) \]
Here’s the IB Lagrangian rewritten so we can optimize it with gradients. With \(Y=X\) and \(\beta\!=\!1\), we recover the VAE ELBO; that’s not an accident. ELBO can be seen as an IB objective where the “relevant” variable is the input itself; DVIB is ELBO with the relevance set to our task.
2.3 Filtering and state compression
In Bayes filtering we want a recursively updated \(T_t\) that is small but “task‑sufficient”. In exponential families (e.g. linear-Gaussian) filters exist, e.g. the Kalman filter.
IB gives us a target for awkward models: minimize \(I(T_t; X_{1:t})\) while maximising \(I(T_t; Y_t)\), where \(Y_t\) is whatever we care about next (the next observation, the latent state, the next reward). We can then:
- Parameter‑tracking: We set \(Y_t=\theta\) (or \(g(\theta)\)), turning \(T_t\) into a rolling compressed posterior proxy.
- Predictive control: We can set \(Y_t=X_{t+1}\) or a function of \(X_{t+1}\) and action \(A_t\).
2.4 In neural nets
Each layer in a trained DNN behaves approximately like a nearly sufficient statistic for its task variable, which is a helpful mental picture even if we don’t care about DNNs. The classical IB principle was defined as an optimization problem over encoders \(p(t|x)\): compress \(X\) into \(T\) while keeping as much information about \(Y\) as possible,
\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]
That’s cool, but impractical as written because neural networks are large and complicated. Deep Information Bottleneck parameterizes the encoder and decoder with neural nets and trains them using gradient methods.
- Encoder: \(p_\theta(t|x)\) (often Gaussian; mean and variance predicted by a neural net).
- Decoder: \(q_\phi(y|t)\), a variational approximation to \(p(y|t)\).
- Prior: \(r(t)\), usually standard Gaussian, to make the KL term computable.
So the training objective becomes:
\[ \max_{\theta,\phi}\; \mathbb{E}_{p_\theta(t|x)}[\log q_\phi(y|t)] - \beta\, \text{KL}(p_\theta(t|x)\|r(t)). \]
That’s the Deep Variational IB (DVIB) objective. If we set \(Y=X\), we get the VAE ELBO; if we set \(Y\) to be labels, we get a bottleneck tuned for classification.
The KL penalty is the mechanism of “compression”. It forces the encoder not to keep every pixel of \(X\) but only the degrees of freedom that actually help predict \(Y\).
This gives the encoder a concrete way to approximate minimal sufficient statistics for \(Y\), even in messy, high-dimensional settings.
Shwartz-Ziv and Tishby (2017) famously plotted layer-wise mutual information during NN training (I forget now if they were using IB-aware training, or if this holds generally — TODO check). In their experiments each hidden layer \(T_i\) follows a path in the “information plane” \((I(T;X), I(T;Y))\):
- Phase 1 (fitting): mutual information with \(Y\) shoots up as the layers learn to predict.
- Phase 2 (compression): \(I(T;X)\) falls slowly — the representation forgets irrelevant details while keeping \(I(T;Y)\) high.
The two phases correspond to what we see if we measure gradient statistics: a drift phase (large, directed updates) followed by a diffusion phase (random fluctuations dominate). The diffusion adds noise that automatically compresses representations.
I think this means that in the Deep IB view, SGD itself is a bottleneck machine.* Layers end up very close to the IB theoretical bound, which characterises optimal trade-offs between memory \(I(T;X)\) and relevance \(I(T;Y)\).
2.5 Non-convexity
There are two competing terms in the functional
\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]
\(I(T;X)\) is convex in \(p(t|x)\), while \(I(T;Y)\) is concave in \(p(t|x)\). Their weighted difference is not convex. This means we don’t get the nice guarantees we’d have in rate–distortion theory with convex distortion measures. Instead, the IB Lagrangian can have multiple local minima.
The iterative IB algorithm (the original Blahut–Arimoto-style update rules from Tishby et al. 1999) only converges to a fixed point of the self-consistent equations, not necessarily the global optimum. This is why different runs with different initialisations can produce different clusterings or partitions of the data.
- Classical IB (discrete case): iterative IB converges to locally optimal solutions; sometimes multiple solutions lie on the same “information curve,” but we don’t know if we’ve found the best.
- Deep IB (DVIB): we’ve compounded the problem: now the encoder/decoder are parameterized by neural nets, so we inherit both the non-convexity of the IB objective and the non-convexity of deep learning. Optimizers (SGD, Adam) will tend to settle in some basin that’s “good enough” but not provably optimal.
- Information-plane diagnostics (Schwartz-Ziv & Tishby): One of the reasons they use mutual information plots is to check whether the layers land near the theoretical IB bound. But the fact that different training runs converge to slightly different places reflects the non-convexity.
Researchers have tried to tame this in several ways:
Convex relaxations / surrogates: Variants like the Hilbert–Schmidt independence criterion bottleneck (HSIC-IB) replace mutual information with a kernel-based dependence measure, which yields a convex or easier objective for optimization.
Variational bounds: DVIB itself is a relaxation — we replace intractable mutual information terms with variational lower bounds. These are optimizable with SGD, but they don’t restore convexity; they just make the problem numerically tractable.
Deficiency and f-divergence bottlenecks: Alternatives like the variational deficiency bottleneck (VDB) swap in divergences with better optimization landscapes in some cases.
Annealing β / continuation: A common heuristic is to slowly increase \(\beta\), following the IB curve and avoiding bad local minima.
3 Other nearly-sufficient statistics approaches
Rate–Distortion (RD). Classic source coding. We choose a distortion \(d\) and find the minimal rate \(R(D)\). When our “task” is truly captured by a metric on \(x\)–\(t\) pairs, RD is the cleanest. IB is RD with a data‑driven distortion that says: “Two inputs are close if they induce similar posteriors over \(Y\).”
Coresets / inducing sets. These reduce dataset size rather than feature dimension: we keep a small weighted subset that approximates the full objective. Great for bounded memory and streaming, orthogonal to the IB trade‑off — we can do IB on top of a coreset. See data summarization
MDL (Minimum Description Length). Another compression‑flavoured philosophy. MDL trades codelength of model plus residuals; IB trades mutual information for a chosen relevance. It’s related to Kolmogorov complexity.
4 Also singular learning theory
See Singular learning theory for an implicit approach to “nearly sufficient” statistics, which is more focused on the geometry of the likelihood function and the behaviour of the marginal likelihood in singular models. This comes out rather different for unclear reason.

