Nearly sufficient statistics and information bottlenecks
Also, rate distortion
2018-03-13 — 2025-09-11
🏗
I’m working through a small realisation, for my own interest, which has been helpful in my understanding of variational Bayes, specifically in relating it to non-Bayesian variational inference, and also to sequential Monte Carlo.
Sufficient statistics are the Garden of Eden of inference: we compress the data to a low‑dimensional \(T(x)\) and lose nothing that matters for estimating \(\theta\). Formally, \(T\) is sufficient iff the conditional law of the data given \(T\) is parameter‑free,
\[ p(x\mid T(x),\theta)=p(x\mid T(x)), \]
or equivalently, the Fisher–Neyman factorization \(p(x;\theta)=h(x)\,g(T(x);\theta)\).
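For orientation, the textbook example: for i.i.d. \(x_1,\dots,x_n\sim\mathcal{N}(\theta,\sigma^2)\) with \(\sigma^2\) known, the likelihood factorizes as
\[ p(x;\theta)=\underbrace{(2\pi\sigma^2)^{-n/2}\exp\!\Big(-\tfrac{1}{2\sigma^2}\textstyle\sum_i x_i^2\Big)}_{h(x)}\;\underbrace{\exp\!\Big(\tfrac{\theta}{\sigma^2}\textstyle\sum_i x_i-\tfrac{n\theta^2}{2\sigma^2}\Big)}_{g(T(x);\theta)}, \]
so \(T(x)=\sum_i x_i\), a single number however large \(n\) is, is sufficient for \(\theta\).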
Eden is small. By the Pitman–Koopman–Darmois theorem, fixed‑dimensional sufficiency across all sample sizes essentially requires (indeed, characterizes) exponential families. Outside Eden (even for simple parametric models, mixtures, and anything modern, including and especially neural networks) we should not expect a finite‑dimensional \(T\) that is exactly sufficient. If we try to build such a thing anyway, we might find \(T\) grows linearly in \(n\), in which case we have bought ourselves a glorified copy of the dataset, and our statistic is not really “compressing” the knowledge in the dataset.
So, there is no sufficient statistic in most interesting cases. But how closely can we approach that ideal while losing as little information as possible? Information bottlenecks are one way of formalising this.
1 The information bottleneck
I ransacked S. Hu et al. (2024) for a recent summary of the action in information bottleneck theory. Unless otherwise stated, notation and results come from there.
We pick a “relevance” variable \(Y\) that encodes what matters—future observations, labels, a function of \(\theta\), a decision, the next state in a filter. Then, we find a (hopefully compact) representation \(T\) of \(X\) that trades off compressing \(X\) against preserving information about \(Y\):
\[ \min_{p(t\mid x)}\; I(T;X)\;-\;\beta\, I(T;Y). \]
This is a specialization of rate–distortion in which the distortion is the Kullback–Leibler mismatch in predicting \(Y\) from \(T\), and \(\beta>0\) is our trade‑off knob: rate–distortion with an informational metric.
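Making that “informational metric” explicit: the effective distortion in the IB problem is the mismatch between what \(x\) and its summary \(t\) predict about \(Y\),
\[ d_{\mathrm{IB}}(x,t)=\mathrm{KL}\big(p(y\mid x)\,\Vert\,p(y\mid t)\big), \]
and (per Tishby et al. 1999) the optimal encoder takes the Boltzmann form familiar from rate–distortion theory,
\[ p(t\mid x)=\frac{p(t)}{Z(x,\beta)}\exp\!\big(-\beta\, d_{\mathrm{IB}}(x,t)\big). \]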
Payoffs:
- If we set \(Y=\theta\) (or a function of it), the optimal \(T\) is an approximately minimal sufficient statistic for inference about \(\theta\).
- If we set \(Y=X_{t+1}\) (or a future payoff) in a filtering/control problem, we get a predictive state: a compressed memory \(T_t\) of the past that is maximally useful for predicting what we need next.
1.1 “Nearly sufficient” IB
There are at least two ways to say “nearly sufficient”, depending on what we want \(Y\) to do.
1. Information sufficiency (task‑conditional independence). Define \(T\) to be \(Y\)-sufficient if \(I(Y;X\mid T)=0\). Then “\(\varepsilon\)-sufficient” means \(I(Y;X\mid T)\le \varepsilon\). IB is the Lagrangian relaxation that searches for \(T\) with small \(I(T;X)\) subject to small \(I(Y;X\mid T)\), via a Lagrange multiplier \(\beta\).
2. Decision sufficiency. For a loss \(\ell(a,y)\), call \(T\) \(\varepsilon\)-sufficient if the Bayes risk gap from using \(T\) instead of \(X\) is \(\le \varepsilon\). For log‑loss this is equivalent to #1; for other losses it can be stricter or looser. (This is where rate–distortion with a bespoke distortion can be a better fit than vanilla IB if the task is not log‑loss‑like.)
Pick the notion that matches your downstream risk. If you’re doing Bayesian filtering or SMC, #1 with \(Y\!=\!X_{t+1}\) or \(Y\!=\!\) future rewards is the right primitive; if you’re doing parameter estimation, #1 with \(Y\!=\!\theta\) (or a function like \(g(\theta)\)) is usually the cleanest.
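One identity ties these together. When \(T\) is a (possibly stochastic) function of \(X\) alone, so that \(Y\to X\to T\) is a Markov chain, the chain rule gives
\[ I(Y;X)=I(Y;T)+I(Y;X\mid T), \]
so demanding \(I(Y;X\mid T)\le\varepsilon\) is the same as demanding \(I(T;Y)\ge I(X;Y)-\varepsilon\): a nearly sufficient \(T\) retains all but \(\varepsilon\) nats of what \(X\) knows about \(Y\).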
2 Connection to Variational Bayes
The “deep variational IB” (DVIB) parameterises \(p(t\mid x)\) (the encoder) and works with a variational decoder \(q(y\mid t)\) plus a prior \(r(t)\). The training objective
\[ \max\; \mathbb{E}_{p(t\mid x)}[\log q(y\mid t)]\;-\;\beta\,\mathrm{KL}\big(p(t\mid x)\,\Vert\, r(t)\big) \]
is the IB Lagrangian rewritten so we can optimise it with gradients. With \(Y=X\) and \(\beta=1\), we recover the VAE ELBO; that’s not an accident. The ELBO can be seen as an IB objective where the “relevant” variable is the input itself; DVIB is the ELBO with the relevance set to our task.
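Spelled out: with \(Y=X\) and \(\beta=1\) the objective reads
\[ \mathbb{E}_{p(t\mid x)}[\log q(x\mid t)]-\mathrm{KL}\big(p(t\mid x)\,\Vert\,r(t)\big), \]
which is the evidence lower bound, with \(p(t\mid x)\), \(q(x\mid t)\) and \(r(t)\) playing the usual VAE roles of encoder, decoder and prior.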
2.1 Filtering and state compression
In Bayesian filtering we want a recursively updated \(T_t\) that is small but “task‑sufficient”. In some exponential-family models (e.g. linear‑Gaussian systems) exact finite‑dimensional recursive filters exist, the Kalman filter being the canonical example; a minimal sketch appears after the list below.
IB gives us a target for awkward models: minimise \(I(T_t; X_{1:t})\) while maximising \(I(T_t; Y_t)\), where \(Y_t\) is whatever we care about next (the next observation, the latent state, the next reward). We can then choose:
- Parameter‑tracking: let \(Y_t=\theta\) (or \(g(\theta)\)), turning \(T_t\) into a rolling compressed posterior proxy.
- Predictive control: let \(Y_t=X_{t+1}\) or a function of \(X_{t+1}\) and action \(A_t\).
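As a reference point for what exact recursion looks like, here is a minimal Kalman filter step in numpy (variable names are my own): the pair (mean, covariance) is a fixed-dimensional statistic of the whole history \(y_{1:t}\) that is exactly sufficient for prediction in the linear‑Gaussian model.

```python
import numpy as np

def kalman_step(m, P, y, A, C, Q, R):
    """One predict/update cycle for x_t = A x_{t-1} + N(0, Q), y_t = C x_t + N(0, R).

    The pair (m, P) -- posterior mean and covariance -- is the entire memory:
    a fixed-dimensional statistic of y_{1:t} that is exactly sufficient for
    prediction in this linear-Gaussian model.
    """
    # Predict
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update with observation y
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    m_new = m_pred + K @ (y - C @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new
```

Outside such conjugate settings we generally do not get an exact finite-dimensional recursion, which is where the IB target earns its keep.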
2.2 In neural nets
Each layer in a trained DNN behaves approximately like a nearly sufficient statistic for its task variable, which is a helpful mental picture even if you don’t care about DNNs. The classical IB principle was defined as an optimization problem on encoders \(p(t|x)\): compress \(X\) into \(T\) while keeping as much information about \(Y\) as possible,
\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]
That’s clean, but impractical in raw form: the mutual-information terms are intractable for high-dimensional data and large networks. The Deep Information Bottleneck parameterizes that encoder (and a decoder) with neural nets, and trains them with gradient methods.
- Encoder: \(p_\theta(t|x)\) (often Gaussian, mean/variance predicted by a neural net).
- Decoder: \(q_\phi(y|t)\), a variational approximation to \(p(y|t)\).
- Prior: \(r(t)\), usually standard Gaussian, to make the KL term computable.
So the training objective becomes
\[ \max_{\theta,\phi}\; \mathbb{E}_{p_\theta(t|x)}[\log q_\phi(y|t)] - \beta\, \text{KL}(p_\theta(t|x)\|r(t)). \]
That’s the Deep Variational IB (DVIB) objective. If we set \(Y=X\), we get the VAE ELBO; if we set \(Y\) to be labels, we get a bottleneck tuned for classification.
The KL penalty is the mechanism of “compression.” It forces the encoder not to carry every pixel of \(X\), but only those degrees of freedom that actually help predict \(Y\). In practice it behaves like information-aware dropout: we add noise to our hidden representation until only the task-relevant features survive.
This gives the encoder a concrete way to approximate minimal sufficient statistics for \(Y\), even in messy, high-dimensional settings.
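To make the objective concrete, here is a minimal sketch of a DVIB loss in PyTorch, assuming a diagonal-Gaussian encoder whose mean and log-variance come from a small MLP, a softmax decoder, and a standard-normal prior \(r(t)\). The class name, layer sizes, and hyperparameters are placeholders of mine, not anyone’s reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVIB(nn.Module):
    """Minimal deep variational information bottleneck sketch.

    Encoder p(t|x): diagonal Gaussian with learned mean/log-variance.
    Decoder q(y|t): softmax classifier. Prior r(t): standard normal.
    """
    def __init__(self, x_dim, t_dim, n_classes):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2 * t_dim))
        self.dec = nn.Linear(t_dim, n_classes)

    def loss(self, x, y, beta=1e-3):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        # Reparameterised sample t ~ p(t|x)
        t = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # One-sample estimate of E_{p(t|x)}[log q(y|t)]
        log_lik = -F.cross_entropy(self.dec(t), y)
        # KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1).mean()
        return -(log_lik - beta * kl)   # minimise the negative IB objective
```

Setting \(\beta\) near zero recovers a plain (stochastic) classifier; cranking it up squeezes more of \(X\) out of \(T\).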
Shwartz-Ziv and Tishby (2017) famously plotted layer-wise mutual information during training. In their experiments every hidden layer \(T_i\) follows a path in the “information plane” \((I(T;X), I(T;Y))\):
- Phase 1 (fitting): mutual info with \(Y\) shoots up as layers learn to predict.
- Phase 2 (compression): \(I(T;X)\) falls slowly — the representation forgets irrelevant details, while keeping \(I(T;Y)\) high.
The two phases correspond to what we see if we measure gradient statistics: a drift phase (large, directed updates) followed by a diffusion phase (random fluctuations dominate). The diffusion adds noise that automatically compresses representations.
So in the Deep IB view, SGD itself is a bottleneck machine. Layers end up very close to the theoretical IB bound, which characterizes the optimal trade-off between memory \(I(T;X)\) and relevance \(I(T;Y)\).
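For what it’s worth, the mutual informations behind such plots are usually crude plug-in estimates on discretised activations. A sketch of that kind of estimator (my own simplification, with hypothetical function names, not their code): bin each coordinate, treat each distinct row of bin indices as a symbol, and compute the empirical MI.

```python
import numpy as np

def discretize(t, n_bins=30):
    """Bin each activation coordinate, then treat each distinct row of bin
    indices as one discrete symbol."""
    edges = np.linspace(t.min(), t.max(), n_bins + 1)
    codes = np.digitize(t, edges)
    return np.unique(codes, axis=0, return_inverse=True)[1].ravel()

def mutual_information(a, b):
    """Plug-in MI estimate (in nats) between two discrete label vectors.
    Crude and biased, but roughly what information-plane plots rely on."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

# Information-plane point for a layer with activations T, inputs X, labels Y:
#   I_TX = mutual_information(discretize(T), discretize(X))
#   I_TY = mutual_information(discretize(T), Y)
```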
2.3 Non-convexity
There are two competing terms in the functional
\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]
\(I(T;X)\) is convex in \(p(t|x)\), while \(I(T;Y)\) is concave in \(p(t|x)\). Their weighted difference is not convex. This means we don’t get the nice guarantees we’d have in rate–distortion theory with convex distortion measures. Instead, the IB Lagrangian can have multiple local minima.
The iterative IB algorithm (the original Blahut–Arimoto-style update rules from Tishby et al. 1999) only converges to a fixed point of the self-consistent equations, not necessarily the global optimum; the updates are sketched in code after the list below. This is exactly why different runs with different initialisations can give different clusterings or partitions of the data.
- Classical IB (discrete case): iterative IB converges to locally optimal solutions; sometimes multiple solutions lie on the same “information curve,” but you don’t know if you’ve found the best.
- Deep IB (DVIB): you’ve compounded the problem: now the encoder/decoder are parameterised by neural nets, so you inherit both the non-convexity of the IB objective and the non-convexity of deep learning. Optimisers (SGD, Adam) will settle in some basin that’s “good enough” but not provably optimal.
- Information-plane diagnostics (Shwartz-Ziv & Tishby): one of the reasons they use mutual information plots is to check whether the layers land near the theoretical IB bound. But the fact that different trainings converge to slightly different places reflects the non-convexity.
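The self-consistent updates themselves are easy to write down for a discrete joint \(p(x,y)\). A sketch (my own, following the Tishby et al. 1999 equations; the function name and the `q_init` warm-start argument are my additions):

```python
import numpy as np

def iterative_ib(p_xy, n_t, beta, n_iter=500, q_init=None, seed=0):
    """Blahut-Arimoto-style iterative IB for a discrete joint p(x, y).

    Alternates the self-consistent updates; converges to a fixed point,
    not necessarily the global optimum.
    """
    eps = 1e-12
    rng = np.random.default_rng(seed)
    p_x = p_xy.sum(1)                                   # p(x)
    p_y_x = p_xy / np.clip(p_x[:, None], eps, None)     # p(y|x)
    q_t_x = (q_init if q_init is not None
             else rng.dirichlet(np.ones(n_t), size=p_xy.shape[0]))  # p(t|x)
    for _ in range(n_iter):
        q_t = q_t_x.T @ p_x                             # p(t)
        q_y_t = (q_t_x * p_x[:, None]).T @ p_y_x        # sum_x p(x,t) p(y|x)
        q_y_t /= np.clip(q_t[:, None], eps, None)       # p(y|t)
        # d(x, t) = KL( p(y|x) || p(y|t) )
        log_ratio = (np.log(np.clip(p_y_x, eps, None))[:, None, :]
                     - np.log(np.clip(q_y_t, eps, None))[None, :, :])
        d = np.einsum('xy,xty->xt', p_y_x, log_ratio)
        # Encoder update: p(t|x) proportional to p(t) exp(-beta * d(x, t))
        q_t_x = q_t[None, :] * np.exp(-beta * (d - d.min(1, keepdims=True)))
        q_t_x /= q_t_x.sum(1, keepdims=True)
    return q_t_x, q_y_t
```

Different seeds can and do land at different fixed points, which is the non-convexity showing up in practice.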
Researchers have tried to tame this in several ways:
- Convex relaxations / surrogates: variants like the Hilbert–Schmidt independence criterion bottleneck (HSIC-IB) replace mutual information with a kernel-based dependence measure, which gives an objective that is easier to estimate and optimise.
- Variational bounds: DVIB itself is a relaxation; we replace the intractable mutual-information terms with variational bounds (a lower bound on \(I(T;Y)\), an upper bound on \(I(T;X)\)). These are optimisable with SGD, but they don’t restore convexity, they just make the problem numerically tractable.
- Deficiency and f-divergence bottlenecks: alternatives like the variational deficiency bottleneck (VDB) swap in divergences with better optimisation landscapes in some cases.
- Annealing \(\beta\) / continuation: a common heuristic is to slowly increase \(\beta\), following the IB curve and avoiding bad local minima; see the sketch after this list.
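A sketch of that continuation heuristic, reusing the `iterative_ib` routine sketched above and a made-up toy joint \(p(x,y)\) (both are illustrative assumptions of mine, not a canonical recipe):

```python
import numpy as np

# Continuation along the IB curve: increase beta slowly, warm-starting each
# solve from the previous encoder so we follow one branch of solutions.
rng = np.random.default_rng(1)
p_xy = rng.dirichlet(np.ones(5), size=20)   # toy joint: 20 x-values, 5 y-values
p_xy /= p_xy.sum()                          # normalise to a joint distribution

q_t_x = None
solutions = []
for beta in np.geomspace(0.1, 100.0, num=20):
    q_t_x, q_y_t = iterative_ib(p_xy, n_t=8, beta=beta, q_init=q_t_x)
    solutions.append((beta, q_t_x.copy()))
```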
3 Other nearly-sufficient statistics approaches
Rate–Distortion (RD). Classic source coding. We choose a distortion \(d\) and find the minimal rate \(R(D)\). When our “task” is truly captured by a metric on \(x\)–\(t\) pairs, RD is the cleanest. IB is RD with a data‑driven distortion that says: “Two inputs are close if they induce similar posteriors over \(Y\).”
Coresets / inducing sets. These reduce dataset size rather than feature dimension: we keep a small weighted subset that approximates the full objective. Great for bounded memory and streaming, and orthogonal to the IB trade‑off; we can do IB on top of a coreset. See data summarization.
MDL (Minimum Description Length). Another compression‑flavoured philosophy. MDL trades codelength of model plus residuals; IB trades mutual information for a chosen relevance. Something something Kolmogorov complexity.