Nearly sufficient statistics and information bottlenecks

Also, rate distortion

2018-03-13 — 2025-09-11

approximation
estimator distribution
functional analysis
information
linear algebra
metrics
model selection
optimization
probabilistic algorithms
probability
sparser than thou
statistics
uncertainty

🏗

I’m working through a small realisation, for my own interest, which has been helpful in my understanding of variational Bayes; specifically, relating it to non-Bayesian variational inference, and also to sequential Monte Carlo.

Sufficient statistics are the Garden of Eden of inference: we compress the data to a low‑dimensional \(T(x)\) and lose nothing that matters for estimating \(\theta\). Formally, \(T\) is sufficient iff the conditional law of the data given \(T\) is parameter‑free,

\[ p(x\mid T(x),\theta)=p(x\mid T(x)), \]

or equivalently, the Fisher–Neyman factorization \(p(x;\theta)=h(x)\,g(T(x);\theta)\).
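For instance, for \(x=(x_1,\dots,x_n)\) i.i.d. \(\mathcal N(\theta,1)\), the factorization is explicit:

\[ p(x;\theta)=\underbrace{(2\pi)^{-n/2}e^{-\frac12\sum_i x_i^2}}_{h(x)}\;\underbrace{e^{\theta\sum_i x_i-\frac{n}{2}\theta^2}}_{g(T(x);\theta)},\qquad T(x)=\sum_i x_i . \]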

Eden is small. By the Pitman–Koopman–Darmois theorem, fixed‑dimensional sufficiency across all sample sizes essentially requires (let’s say, characterizes) exponential families. Outside Eden, even for simple parametric models, mixtures, and anything modern (including and especially neural networks), we should not expect a finite‑dimensional \(T\) that is exactly sufficient. If we try to build one anyway, we typically find that \(T\) grows linearly in \(n\), in which case we have bought ourselves a glorified copy of the dataset, and our statistic is not really “compressing” the knowledge in it.

So, there is no finite-dimensional sufficient statistic in most interesting cases. But how closely can we approach that ideal, and how little information can we get away with losing? Information bottlenecks are one way of formalising this.

1 The information bottleneck

Figure 2: Development of information bottleneck through history (S. Hu et al. 2024)

I ransacked S. Hu et al. (2024), a recent survey of developments in information bottleneck theory. Unless otherwise stated, notation and results come from there.

Information bottleneck reframes “find a nearly sufficient statistic” as a variational problem: choose a relevance variable \(Y\) (what matters), then learn a representation \(T(X)\) that retains information about \(Y\) while discarding information about \(X\). Formally, minimise \(I(T;X)-\beta I(T;Y)\). With \(Y=\theta\) this yields approximately minimal sufficient statistics for inference; with \(Y=X_{t+1}\) it yields a predictive state for filtering. This is a special case of rate–distortion with a data‑driven distortion, and it gives a principled meaning to “nearly sufficient”: small \(I(Y;X\mid T)\) at controlled memory \(I(T;X)\).

We pick a “relevance” variable \(Y\) that encodes what matters—future observations, labels, a function of \(\theta\), a decision, the next state in a filter. Then, we find a (hopefully compact) representation \(T\) of \(X\) that trades off compressing \(X\) against preserving information about \(Y\):

\[ \min_{p(t\mid x)}\; I(T;X)\;-\;\beta\, I(T;Y). \]

This is a specialization of rate–distortion in which the distortion is the Kullback–Leibler mismatch in predicting \(Y\) from \(T\), and \(\beta>0\) is our trade‑off knob: rate–distortion with an informational metric. (A small numerical sketch follows the payoff list below.)

Payoffs:

  • If we set \(Y=\theta\) (or a function of it), the optimal \(T\) is an approximately minimal sufficient statistic for inference about \(\theta\).
  • If we set \(Y=X_{t+1}\) (or a future payoff) in a filtering/control problem, we get a predictive state: a compressed memory \(T_t\) of the past that is maximally useful for predicting what we need next.
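Here is a small numerical sketch of the objective (mine, not from the survey): for discrete \(X, Y, T\) and a given stochastic encoder \(p(t\mid x)\), we can evaluate the two terms directly. The toy joint, the encoder, and the helper names `mutual_information` and `ib_lagrangian` are made up for illustration.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in bits, from a joint distribution array p(a, b)."""
    pa = p_ab.sum(1, keepdims=True)
    pb = p_ab.sum(0, keepdims=True)
    nz = p_ab > 0
    return float((p_ab[nz] * np.log2(p_ab[nz] / (pa @ pb)[nz])).sum())

def ib_lagrangian(p_xy, p_t_given_x, beta):
    """I(T;X) - beta * I(T;Y) for encoder p(t|x) applied to joint p(x, y)."""
    p_x = p_xy.sum(1)
    p_xt = p_t_given_x * p_x[:, None]     # joint p(x, t)
    p_ty = p_t_given_x.T @ p_xy           # joint p(t, y), using T independent of Y given X
    return mutual_information(p_xt) - beta * mutual_information(p_ty)

# toy: X uniform on 4 values, Y a noisy function of X, T merges X into 2 groups
p_xy = np.array([[0.22, 0.03], [0.03, 0.22], [0.22, 0.03], [0.03, 0.22]])
encoder = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # deterministic p(t|x)
print(ib_lagrangian(p_xy, encoder, beta=5.0))
```

Sweeping `beta` in this sketch traces out how aggressively the encoder is allowed to remember \(X\) relative to how much it must remember about \(Y\).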

1.1 “Nearly sufficient” IB

There are at least two ways to say “nearly sufficient”, depending on what we want \(Y\) to do.

  1. Information sufficiency (task‑conditional independence). Define \(T\) to be \(Y\)-sufficient if \(I(Y;X\mid T)=0\). Then “\(\varepsilon\)-sufficient” means \(I(Y;X\mid T)\le \varepsilon\). IB is the convex relaxation that searches for \(T\) with small \(I(T;X)\) subject to small \(I(Y;X\mid T)\), via a Lagrange multiplier \(\beta\).

  2. Decision sufficiency. For a loss \(\ell(a,y)\), call \(T\) \(\varepsilon\)-sufficient if the Bayes risk gap using \(T\) instead of \(X\) is \(\le \varepsilon\). For log‑loss this is equivalent to #1; for other losses it’s stricter or looser. (This is where rate–distortion with a bespoke distortion can be a better fit than vanilla IB if the task is not log‑loss‑like.)

Pick the notion that matches your downstream risk. If you’re doing Bayesian filtering or SMC, #1 with \(Y\!=\!X_{t+1}\) or \(Y\!=\!\) future rewards is the right primitive; if you’re doing parameter estimation, #1 with \(Y\!=\!\theta\) (or a function like \(g(\theta)\)) is usually the cleanest.
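It is worth spelling out why the relevance term in IB controls notion #1. Since \(T\) is computed from \(X\) alone, \(Y\to X\to T\) is a Markov chain, and the chain rule for mutual information gives

\[ I(Y;X\mid T)\;=\;I\big(Y;(X,T)\big)-I(Y;T)\;=\;I(Y;X)-I(Y;T), \]

so \(T\) is \(\varepsilon\)-sufficient exactly when \(I(T;Y)\ge I(X;Y)-\varepsilon\): maximising \(I(T;Y)\) is minimising the sufficiency gap, up to the constant \(I(X;Y)\).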

2 Connection to Variational Bayes

The “deep variational IB” (DVIB) parameterises \(p(t\mid x)\) (the encoder) and works with a variational decoder \(q(y\mid t)\) plus a prior \(r(t)\). The training objective

\[ \max\; \mathbb{E}_{p(t\mid x)}[\log q(y\mid t)]\;-\;\beta\,\mathrm{KL}\big(p(t\mid x)\,\Vert\, r(t)\big) \]

is the IB Lagrangian rewritten so we can optimise it with gradients. With \(Y=X\) and \(\beta=1\), we recover the VAE ELBO; that’s not an accident. The ELBO can be seen as an IB objective where the “relevant” variable is the input itself; DVIB is the ELBO with the relevance set to our task.
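To see the correspondence, substitute \(Y=X\) and \(\beta=1\): the objective becomes

\[ \mathbb{E}_{p(t\mid x)}[\log q(x\mid t)]\;-\;\mathrm{KL}\big(p(t\mid x)\,\Vert\, r(t)\big), \]

which is the ELBO of a latent‑variable model with prior \(r(t)\), likelihood \(q(x\mid t)\) and approximate posterior \(p(t\mid x)\).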

2.1 Filtering and state compression

In Bayes filtering we want a recursively updated \(T_t\) that is small but “task‑sufficient”. In exponential-family models (e.g. the linear‑Gaussian case) exact finite‑dimensional recursive filters exist, e.g. the Kalman filter.
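To make the exponential-family case concrete, here is a minimal sketch of the 1‑D linear‑Gaussian case; the model, noise levels, and the function name `kalman_filter` are my own toy choices. The point is that the running pair \((m_t, P_t)\) is a fixed-size statistic of the whole history, and here it is exactly sufficient for prediction.

```python
import numpy as np

def kalman_filter(ys, q=0.1, r=0.5, m0=0.0, p0=1.0):
    """Filter observations ys for the model:
       state: s_t = s_{t-1} + N(0, q),   obs: y_t = s_t + N(0, r)."""
    m, P = m0, p0
    means, variances = [], []
    for y in ys:
        # predict: the state is a random walk, so the mean is unchanged
        P = P + q
        # update with the new observation
        k = P / (P + r)          # Kalman gain
        m = m + k * (y - m)
        P = (1.0 - k) * P
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances)

# simulate a trajectory, then filter it
rng = np.random.default_rng(0)
states = np.cumsum(rng.normal(scale=np.sqrt(0.1), size=200))
ys = states + rng.normal(scale=np.sqrt(0.5), size=200)
means, variances = kalman_filter(ys)
print(np.mean((means - states) ** 2), np.mean((ys - states) ** 2))  # filtered vs raw error
```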

IB gives us a target for awkward models: minimise \(I(T_t; X_{1:t})\) while maximising \(I(T_t; Y_t)\), where \(Y_t\) is whatever we care about next (the next observation, the latent state, the next reward). We can then:

  • Parameter‑tracking: let \(Y_t=\theta\) (or \(g(\theta)\)), turning \(T_t\) into a rolling compressed posterior proxy.
  • Predictive control: let \(Y_t=X_{t+1}\) or a function of \(X_{t+1}\) and action \(A_t\).

2.2 In neural nets

Figure 3: Mutual information evolution in DNNs (S. Hu et al. 2024)

Each layer in a trained DNN behaves approximately like a nearly sufficient statistic for its task variable, which is a helpful mental picture even if you don’t care about DNNs. The classical IB principle was defined as an optimization problem over encoders \(p(t|x)\): compress \(X\) into \(T\) while keeping as much information about \(Y\) as possible,

\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]

That’s clean, but impractical in raw form: for big networks and high-dimensional data the mutual information terms are intractable. The Deep Information Bottleneck parameterizes the encoder and decoder with neural nets and trains them with gradient methods.

  • Encoder: \(p_\theta(t|x)\) (often Gaussian, mean/variance predicted by a neural net).
  • Decoder: \(q_\phi(y|t)\), a variational approximation to \(p(y|t)\).
  • Prior: \(r(t)\), usually standard Gaussian, to make the KL term computable.

So the training objective becomes

\[ \max_{\theta,\phi}\; \mathbb{E}_{p_\theta(t|x)}[\log q_\phi(y|t)] - \beta\, \text{KL}(p_\theta(t|x)\|r(t)). \]

That’s the Deep Variational IB (DVIB) objective. If we set \(Y=X\), we get the VAE ELBO; if we set \(Y\) to be labels, we get a bottleneck tuned for classification.

The KL penalty is the mechanism of “compression.” It forces the encoder not to carry every pixel of \(X\), but only those degrees of freedom that actually help predict \(Y\). In practice it behaves like information-aware dropout: we add noise to our hidden representation until only the task-relevant features survive.

This gives the encoder a concrete way to approximate minimal sufficient statistics for \(Y\), even in messy, high-dimensional settings.
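Here is a minimal DVIB sketch in PyTorch; it is my own toy, not the survey’s code, and the architecture, dimensions, data, and the class name `DVIB` are arbitrary illustrative choices. The loss is the objective above: a reparameterised Gaussian encoder, a softmax decoder for \(q_\phi(y\mid t)\), and an analytic KL to a standard normal prior \(r(t)\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DVIB(nn.Module):
    def __init__(self, x_dim=20, t_dim=4, n_classes=3, beta=1e-2):
        super().__init__()
        self.beta = beta
        self.encoder = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(),
                                     nn.Linear(64, 2 * t_dim))  # outputs (mu, log_var)
        self.decoder = nn.Linear(t_dim, n_classes)              # logits for q_phi(y|t)

    def loss(self, x, y):
        mu, log_var = self.encoder(x).chunk(2, dim=-1)
        t = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # reparameterised sample of T
        nll = F.cross_entropy(self.decoder(t), y)                  # -E[log q(y|t)]
        # KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch
        kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(-1).mean()
        return nll + self.beta * kl

# toy run on random data
torch.manual_seed(0)
x, y = torch.randn(128, 20), torch.randint(0, 3, (128,))
model = DVIB()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = model.loss(x, y)
    loss.backward()
    opt.step()
print(float(model.loss(x, y)))
```

Turning `beta` up squeezes the bottleneck harder; at `beta=0` the KL penalty vanishes and the encoder is free to memorise \(X\).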

Shwartz-Ziv and Tishby (2017) famously plotted layer-wise mutual informations during training. In their experiments every hidden layer \(T_i\) follows a path in the “information plane” \((I(T;X), I(T;Y))\):

  • Phase 1 (fitting): mutual info with \(Y\) shoots up as layers learn to predict.
  • Phase 2 (compression): \(I(T;X)\) falls slowly — the representation forgets irrelevant details, while keeping \(I(T;Y)\) high.

The two phases correspond to what we see if we measure gradient statistics: a drift phase (large, directed updates) followed by a diffusion phase (random fluctuations dominate). On this account, the diffusion adds noise that automatically compresses representations.

So in the Deep IB view, SGD itself is a bottleneck machine: layers end up close to the theoretical IB bound that characterizes optimal trade-offs between memory \(I(T;X)\) and relevance \(I(T;Y)\), at least in the experiments reported there (Saxe et al. 2019 contest how general this picture is).
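A crude version of the information-plane measurement can be sketched with binning, in the spirit of their estimator but not their code; the helper names, the bin count, and the fake “activations” below are mine, and it assumes \(X\) takes finitely many distinct values (as in their synthetic experiments).

```python
import numpy as np

def mi_from_codes(a, b):
    """Mutual information (in bits) between two integer-coded discrete variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum())

def information_plane_point(T, x_codes, y_codes, n_bins=30):
    """Bin each activation coordinate, hash the bin-tuple to a code, return (I(T;X), I(T;Y))."""
    edges = np.linspace(T.min(), T.max(), n_bins + 1)
    binned = np.digitize(T, edges)                     # (n_samples, t_dim) integer bins
    _, t_codes = np.unique(binned, axis=0, return_inverse=True)
    t_codes = t_codes.ravel()
    return mi_from_codes(t_codes, x_codes), mi_from_codes(t_codes, y_codes)

# toy usage with fake "activations" for 12 distinct inputs
rng = np.random.default_rng(0)
x_codes = rng.integers(0, 12, size=2000)
y_codes = x_codes % 2                                  # label is a function of the input
T = np.tanh(rng.normal(size=(12, 5)))[x_codes] + 0.1 * rng.normal(size=(2000, 5))
print(information_plane_point(T, x_codes, y_codes))
```

Estimating mutual information this way is notoriously sensitive to the binning, which is part of what the Saxe et al. critique turns on.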

2.3 Non-convexity

There are two competing terms in the functional

\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]

With \(p(x,y)\) fixed, each term on its own is well behaved: both \(I(T;X)\) and \(I(T;Y)\) are convex functions of the encoder \(p(t|x)\). Their weighted difference is a difference of convex functions, and so not convex in general. This means we don’t get the nice guarantees we’d have in rate–distortion theory with a fixed distortion measure, where the optimisation is convex. Instead, the IB Lagrangian can have multiple local minima.

The iterative IB algorithm (the original Blahut–Arimoto-style update rules from Tishby, Pereira, and Bialek 2000) only converges to a fixed point of the self-consistent equations, not necessarily the global optimum; a sketch of these updates follows the list below. This is exactly why different runs with different initialisations can give different clusterings or partitions of the data.

  • Classical IB (discrete case): iterative IB converges to locally optimal solutions; sometimes multiple solutions lie on the same “information curve,” but you don’t know if you’ve found the best.
  • Deep IB (DVIB): you’ve compounded the problem: now the encoder/decoder are parameterised by neural nets, so you inherit both the non-convexity of the IB objective and the non-convexity of deep learning. Optimisers (SGD, Adam) will settle in some basin that’s “good enough” but not provably optimal.
  • Information-plane diagnostics (Shwartz-Ziv & Tishby): one of the reasons they use mutual information plots is to check whether the layers land near the theoretical IB bound. But the fact that different trainings converge to slightly different places reflects the non-convexity.
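Here is the promised sketch of the self-consistent updates for small discrete problems; the function name `iterative_ib`, the initialisation, and the toy joint are mine. Rerunning with different seeds is an easy way to see the multiple fixed points in action.

```python
import numpy as np

def iterative_ib(p_xy, n_t, beta, n_iter=500, seed=0, eps=1e-12):
    """Blahut-Arimoto-style iteration for discrete IB: returns an encoder p(t|x)."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(1)                                     # p(x)
    p_y_given_x = p_xy / p_x[:, None]                     # p(y|x)
    p_t_given_x = rng.dirichlet(np.ones(n_t), size=n_x)   # random initial encoder
    for _ in range(n_iter):
        p_t = p_x @ p_t_given_x                           # p(t) = sum_x p(x) p(t|x)
        # p(y|t) = sum_x p(y|x) p(t|x) p(x) / p(t)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / (p_t[:, None] + eps)
        # D_KL( p(y|x) || p(y|t) ) for every (x, t) pair
        kl = np.einsum('xy,xty->xt', p_y_given_x,
                       np.log((p_y_given_x[:, None, :] + eps) / (p_y_given_t[None, :, :] + eps)))
        # p(t|x) proportional to p(t) exp(-beta * KL), normalised over t
        logits = np.log(p_t + eps)[None, :] - beta * kl
        p_t_given_x = np.exp(logits - logits.max(1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(1, keepdims=True)
    return p_t_given_x

# toy joint: Y is a noisy parity of a 4-valued X
p_xy = np.array([[0.22, 0.03], [0.03, 0.22], [0.22, 0.03], [0.03, 0.22]])
enc = iterative_ib(p_xy, n_t=2, beta=5.0)
print(np.round(enc, 3))   # the encoder clusters x's that share the same p(y|x)
```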

Researchers have tried to tame this in several ways:

  • Convex relaxations / surrogates: Variants like the Hilbert–Schmidt independence criterion bottleneck (HSIC-IB) replace mutual information with a kernel-based dependence measure, which yields a convex or easier objective for optimisation.

  • Variational bounds: DVIB itself is a relaxation — you replace intractable mutual information terms with variational lower bounds. These are optimisable with SGD, but they don’t restore convexity, they just make the problem numerically tractable.

  • Deficiency and f-divergence bottlenecks: Alternatives like the variational deficiency bottleneck (VDB) swap in divergences with better optimisation landscapes in some cases.

  • Annealing β / continuation: A common heuristic is to slowly increase \(\beta\), following the IB curve and avoiding bad local minima.

3 Other nearly-sufficient statistics approaches

  • Rate–Distortion (RD). Classic source coding. We choose a distortion \(d\) and find the minimal rate \(R(D)\). When our “task” is truly captured by a metric on \((x, t)\) pairs, RD is the cleanest. IB is RD with a data‑driven distortion that says: “two inputs are close if they induce similar posteriors over \(Y\).”

  • Coresets / inducing sets. These reduce dataset size rather than feature dimension: we keep a small weighted subset that approximates the full objective. Great for bounded memory and streaming, and orthogonal to the IB trade‑off; we can do IB on top of a coreset. See data summarization.

  • MDL (Minimum Description Length). Another compression‑flavoured philosophy. MDL trades off the codelength of the model plus residuals; IB trades mutual information against a chosen relevance. There is also a deeper connection via Kolmogorov complexity that I won’t chase here.

4 References

Adragni, and Cook. 2009. Sufficient Dimension Reduction and Prediction in Regression.” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences.
Alemi, Fischer, Dillon, et al. 2019. Deep Variational Information Bottleneck.”
Bachem, Lucic, and Krause. 2015. Coresets for Nonparametric Estimation - the Case of DP-Means.” In International Conference on Machine Learning.
Baez. 2011. Renyi Entropy and Free Energy.”
Barron, A., Rissanen, and Yu. 1998. The Minimum Description Length Principle in Coding and Modeling.” IEEE Transactions on Information Theory.
Barron, Andrew, Schervisch, and Wasserman. 1999. The Consistency of Posterior Distributions in Nonparametric Problems.” The Annals of Statistics.
Bartlett, Eckford, Egbert, et al. 2025. Physics of Life: Exploring Information as a Distinctive Feature of Living Systems.” PRX Life.
Blei, Kucukelbir, and McAuliffe. 2017. Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association.
Broderick, Boyd, Wibisono, et al. 2013. Streaming Variational Bayes.” In Advances in Neural Information Processing Systems 26.
Cortes, Kuznetsov, Mohri, et al. 2016. Structured Prediction Theory Based on Factor Graph Complexity.” In Advances in Neural Information Processing Systems 29.
Cutajar, Bonilla, Michiardi, et al. 2017. Random Feature Expansions for Deep Gaussian Processes.” In PMLR.
de Castro, and Dorigo. 2019. INFERNO: Inference-Aware Neural Optimisation.” Computer Physics Communications.
Dezfouli, and Bonilla. 2015. Scalable Inference for Gaussian Process Models with Black-Box Likelihoods.” In Advances in Neural Information Processing Systems 28. NIPS’15.
Dimitrakakis, Nelson, Zhang, et al. 2013. Bayesian Differential Privacy Through Posterior Sampling.” arXiv:1306.1066 [Cs, Stat].
Edwards, and Storkey. 2017. Towards a Neural Statistician.” In Proceedings of ICLR.
Elidan, and Friedman. 2005. “Learning Hidden Variable Networks: The Information Bottleneck Approach.” Journal of Machine Learning Research.
Friedman, Mosenzon, Slonim, et al. 2001. Multivariate Information Bottleneck.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Gribonval, Blanchard, Keriven, et al. 2017. Compressive Statistical Learning with Random Feature Moments.” arXiv:1706.07180 [Cs, Math, Stat].
Grünwald. 2004. A Tutorial Introduction to the Minimum Description Length Principle.” Advances in Minimum Description Length: Theory and Applications.
Hinton, and van Camp. 1993. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT ’93.
Honkela, and Valpola. 2004. Variational Learning and Bits-Back Coding: An Information-Theoretic View to Bayesian Learning.” IEEE Transactions on Neural Networks.
Huggins, Adams, and Broderick. 2017. PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference.” In Advances in Neural Information Processing Systems 30.
Hu, Shizhe, Lou, Yan, et al. 2024. A Survey on Information Bottleneck.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hu, Shell Xu, Moreno, Lawrence, et al. 2018. “Β-BNN: A Rate-Distortion Perspective on Bayesian Neural Networks.”
Hyland, Gavenčiak, Costa, et al. 2024. Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents.” In.
Kandasamy, Krishnamurthy, Poczos, et al. 2014. Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations.” arXiv:1411.4342 [Stat].
Kolchinsky. 2024a. Thermodynamic Dissipation Does Not Bound Replicator Growth and Decay Rates.” The Journal of Chemical Physics.
———. 2024b. Partial Information Decomposition: Redundancy as Information Bottleneck.” Entropy.
Kolchinsky, Marvian, Gokler, et al. 2025. Maximizing Free Energy Gain.” Entropy.
Kolchinsky, Tracey, and Wolpert. 2019. Nonlinear Information Bottleneck.” Entropy.
Kolchinsky, and Wolpert. 2018. Semantic Information, Autonomous Agency, and Nonequilibrium Statistical Physics.”
———. 2020. Thermodynamic Costs of Turing Machines.” Physical Review Research.
Kullback, and Leibler. 1951. On Information and Sufficiency.” The Annals of Mathematical Statistics.
MacKay. 2002. Information Theory, Inference & Learning Algorithms.
Mandelbrot. 1962. The Role of Sufficiency and of Estimation in Thermodynamics.” The Annals of Mathematical Statistics.
Montanari. 2015. Computational Implications of Reducing Data to Sufficient Statistics.” Electronic Journal of Statistics.
Moshkovitz, Dana, and Moshkovitz. 2017. Mixing Implies Lower Bounds for Space Bounded Learning.” In Conference on Learning Theory.
Moshkovitz, Michal, and Tishby. 2017a. Mixing Complexity and Its Applications to Neural Networks.” arXiv:1703.00729 [Cs].
———. 2017b. A General Memory-Bounded Learning Algorithm.” arXiv:1712.03524 [Cs].
Rissanen, J. 1978. Modeling by Shortest Data Description.” Automatica.
———. 1984. Universal Coding, Information, Prediction, and Estimation.” IEEE Transactions on Information Theory.
Rissanen, Jorma. 2007. Information and Complexity in Statistical Modeling. Information Science and Statistics.
Saxe, Bansal, Dapello, et al. 2019. On the Information Bottleneck Theory of Deep Learning.” Journal of Statistical Mechanics: Theory and Experiment.
Shalizi, and Crutchfield. 2002. Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems.
Shwartz-Ziv. 2022. Information Flow in Deep Neural Networks.”
Shwartz-Ziv, and Tishby. 2017. Opening the Black Box of Deep Neural Networks via Information.”
Slonim. 2002. The Information Bottleneck: Theory and Applications.”
Slonim, Friedman, and Tishby. 2006. Multivariate Information Bottleneck.” Neural Computation.
Tishby, Pereira, and Bialek. 2000. The Information Bottleneck Method.”
Tishby, and Zaslavsky. 2015a. Deep Learning and the Information Bottleneck Principle.”
———. 2015b. Deep Learning and the Information Bottleneck Principle.” In 2015 IEEE Information Theory Workshop (ITW).
Vitányi. 2006. Meaningful Information.” IEEE Transactions on Information Theory.
Wallace. 1990. Classification by Minimum-Message-Length Inference.” In Advances in Computing and Information — ICCI ’90. Lecture Notes in Computer Science.
Wolpert. 2008. Physical Limits of Inference.” Physica D: Nonlinear Phenomena, Novel Computing Paradigms: Quo Vadis?,.