Nearly sufficient statistics and information bottlenecks

Also, rate distortion by other means, and the measure-valued sufficient statistic

2018-03-13 — 2026-05-26

In Which the Posterior Distribution Is Recognised as the Canonical Measure-Valued Sufficient Statistic, Found to Recur Under Many Names Across Disparate Fields, and Subjected to Compression via the Information Bottleneck.

approximation
estimator distribution
functional analysis
information
linear algebra
metrics
model selection
optimization
probabilistic algorithms
probability
sparser than thou
statistics
uncertainty
Figure 1

🏗

I’m working through a small realization, for my own interest, which has helped my understanding of variational Bayes; and specifically in relating it to non-Bayes variational inference. Also sequential Monte Carlo.

Sufficient statistics are the Garden of Edenference. Formally, a statistic \(T\) — any function of the data — is sufficient for \(\theta\) iff the conditional law of the data given \(T\) doesn’t depend on \(\theta\),

\[ p(x\mid T(x),\theta)=p(x\mid T(x)), \]

equivalently the Fisher–Neyman factorisation \(p(x;\theta)=h(x)\,g(T(x);\theta)\). Everything the data say about \(\theta\) is funnelled through \(T\); conditional on \(T\), the rest of \(x\) is noise as far as \(\theta\) is concerned.

Sufficiency is a property of a function of the data; it says nothing about where that function takes values, i.e the codomain of \(T\). Nothing in the definition forces \(T\) to be small, finite-dimensional, or even a vector. The dream, though, is that it is: a fixed-size summary we can carry around in place of the full dataset. In the easy case the codomain is \(\mathbb{R}^k\) with \(k\) fixed — a vector summary, our data is infinitely compressible.

Turns out, Edenference is small. By Pitman–Koopman–Darmois (Pitman 1936; Koopman 1936; Darmois 1935), a sufficient statistic whose dimension stays fixed as the sample size \(n\) grows essentially characterises exponential families. Under even mild generalizations, things are not so nice. In mixtures, hierarchical models, anything modern including and especially neural networks — there is no fixed-size \(T\) that is exactly sufficient and which characterizes the posterior.

1 The sufficient statistic is a distribution

In general, the natural codomain of \(T\) is the space of probability measures on \(\Theta\), and the canonical sufficient statistic is the likelihood function, or equivalently in Bayes, the posterior distribution over \(\theta\) given \(x\). This still is a sufficient statistic, if we let \(T\) live somewhere bigger than \(\mathbb{R}^k\). The posterior \(\theta \mapsto p(\theta\mid x)\) is itself a function of the data — a statistic in the original sense — it just takes values in the space of probability measures on \(\Theta\) rather than in \(\mathbb{R}^k\). Bayesian sufficiency (Halmos and Savage 1949; Bahadur 1954) in fact says precisely this: \(T\) is sufficient iff, for every prior, the posterior depends on \(x\) only through \(T\) — so the posterior measure is itself sufficient, tautologically.

For prediction the analogue is the conditional law of the future given the past; Knight’s prediction process (Knight 1975) is the minimal Markov sufficient statistic for any process.

Now, it turns out measure-valued sufficient statistics pop up a lot; I hadn’t realised how many places they occur, despite seeing many objects that were remarkably similar. Most recently, I was writing up POMDP belief states and I was all like, damn that looks like a sufficient statistic — why does it have a funny name? A quick search via my friendly neighbourhood LLM revealed quite a few other places where the same object appears, under different names:

Field Name it goes by The object
Bayesian statistics posterior \(p(\theta\mid x)\); sufficient for every prior (Halmos and Savage 1949; Bahadur 1954)
Sequential analysis, Bayesian bandits belief, Bayes-adaptive state posterior over the unknown parameter; the state of the Gittins / BAMDP problem
Nonlinear filtering, HMMs the filter, conditional law \(\pi_t=\mathbb{P}(X_t\in\cdot\mid\mathcal{Y}_t)\) (Kushner 1964; Stratonovich 1960; Zakai 1969); collapses to mean + covariance in the Kalman / exponential-family case
Stochastic control information state, hyperstate (Åström 1965; Striebel 1965; Kumar and Varaiya 1986); the separation principle acts on it
POMDPs, RL belief state \(b_t\in\Delta(\mathcal{S})\) — see the belief-MDP reduction
Stochastic processes prediction process (Knight 1975); the canonical measure-valued Markov sufficient statistic
Computational mechanics causal state, ε-machine (Shalizi and Crutchfield 2000); the minimal predictively-sufficient partition of histories
Graphical models message, belief, mean parameter a belief-propagation message is a sufficient statistic for its subtree’s evidence; EP projects it onto an exponential family (Minka 2001; Wainwright and Jordan 2008)
Predictive state representations predictive state (LittmanSutton2001Predictive?); sufficiency stated over future observable tests rather than latent state
Exchangeability de Finetti mixing measure the latent “parameter” is itself a random distribution

To reprise PKD: call it what you want, outside an exponential family that measure has no fixed finite parameterization, so an exactly sufficient finite statistic is off the table.

It seems we manage to act in the world without carrying around a copy of the world in our heads; so if we are doing something like statistical analysis of the world, we must be making some approximations. The question is how cheaply we can approximate that measure while losing as little as we can afford to (whatever that, in context, means).

2 The information bottleneck

Figure 2: Development of information bottleneck through history (S. Hu et al. 2024)

Information bottlenecks are one way of formalising this finite-budget summary of a measure-valued sufficient statistic. I ransacked S. Hu et al. (2024) for a recent summary of developments in information bottleneck theory. Unless otherwise stated, the notation and results come from there.

Information bottleneck reframes “find a nearly sufficient statistic” as a variational problem: choose a relevance variable \(Y\) (what matters), then learn a representation \(T(X)\) that retains information about \(Y\) while discarding information about \(X\). Formally, minimize \(I(T;X)-\beta I(T;Y)\). With \(Y=\theta\) this yields approximate minimal sufficient statistics for inference; with \(Y=X_{t+1}\) it yields predictive state for filtering. This is a special case of rate–distortion with a data‑driven distortion and gives a principled meaning to “nearly sufficient”: small \(I(Y;X\mid T)\) at controlled memory \(I(T;X)\).

We pick a “relevance” variable \(Y\) that encodes what matters—future observations, labels, a function of \(\theta\), a decision, or the next state in a filter. Then we find a (hopefully compact) representation \(T\) of \(X\) that trades off compressing \(X\) against preserving information about \(Y\):

\[ \min_{p(t\mid x)}\; I(T;X)\;-\;\beta\, I(T;Y). \]

This is a specialization of rate–distortion, where the distortion is chosen to be the Kullback–Leibler mismatch in predicting \(Y\) from \(T\), and \(\beta\!\!>\!0\) is our trade‑off knob, i.e. rate–distortion with an informational metric.

Payoffs:

  • If we set \(Y=\theta\) (or a function of it), the optimal \(T\) is an approximately minimal sufficient statistic for inference about \(\theta\).
  • If we set \(Y=X_{t+1}\) (or a future payoff) in a filtering/control problem, we get a predictive state: a compressed memory \(T_t\) of the past that’s maximally useful for predicting what we need next.

2.1 “Nearly sufficient” IB

There are at least two ways to be “nearly sufficient” in the IB context, depending on what we want \(Y\) to do.

  1. Information sufficiency (task‑conditional independence). Define \(T\) to be \(Y\)-sufficient if \(I(Y;X\mid T)=0\). Then “\(\varepsilon\)-sufficient” means \(I(Y;X\mid T)\le \varepsilon\). IB is the convex relaxation that searches for \(T\) with small \(I(T;X)\) subject to small \(I(Y;X\mid T)\), via a Lagrange multiplier \(\beta\).

  2. Decision sufficiency. For a loss \(\ell(a,y)\), call \(T\) \(\varepsilon\)-sufficient if the Bayes risk gap using \(T\) instead of \(X\) is \(\le \varepsilon\). For log‑loss this is equivalent to #1; for other losses it’s stricter or looser. (This is where rate–distortion with a bespoke distortion can be a better fit than vanilla IB if the task isn’t log‑loss‑like.)

Pick the notion that matches our downstream risk. If we’re doing Bayesian filtering or SMC, #1 with \(Y\!=\!X_{t+1}\) or \(Y\!=\!\) future rewards is what we want. On the other hand, if we’re doing parameter estimation, #1 with \(Y\!=\!\theta\) (or a function like \(g(\theta)\)) is usually the cleanest.

2.2 Connection to Variational Bayes

The “deep variational IB” (DVIB) parametrises \(p(t\mid x)\) (the encoder) and works with a variational decoder \(q(y\mid t)\) plus a prior \(r(t)\). The training objective

\[ \max\; \mathbb{E}_{p(t\mid x)}[\log q(y\mid t)]\;-\;\beta\,\mathrm{KL}\big(p(t\mid x)\,\Vert\, r(t)\big) \]

Here’s the IB Lagrangian rewritten so we can optimize it with gradients. With \(Y=X\) and \(\beta\!=\!1\), we recover the VAE ELBO; that’s not an accident. ELBO can be seen as an IB objective where the “relevant” variable is the input itself; DVIB is ELBO with the relevance set to our task.

2.3 Filtering and state compression

In Bayes filtering we want a recursively updated \(T_t\) that is small but “task‑sufficient”. In exponential families (e.g. linear-Gaussian) filters exist, e.g. the Kalman filter.

IB gives us a target for awkward models: minimize \(I(T_t; X_{1:t})\) while maximising \(I(T_t; Y_t)\), where \(Y_t\) is whatever we care about next (the next observation, the latent state, the next reward). We can then:

  • Parameter‑tracking: We set \(Y_t=\theta\) (or \(g(\theta)\)), turning \(T_t\) into a rolling compressed posterior proxy.
  • Predictive control: We can set \(Y_t=X_{t+1}\) or a function of \(X_{t+1}\) and action \(A_t\).

2.4 In neural nets

Figure 3: Mutual information evolution in DNNs (S. Hu et al. 2024)

Each layer in a trained DNN behaves approximately like a nearly sufficient statistic for its task variable, which is a helpful mental picture even if we don’t care about DNNs. The classical IB principle was defined as an optimization problem over encoders \(p(t|x)\): compress \(X\) into \(T\) while keeping as much information about \(Y\) as possible,

\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]

That’s cool, but impractical as written because neural networks are large and complicated. Deep Information Bottleneck parameterizes the encoder and decoder with neural nets and trains them using gradient methods.

  • Encoder: \(p_\theta(t|x)\) (often Gaussian; mean and variance predicted by a neural net).
  • Decoder: \(q_\phi(y|t)\), a variational approximation to \(p(y|t)\).
  • Prior: \(r(t)\), usually standard Gaussian, to make the KL term computable.

So the training objective becomes:

\[ \max_{\theta,\phi}\; \mathbb{E}_{p_\theta(t|x)}[\log q_\phi(y|t)] - \beta\, \text{KL}(p_\theta(t|x)\|r(t)). \]

That’s the Deep Variational IB (DVIB) objective. If we set \(Y=X\), we get the VAE ELBO; if we set \(Y\) to be labels, we get a bottleneck tuned for classification.

The KL penalty is the mechanism of “compression”. It forces the encoder not to keep every pixel of \(X\) but only the degrees of freedom that actually help predict \(Y\).

This gives the encoder a concrete way to approximate minimal sufficient statistics for \(Y\), even in messy, high-dimensional settings.

Shwartz-Ziv and Tishby (2017) famously plotted layer-wise mutual information during NN training (I forget now if they were using IB-aware training, or if this holds generally — TODO check). In their experiments each hidden layer \(T_i\) follows a path in the “information plane” \((I(T;X), I(T;Y))\):

  • Phase 1 (fitting): mutual information with \(Y\) shoots up as the layers learn to predict.
  • Phase 2 (compression): \(I(T;X)\) falls slowly — the representation forgets irrelevant details while keeping \(I(T;Y)\) high.

The two phases correspond to what we see if we measure gradient statistics: a drift phase (large, directed updates) followed by a diffusion phase (random fluctuations dominate). The diffusion adds noise that automatically compresses representations.

I think this means that in the Deep IB view, SGD itself is a bottleneck machine.* Layers end up very close to the IB theoretical bound, which characterises optimal trade-offs between memory \(I(T;X)\) and relevance \(I(T;Y)\).

2.5 Non-convexity

Next bit was summarized by an LLM not by me; needs revision and expansion.

There are two competing terms in the functional

\[ \min_{p(t|x)} I(T;X) - \beta I(T;Y). \]

\(I(T;X)\) is convex in \(p(t|x)\), while \(I(T;Y)\) is concave in \(p(t|x)\). Their weighted difference is not convex. This means we don’t get the nice guarantees we’d have in rate–distortion theory with convex distortion measures. Instead, the IB Lagrangian can have multiple local minima.

The iterative IB algorithm (the original Blahut–Arimoto-style update rules from Tishby et al. 1999) only converges to a fixed point of the self-consistent equations, not necessarily the global optimum. This is why different runs with different initialisations can produce different clusterings or partitions of the data.

  • Classical IB (discrete case): iterative IB converges to locally optimal solutions; sometimes multiple solutions lie on the same “information curve,” but we don’t know if we’ve found the best.
  • Deep IB (DVIB): we’ve compounded the problem: now the encoder/decoder are parameterized by neural nets, so we inherit both the non-convexity of the IB objective and the non-convexity of deep learning. Optimizers (SGD, Adam) will tend to settle in some basin that’s “good enough” but not provably optimal.
  • Information-plane diagnostics (Schwartz-Ziv & Tishby): One of the reasons they use mutual information plots is to check whether the layers land near the theoretical IB bound. But the fact that different training runs converge to slightly different places reflects the non-convexity.

Researchers have tried to tame this in several ways:

  • Convex relaxations / surrogates: Variants like the Hilbert–Schmidt independence criterion bottleneck (HSIC-IB) replace mutual information with a kernel-based dependence measure, which yields a convex or easier objective for optimization.

  • Variational bounds: DVIB itself is a relaxation — we replace intractable mutual information terms with variational lower bounds. These are optimizable with SGD, but they don’t restore convexity; they just make the problem numerically tractable.

  • Deficiency and f-divergence bottlenecks: Alternatives like the variational deficiency bottleneck (VDB) swap in divergences with better optimization landscapes in some cases.

  • Annealing β / continuation: A common heuristic is to slowly increase \(\beta\), following the IB curve and avoiding bad local minima.

3 Other nearly-sufficient statistics approaches

  • Rate–Distortion (RD). Classic source coding. We choose a distortion \(d\) and find the minimal rate \(R(D)\). When our “task” is truly captured by a metric on \(x\)\(t\) pairs, RD is the cleanest. IB is RD with a data‑driven distortion that says: “Two inputs are close if they induce similar posteriors over \(Y\).”

  • Coresets / inducing sets. These reduce dataset size rather than feature dimension: we keep a small weighted subset that approximates the full objective. Great for bounded memory and streaming, orthogonal to the IB trade‑off — we can do IB on top of a coreset. See data summarization

  • MDL (Minimum Description Length). Another compression‑flavoured philosophy. MDL trades codelength of model plus residuals; IB trades mutual information for a chosen relevance. It’s related to Kolmogorov complexity.

4 Also singular learning theory

See Singular learning theory for an implicit approach to “nearly sufficient” statistics, which is more focused on the geometry of the likelihood function and the behaviour of the marginal likelihood in singular models. This comes out rather different for unclear reason.

5 References

Adragni, and Cook. 2009. Sufficient Dimension Reduction and Prediction in Regression.” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences.
Alemi, Fischer, Dillon, et al. 2019. Deep Variational Information Bottleneck.”
Åström. 1965. Optimal Control of Markov Processes with Incomplete State Information.” Journal of Mathematical Analysis and Applications.
Bachem, Lucic, and Krause. 2015. Coresets for Nonparametric Estimation - the Case of DP-Means.” In International Conference on Machine Learning.
Baez. 2011. Renyi Entropy and Free Energy.”
Bahadur. 1954. Sufficiency and Statistical Decision Functions.” The Annals of Mathematical Statistics.
Barron, A., Rissanen, and Yu. 1998. The Minimum Description Length Principle in Coding and Modeling.” IEEE Transactions on Information Theory.
Barron, Andrew, Schervisch, and Wasserman. 1999. The Consistency of Posterior Distributions in Nonparametric Problems.” The Annals of Statistics.
Bartlett, Eckford, Egbert, et al. 2025. Physics of Life: Exploring Information as a Distinctive Feature of Living Systems.” PRX Life.
Blei, Kucukelbir, and McAuliffe. 2017. Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association.
Broderick, Boyd, Wibisono, et al. 2013. Streaming Variational Bayes.” In Advances in Neural Information Processing Systems 26.
Cortes, Kuznetsov, Mohri, et al. 2016. Structured Prediction Theory Based on Factor Graph Complexity.” In Advances in Neural Information Processing Systems 29.
Cutajar, Bonilla, Michiardi, et al. 2017. Random Feature Expansions for Deep Gaussian Processes.” In PMLR.
Darmois. 1935. “Sur Les Lois de Probabilité à Estimation Exhaustive.” Comptes Rendus de l’Académie Des Sciences.
de Castro, and Dorigo. 2019. INFERNO: Inference-Aware Neural Optimisation.” Computer Physics Communications.
Dezfouli, and Bonilla. 2015. Scalable Inference for Gaussian Process Models with Black-Box Likelihoods.” In Advances in Neural Information Processing Systems 28. NIPS’15.
Dimitrakakis, Nelson, Zhang, et al. 2013. Bayesian Differential Privacy Through Posterior Sampling.” arXiv:1306.1066 [Cs, Stat].
Edwards, and Storkey. 2017. Towards a Neural Statistician.” In Proceedings of ICLR.
Elidan, and Friedman. 2005. “Learning Hidden Variable Networks: The Information Bottleneck Approach.” Journal of Machine Learning Research.
Friedman, Mosenzon, Slonim, et al. 2001. Multivariate Information Bottleneck.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Gribonval, Blanchard, Keriven, et al. 2017. Compressive Statistical Learning with Random Feature Moments.” arXiv:1706.07180 [Cs, Math, Stat].
Grünwald. 2004. A Tutorial Introduction to the Minimum Description Length Principle.” Advances in Minimum Description Length: Theory and Applications.
Halmos, and Savage. 1949. Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics.” The Annals of Mathematical Statistics.
Hinton, and van Camp. 1993. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights.” In Proceedings of the Sixth Annual Conference on Computational Learning Theory. COLT ’93.
Honkela, and Valpola. 2004. Variational Learning and Bits-Back Coding: An Information-Theoretic View to Bayesian Learning.” IEEE Transactions on Neural Networks.
Huggins, Adams, and Broderick. 2017. PASS-GLM: Polynomial Approximate Sufficient Statistics for Scalable Bayesian GLM Inference.” In Advances in Neural Information Processing Systems 30.
Hu, Shizhe, Lou, Yan, et al. 2024. A Survey on Information Bottleneck.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Hu, Shell Xu, Moreno, Lawrence, et al. 2018. “Β-BNN: A Rate-Distortion Perspective on Bayesian Neural Networks.”
Hyland, Gavenčiak, Costa, et al. 2024. Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents.” In.
Kandasamy, Krishnamurthy, Poczos, et al. 2014. Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations.” arXiv:1411.4342 [Stat].
Knight. 1975. A Predictive View of Continuous Time Processes.” The Annals of Probability.
Kolchinsky. 2024a. Thermodynamic Dissipation Does Not Bound Replicator Growth and Decay Rates.” The Journal of Chemical Physics.
———. 2024b. Partial Information Decomposition: Redundancy as Information Bottleneck.” Entropy.
Kolchinsky, Marvian, Gokler, et al. 2025. Maximizing Free Energy Gain.” Entropy.
Kolchinsky, Tracey, and Wolpert. 2019. Nonlinear Information Bottleneck.” Entropy.
Kolchinsky, and Wolpert. 2020. Thermodynamic Costs of Turing Machines.” Physical Review Research.
Koopman. 1936. On Distributions Admitting a Sufficient Statistic.” Transactions of the American Mathematical Society.
Kullback, and Leibler. 1951. On Information and Sufficiency.” The Annals of Mathematical Statistics.
Kumar, and Varaiya. 1986. Stochastic Systems: Estimation, Identification, and Adaptive Control. Classics in Applied Mathematics.
Kushner. 1964. On the Differential Equations Satisfied by Conditional Probability Densities of Markov Processes, with Applications.” Journal of the Society for Industrial and Applied Mathematics, Series A: Control.
Littman, Sutton, and Singh. 2001. Predictive Representations of State.” In Advances in Neural Information Processing Systems 14 (NIPS 2001).
MacKay. 2002. Information Theory, Inference & Learning Algorithms.
Mandelbrot. 1962. The Role of Sufficiency and of Estimation in Thermodynamics.” The Annals of Mathematical Statistics.
Minka. 2001. Expectation Propagation for Approximate Bayesian Inference.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.
Montanari. 2015. Computational Implications of Reducing Data to Sufficient Statistics.” Electronic Journal of Statistics.
Moshkovitz, Dana, and Moshkovitz. 2017. Mixing Implies Lower Bounds for Space Bounded Learning.” In Conference on Learning Theory.
Moshkovitz, Michal, and Tishby. 2017a. Mixing Complexity and Its Applications to Neural Networks.” arXiv:1703.00729 [Cs].
———. 2017b. A General Memory-Bounded Learning Algorithm.” arXiv:1712.03524 [Cs].
Pitman. 1936. Sufficient Statistics and Intrinsic Accuracy.” Mathematical Proceedings of the Cambridge Philosophical Society.
Rissanen, J. 1978. Modeling by Shortest Data Description.” Automatica.
———. 1984. Universal Coding, Information, Prediction, and Estimation.” IEEE Transactions on Information Theory.
Rissanen, Jorma. 2007. Information and Complexity in Statistical Modeling. Information Science and Statistics.
Saxe, Bansal, Dapello, et al. 2019. On the Information Bottleneck Theory of Deep Learning.” Journal of Statistical Mechanics: Theory and Experiment.
Shalizi, and Crutchfield. 2000. Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” Journal of Statistical Physics.
Shalizi, and Crutchfield. 2002. Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems.
Shwartz-Ziv. 2022. Information Flow in Deep Neural Networks.”
Shwartz-Ziv, and Tishby. 2017. Opening the Black Box of Deep Neural Networks via Information.”
Slonim. 2002. The Information Bottleneck: Theory and Applications.”
Slonim, Friedman, and Tishby. 2006. Multivariate Information Bottleneck.” Neural Computation.
Stratonovich. 1960. Conditional Markov Processes.” Theory of Probability & Its Applications.
Striebel. 1965. Sufficient Statistics in the Optimum Control of Stochastic Systems.” Journal of Mathematical Analysis and Applications.
Tishby, Pereira, and Bialek. 2000. The Information Bottleneck Method.”
Tishby, and Zaslavsky. 2015. Deep Learning and the Information Bottleneck Principle.” In 2015 IEEE Information Theory Workshop (ITW).
Vitányi. 2006. Meaningful Information.” IEEE Transactions on Information Theory.
Wainwright, and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.
Wallace. 1990. Classification by Minimum-Message-Length Inference.” In Advances in Computing and Information — ICCI ’90. Lecture Notes in Computer Science.
Wolpert. 2008. Physical Limits of Inference.” Physica D: Nonlinear Phenomena, Novel Computing Paradigms: Quo Vadis?,.
Zakai. 1969. On the Optimal Filtering of Diffusion Processes.” Zeitschrift Für Wahrscheinlichkeitstheorie Und Verwandte Gebiete.