Entropy vs information

MaxEnt(?), macrostates, subjective updating, epistemic randomness, Szilard engines, Gibbs paradox…

2010-12-01 — 2026-02-14

Wherein the relation of entropy and information is examined, and macrostates are shown to be defined as the unique Markovian partitions of phase space, linking thermodynamic notions to predictive causal states.

Bayes
dynamical systems
machine learning
neural nets
physics
pseudorandomness
sciml
statistics
statmech
stochastic processes

Related somehow: algorithmic statistics, information geometry.

Computational mechanics (CM) is a theory of “state” for stochastic processes, built around a criterion for state that is not the one I learned in my state-estimation classes back in the day: state is, rather, whatever we need for optimal prediction from data. It formalizes what it means for a process to store information about its past that is relevant to its future, and it gives a canonical predictive model (the ε-machine) plus information-theoretic invariants (e.g. statistical complexity).

See Shalizi and Crutchfield (2000).

1 Intellectual precedents

Several lines of thought converge here.

1.1 Physics stuff

Especially statistical mechanics and dynamical systems. In physics we often want a reduced description that is predictive for macroscopic observables. Markov partitions, symbolic dynamics, ergodic theory, and the general “coarse-grain but preserve what matters” instinct sit in the background of CM. CM explicitly links its invariants to ergodic/information-theoretic quantities (Shalizi and Crutchfield 2000).

1.2 Information theory stuff

Computational mechanics is natively expressed in conditional distributions, entropies, mutual informations, and minimal sufficient statistics. In practice, Shannon information is treated as the canonical quantification of these quantities (Shalizi and Crutchfield 2000).

1.3 Formal language stuff

Automata and grammar inference. The ε-machine is a probabilistic, minimal predictive automaton: states are equivalence classes of pasts, transitions are labeled by symbols.

2 Intellectual competitors

Competing modelling formalisms include:

A. History-based models. \(k\)-th order Markov, variable-order Markov, suffix trees/contexts. These approximate state by a bounded window of the past.

B. Latent-state generative models. Hidden Markov models (HMMs), state-space models, dynamic Bayes nets; with control: POMDPs. These posit hidden variables that generate observations.

C. Predictive/observable-state models. Predictive State Representations (PSRs), Observable Operator Models (OOMs), and spectral/subspace system identification methods. These also define “state” via predictions of future observables, as coordinates in a finite-dimensional feature space Boots and Gordon (2011).

CM’s distinguishing move is: define state as the minimal statistic of the past that preserves the entire conditional law of the future (not just finite horizons or chosen features), then study the structure and invariants of that object (Shalizi and Crutchfield 2000).

OK, to put that idea into action, let us roll forward.

3 Let’s start with stationary processes

Let \((X_t)_{t\in\mathbb Z}\) be a stationary discrete-time stochastic process over a finite alphabet \(\mathcal A\). Define the semi-infinite past and future at time \(t\): \[ \overleftarrow{X}_t := (\ldots, X_{t-2}, X_{t-1}),\qquad \overrightarrow{X}_t := (X_t, X_{t+1}, \ldots). \] Measure-theoretically, conditioning on a past \(\overleftarrow{x}\) gives a probability measure on the σ-algebra of futures.

The prediction problem is to characterize the conditional law \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}). \]
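To make the target concrete, here is a minimal sketch (my own toy, not from the source) of estimating the conditional law of the next symbol given a finite window of the past, by counting. The process is the “golden mean” process, in which a 1 is never followed by a 1; the full conditional law over futures is the infinite-horizon version of this object.

```python
from collections import Counter, defaultdict
import random

random.seed(0)
# Toy process (my choice): golden-mean rule, a 1 is never followed by a 1;
# after a 0 we emit 0 or 1 with equal probability.
xs = [0]
for _ in range(200_000):
    xs.append(0 if xs[-1] == 1 else random.randint(0, 1))

# Empirical conditional law of the next symbol given a length-k window.
k = 3
counts = defaultdict(Counter)
for t in range(k, len(xs)):
    counts[tuple(xs[t - k:t])][xs[t]] += 1

for past in sorted(counts):
    n = sum(counts[past].values())
    print(past, round(counts[past][1] / n, 2))
```

Pasts ending in 1 predict the next symbol is 0 with certainty; pasts ending in 0 predict roughly a fair coin, foreshadowing the two-state structure below.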

4 Causal states are predictive equivalence classes of histories

Define an equivalence relation on pasts: \[ \overleftarrow{x}\sim \overleftarrow{x}' \quad\Longleftrightarrow\quad \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}'). \] A causal state is an equivalence class under \(\sim\). Let \[ S_t := \epsilon(\overleftarrow{X}_t) \] be the random variable obtained by mapping a realized past to its class (the “causal-state map” \(\epsilon\)).
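A toy computation of this partition (the process and parameter are my choices, not from the source): for the golden-mean process, in which a 1 is never followed by a 1, we can compute each past’s predictive distribution exactly from a two-state edge-emitting model and group pasts by it. Exactly two classes fall out: pasts ending in 0 and pasts ending in 1.

```python
import itertools
import numpy as np

p = 0.5  # emission parameter (my choice)
# Edge-emitting model of the golden-mean process: hidden states {A, B}.
# T[a][i, j] = P(emit a, go to state j | in state i).
T = {0: np.array([[p, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [0.0, 0.0]])}
M = T[0] + T[1]  # internal transition matrix

# Stationary distribution of the internal chain: left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(M.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

def prob(word):
    """Exact stationary probability of observing `word`."""
    v = pi
    for a in word:
        v = v @ T[a]
    return float(v.sum())

# Group length-6 pasts by their exact next-symbol predictive distribution.
classes = {}
for past in itertools.product([0, 1], repeat=6):
    pw = prob(past)
    if pw == 0.0:
        continue  # forbidden past (contains "11")
    pred = tuple(round(prob(past + (a,)) / pw, 10) for a in (0, 1))
    classes.setdefault(pred, []).append(past)

print(len(classes))  # two causal states: pasts ending in 0 vs ending in 1
```

Here only the next-symbol distribution is compared, which happens to suffice for this process; in general the equivalence is over the law of the entire future.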

Key consequences:

First, prescience (sufficiency for prediction): conditioning on \(S_t\) is as informative about the entire future as conditioning on the full past: \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid S_t) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t), \] equivalently \(H(\overrightarrow{X}_t \mid S_t)=H(\overrightarrow{X}_t \mid \overleftarrow{X}_t)\).

Second, minimality (no redundant memory): among all statistics \(R_t=\eta(\overleftarrow{X}_t)\) that are equally prescient, \(S_t\) is a minimal sufficient statistic: any prescient \(R_t\) refines (maps to) \(S_t\) a.s. (Shalizi and Crutchfield 2000).

This is the sense in which CM defines the “right” state: the one forced by predictive equivalence.

Sufficient statistics here are defined in an information-theoretic way: \(R_t\) is sufficient for predicting \(\overrightarrow{X}_t\) if \(I(\overrightarrow{X}_t; \overleftarrow{X}_t \mid R_t)=0\) (i.e. the future is conditionally independent of the past given \(R_t\)).

These look a lot like the sufficient statistics of exponential families, which we presumably recover for sufficiently nice models.

Quibble: it feels off to me to frame this conditional-independence property first and foremost as informational. Sure, a vanishing conditional mutual information is an information-theoretic condition, but there are many other ways of expressing conditional independence that do not incline us to estimate entropies and mutual informations, which are notoriously cursed, and which have all kinds of stupid problems when the models are not perfect, problems that in practice we often wish to circumvent.

From this point on, I have used the LLM a lot without adequate verification. Read on at your peril.

5 ε-machines

The ε-machine is the state-transition structure induced by the causal states; it gives us the canonical predictive model.

For each symbol \(a\in\mathcal A\), define labeled transition probabilities \[ T^{(a)}_{ij} := \mathbb P(X_t=a,\, S_{t+1}=s_j \mid S_t=s_i). \] You can view this as a labeled directed graph (a probabilistic automaton) whose nodes are causal states.

Term of art: unifilar.

A central structural property in CM is unifilarity: the next causal state is determined by the current causal state and the next observed symbol (i.e. state updates are deterministic given the observation symbol).

Let \((S_t, X_t)\) be an edge-emitting HMM with finite state set \(\mathcal S=\{1,\dots,n\}\) and output alphabet \(\mathcal A\). Define the symbol-labeled transition matrices \[ T^{(x)} \in [0,1]^{n\times n},\qquad T^{(x)}_{ij} := \mathbb P\!\left(X_t=x,\ S_{t+1}=j\ \middle|\ S_t=i\right), \quad x\in\mathcal A. \]

We write unifilarity (a.k.a. right-resolving) as follows: \[ \forall i\in\mathcal S,\ \forall x\in\mathcal A:\quad \left|\left\{j\in\mathcal S:\ T^{(x)}_{ij}>0\right\}\right|\le 1. \tag{U1} \] Equivalently, each row of each \(T^{(x)}\) has at most one nonzero entry.

Equivalent deterministic update map. There exists a (partial) function \(\delta:\mathcal S\times\mathcal A\to\mathcal S\) such that \[ T^{(x)}_{ij}=0 \quad\text{for all } j\neq \delta(i,x), \tag{U2} \] and whenever \(\mathbb P(X_t=x\mid S_t=i)>0\), \[ \mathbb P\!\left(S_{t+1}=\delta(i,x)\ \middle|\ S_t=i,\ X_t=x\right)=1. \tag{U3} \] (If \(\mathbb P(X_t=x\mid S_t=i)=0\), \(\delta(i,x)\) can be left undefined.)

Equivalent conditional-entropy statement (under the usual “ignore zero-probability conditioning” convention): \[ H(S_{t+1}\mid S_t, X_t)=0. \tag{U4} \]

Graph form. For each state \(i\) and symbol \(x\), there is at most one outgoing edge from \(i\) labeled \(x\); its destination (if it exists) is \(\delta(i,x)\).

This makes filtering straightforward: as we observe symbols we track a single causal-state trajectory rather than a distribution over hidden states. CM develops optimality/uniqueness results around this representation (Shalizi and Crutchfield 2000).
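A sketch of checking (U1) and exploiting it for filtering, using the golden-mean machine (my running toy example, not from the source) with states A = 0 and B = 1:

```python
import numpy as np

# Golden-mean epsilon-machine (my toy example); rows index the current
# causal state, columns the next one.
T = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 0.5], [0.0, 0.0]])}

def is_unifilar(T):
    # (U1): each row of each T^{(x)} has at most one nonzero entry.
    return all(int((mat > 0).sum(axis=1).max()) <= 1 for mat in T.values())

def update_map(T):
    # Recover the partial deterministic map delta of (U2)-(U3).
    delta = {}
    for x, mat in T.items():
        for i, row in enumerate(mat):
            js = np.flatnonzero(row)
            if js.size:
                delta[(i, x)] = int(js[0])
    return delta

assert is_unifilar(T)
delta = update_map(T)

# Filtering: given the current causal state, each observed symbol picks
# out a single successor; no belief distribution over states is needed.
state = 0
for symbol in [0, 1, 0, 0, 1, 0]:
    state = delta[(state, symbol)]
print(state)  # back in state 0 after this word
```

For a non-unifilar presentation, the loop above would instead have to propagate a posterior over hidden states, which is exactly the contrast drawn with generic HMMs.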

By contrast, HMM states are latent and generally non-identifiable without additional assumptions; causal states are defined from observable conditional laws. Also, a minimal HMM (fewest hidden states) need not be unifilar; CM’s ε-machine is the minimal unifilar optimal predictor (conceptually: minimal “predictive memory” with deterministic update).

6 Intrinsic information measures

Given the causal-state process \((S_t)\), CM defines invariants that separate randomness from structure.

  • Entropy rate \[ h_\mu := \lim_{L\to\infty} \frac{1}{L} H(X_{t:t+L-1}\mid \overleftarrow{X}_t) = H(X_t \mid \overleftarrow{X}_t) = H(X_t \mid S_t), \] where the last equality uses prescience (Shalizi and Crutchfield 2000).

  • Statistical complexity \[ C_\mu := H(S_t). \] Interpretation: the average amount of information the process stores about the past that is relevant to predicting the future (“predictive memory”) (Shalizi and Crutchfield 2000).

  • Excess entropy / predictive information \[ \mathbf E := I(\overleftarrow{X}_t;\overrightarrow{X}_t), \] the total past–future mutual information. We always have \(\mathbf E \le C_\mu\); the gap is related to “hidden” internal organization not directly manifest as past–future mutual information (often discussed under labels like crypticity in the CM literature) (Shalizi and Crutchfield 2000).
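For a concrete check of these invariants, here is a sketch computing \(h_\mu\), \(C_\mu\), and a block-entropy approximation of \(\mathbf E\) for the golden-mean ε-machine (the machine, its parameter, and the block length are my choices, not from the source):

```python
import itertools
import numpy as np

# Golden-mean epsilon-machine, p = 0.5 (my running toy example).
p = 0.5
T = {0: np.array([[p, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [0.0, 0.0]])}
pi = np.array([2.0, 1.0]) / 3.0  # stationary causal-state distribution

def H(dist):
    q = np.asarray([x for x in dist if x > 0])
    return float(-(q * np.log2(q)).sum())

# Statistical complexity: entropy of the causal-state distribution.
C_mu = H(pi)

# Entropy rate: state-averaged next-symbol entropy, h = H(X_t | S_t).
h_mu = sum(pi[i] * H([T[a][i].sum() for a in (0, 1)]) for i in range(2))

# Excess entropy via block entropies: E = lim_L [2 H(L) - H(2L)].
def block_H(L):
    probs = []
    for w in itertools.product([0, 1], repeat=L):
        v = pi
        for a in w:
            v = v @ T[a]
        probs.append(float(v.sum()))
    return H(probs)

E = 2 * block_H(4) - block_H(8)
print(round(C_mu, 3), round(h_mu, 3), round(E, 3))  # 0.918 0.667 0.252
```

Note \(\mathbf E \approx 0.252 < C_\mu \approx 0.918\) here: the process stores more predictive memory than is visible as past–future mutual information.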

7 Inference/learning in CM

In practice, we do not know the true conditional laws, so we estimate them and merge pasts whose estimated predictive distributions are close (with statistical tests, Bayesian nonparametrics, or algorithmic heuristics). CM emphasizes that the target is the causal-state partition and ε-machine, justified by the optimality/minimality results (Shalizi and Crutchfield 2000). The hard cases are exactly those where the ε-machine has many (possibly countably infinite) causal states, or where finite data makes near-equivalences ambiguous.
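A crude state-merging sketch in the spirit of (but much dumber than) such algorithms, with all thresholds and names my own: estimate suffix predictive distributions from a sample path of the golden-mean toy process, then merge suffixes whose distributions agree within a tolerance.

```python
from collections import Counter, defaultdict
import random

random.seed(1)
# Sample path from the golden-mean process (my toy example).
xs = [0]
for _ in range(100_000):
    xs.append(0 if xs[-1] == 1 else random.randint(0, 1))

# Estimated next-symbol distributions for all length-3 suffixes.
k = 3
counts = defaultdict(Counter)
for t in range(k, len(xs)):
    counts[tuple(xs[t - k:t])][xs[t]] += 1
p_one = {s: c[1] / sum(c.values()) for s, c in counts.items()}

# Merge suffixes whose predictive distributions agree within eps
# (for a binary alphabet, total variation distance is just |p - p'|).
eps = 0.05
states = []  # each entry: [representative P(next=1), member suffixes]
for suffix, q in sorted(p_one.items()):
    for rep in states:
        if abs(q - rep[0]) < eps:
            rep[1].append(suffix)
            break
    else:
        states.append([q, [suffix]])

print(len(states))  # the two causal states re-emerge from data
```

The greedy first-fit merge and the fixed tolerance are exactly the kind of shortcuts that break in the hard cases mentioned above, where near-equivalent predictive distributions abound.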

8 Predictive State Representations (PSRs)

“state as a vector of predictions”

PSRs generalize the same predictive instinct to (often controlled) dynamical systems.

Setup (controlled I/O): at each time you choose an action \(a_t\), then observe \(o_t\). A test is a finite action sequence paired with a finite observation sequence; its value is the probability that, if you execute those actions from now, you will see those observations. A history is the past action/observation stream.

A PSR defines the system’s state at time \(t\) as a vector of such test probabilities: \[ \mathbf s_t := \big(\mathbb P(\text{test } q_i \text{ succeeds} \mid \text{history } h_t)\big)_{i=1}^d, \] for a chosen set of “core tests” \(q_1,\dots,q_d\). Updates are performed by conditioning on the latest action/observation, typically via linear operators in the classical finite setting (Littman and Sutton 2001; Singh, James, and Rudary 2012; Boots and Gordon 2011).

Two key claims in the PSR line:

  • PSRs are an alternative to latent-state POMDP belief states: they are grounded in observable predictions and can be at least as compact as minimal latent models in relevant senses (Littman and Sutton 2001; Boots and Gordon 2011).

  • There is a clean linear-algebraic object—the system-dynamics matrix (rows = histories, columns = tests, entries = conditional test probabilities)—whose rank (“linear dimension”) governs the minimal size of linear PSRs and yields a route to spectral learning (Singh, James, and Rudary 2012; Boots and Gordon 2011).

Learning: modern PSR/TPSR work leverages spectral/subspace algorithms (linear algebra on empirical moments) that are statistically consistent and avoid EM’s local-optimum pathology, at least under standard assumptions (Boots and Gordon 2011).
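A toy rendering of the system-dynamics matrix for an uncontrolled process (my own construction, using the golden-mean example again): rows are histories, columns are tests (future words), entries are conditional test probabilities, and the matrix rank recovers the linear dimension.

```python
import itertools
import numpy as np

# Exact word probabilities for the golden-mean process (my toy example).
p = 0.5
T = {0: np.array([[p, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 1.0 - p], [0.0, 0.0]])}
pi = np.array([2.0, 1.0]) / 3.0  # stationary state distribution

def prob(word):
    v = pi
    for a in word:
        v = v @ T[a]
    return float(v.sum())

# System-dynamics matrix: rows = histories, columns = tests (future words),
# entries = P(test | history).  No actions here: the process is uncontrolled.
words = list(itertools.product([0, 1], repeat=3))
histories = [h for h in words if prob(h) > 0]
D = np.array([[prob(h + t) / prob(h) for t in words] for h in histories])

rank = np.linalg.matrix_rank(D, tol=1e-10)
print(D.shape, rank)  # rank 2: linear dimension matches the 2 causal states
```

Any two linearly independent columns could serve as core tests here; spectral learning methods estimate such a matrix from data and factor it rather than computing it exactly.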

9 CM vs PSRs: how they line up

They are closer than the naming suggests:

  • CM’s causal state \(S_t\) is, in effect, the entire conditional law of the future given the past, modulo equality. That is an (often infinite-dimensional) object.

  • A PSR state \(\mathbf s_t\) is a coordinate chart on (part of) that conditional law: you pick finitely many tests/features of the future and track their conditional probabilities. If your chosen tests span the relevant space (in the linear PSR sense), you can update and predict consistently within that model class (Singh, James, and Rudary 2012; Boots and Gordon 2011).

A useful way to think about it:

  • CM is canonical and exact (given the true process), but can be large/infinite and learning it is delicate.
  • PSRs are explicitly model-class/feature dependent, but often yield finite-dimensional, learnable representations with efficient spectral methods, especially in controlled settings.

10 Minimal reading path

11 References

Atad, Mediano, Rosas, et al. 2025. “Meditation and Complexity: A Review and Synthesis of Evidence.”
Ay, Bernigau, Der, et al. 2012. “Information-Driven Self-Organization: The Dynamical System Approach to Autonomous Robot Behavior.” Theory in Biosciences.
Barnett, and Seth. 2023. “Dynamical Independence: Discovering Emergent Macroscopic Processes in Complex Dynamical Systems.” Physical Review E.
Bialek, Nemenman, and Tishby. 2001. “Complexity Through Nonextensivity.” Physica A: Statistical and Theoretical Physics.
———. 2006. “Predictability, Complexity, and Learning.” Neural Computation.
Boots, and Gordon. 2011. “An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems.” Proceedings of the AAAI Conference on Artificial Intelligence.
Boyd, A. B., Crutchfield, and Gu. 2020. “Thermodynamic Machine Learning Through Maximum Work Production.”
Boyd, Alexander, Nowak, Hyland, et al. 2025. “From Monoliths to Modules: Decomposing Transducers for Efficient World Modelling.”
Carleo, Cirac, Cranmer, et al. 2019. “Machine Learning and the Physical Sciences.” Reviews of Modern Physics.
Crutchfield, James P. 1992. “Semantics and Thermodynamics.” In Nonlinear Modeling and Forecasting. Santa Fe Institute Studies in the Sciences of Complexity, XII.
Crutchfield, James P. 1994. “The Calculi of Emergence: Computation, Dynamics and Induction.” Physica D: Nonlinear Phenomena.
Crutchfield, James P., and Jurgens. 2025. “Agentic Information Theory: Ergodicity and Intrinsic Semantics of Information Processes.”
Crutchfield, and Young. 1988. “Computation at the Onset of Chaos.”
Crutchfield, James P., and Young. 1989. “Inferring Statistical Complexity.” Physical Review Letters.
Gourieroux, and Jasiak. 2015. “Filtering, Prediction and Simulation Methods for Noncausal Processes.” Journal of Time Series Analysis.
Kaplanis, Mediano, and Rosas. 2023. “Learning Causally Emergent Representations.” In.
Klyubin, Polani, and Nehaniv. 2005. “Empowerment: A Universal Agent-Centric Measure of Control.” In 2005 IEEE Congress on Evolutionary Computation.
Kolchinsky, Tracey, and Wolpert. 2019. “Nonlinear Information Bottleneck.” Entropy.
Littman, and Sutton. 2001. “Predictive Representations of State.” In Advances in Neural Information Processing Systems.
Lizier, Joseph T, and Prokopenko. 2010. “Differentiating Information Transfer and Causal Effect.” The European Physical Journal B - Condensed Matter and Complex Systems.
Lizier, Joseph, Prokopenko, and Zomaya. 2007. “Detecting Non-Trivial Computation in Complex Dynamics.” In Advances in Artificial Life. Lecture Notes in Computer Science.
Lizier, Joseph T, Prokopenko, and Zomaya. 2008. “Local Information Transfer as a Spatiotemporal Filter for Complex Systems.” Physical Review E.
Marzen, and Crutchfield. 2020. “Inference, Prediction, and Entropy-Rate Estimation of Continuous-Time, Discrete-Event Processes.” arXiv:2005.03750 [Cond-Mat, Physics:nlin, Stat].
Milinkovic, Barnett, Carter, et al. 2025. “Capturing the Emergent Dynamical Structure in Biophysical Neural Models.” PLOS Computational Biology.
Mitchell, M., Crutchfield, and Hraber. 1993. “Dynamics, Computation, and the ‘Edge of Chaos’: A Re-Examination.”
Mitchell, Melanie, Hraber, and Crutchfield. 1993. “Revisiting the Edge of Chaos: Evolving Cellular Automata to Perform Computations.” arXiv:adap-Org/9303003.
Moshkovitz, and Tishby. 2017. “A General Memory-Bounded Learning Algorithm.” arXiv:1712.03524 [Cs].
Polani. 2008. “Foundations and Formalizations of Self-Organization.” Advances in Applied Self-Organizing Systems.
Prokopenko, Ay, Obst, et al. 2010. “Phase Transitions in Least-Effort Communications.” Journal of Statistical Mechanics: Theory and Experiment.
Prokopenko, Boschetti, and Ryan. 2009. “An Information-Theoretic Primer on Complexity, Self-Organization, and Emergence.” Complexity.
Prokopenko, Davies, Harré, et al. 2024. “Biological Arrow of Time: Emergence of Tangled Information Hierarchies and Self-Modelling Dynamics.”
Riechers, Bigelow, Alt, et al. 2025. “Next-Token Pretraining Implies in-Context Learning.”
Ron, Singer, and Tishby. 1996. “The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length.” Machine Learning.
Rosas, Fernando, Boyd, and Baltieri. 2025. “AI in a Vat: Fundamental Limits of Efficient World Modelling for Agent Sandboxing and Interpretability.” In.
Rosas, Fernando E., Geiger, Luppi, et al. 2024. “Software in the Natural World: A Computational Approach to Hierarchical Emergence.”
Salge, Glackin, and Polani. 2014. “Empowerment–An Introduction.” In Guided Self-Organization: Inception.
Sato, and Crutchfield. 2003. “Coupled Replicator Equations for the Dynamics of Learning in Multiagent Systems.” Physical Review E.
Shai, Marzen, Teixeira, et al. 2025. “Transformers Represent Belief State Geometry in Their Residual Stream.”
Shalizi, and Crutchfield. 2000. “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.”
Shalizi, and Crutchfield. 2002. “Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems.
Singh, James, and Rudary. 2012. “Predictive State Representations: A New Theory for Modeling Dynamical Systems.”
Strelioff, Crutchfield, and Hübler. 2007. “Inferring Markov Chains: Bayesian Estimation, Model Comparison, Entropy Rate, and Out-of-Class Modeling.” Phys. Rev. E.
Tishby, and Polani. 2011. “Information Theory of Decisions and Actions.” In PERCEPTION-ACTION CYCLE.
Wolpert, David H. 2006a. “Information Theory — The Bridge Connecting Bounded Rational Game Theory and Statistical Physics.” In Complex Engineered Systems. Understanding Complex Systems.
———. 2006b. “What Information Theory Says about Bounded Rational Best Response.” In The Complex Networks of Economic Interactions. Lecture Notes in Economics and Mathematical Systems 567.
Wolpert, David H. 2008. “Physical Limits of Inference.” Physica D: Nonlinear Phenomena, Novel Computing Paradigms: Quo Vadis?.
———. 2018. “Overview of Information Theory, Computer Science Theory, and Stochastic Thermodynamics for Thermodynamics of Computation.”
———. 2019. “Stochastic Thermodynamics of Computation.”
Wolpert, David, and Scharnhorst. 2024. “Stochastic Process Turing Machines.”