Entropy vs information
MaxEnt(?), macrostates, subjective updating, epistemic randomness, Szilard engines, Gibbs paradox…
2010-12-01 — 2026-02-14
Wherein the relation of entropy and information is examined, and macrostates are shown to be defined as the unique Markovian partitions of phase space, linking thermodynamic notions to predictive causal states.
Related somehow: algorithmic statistics, information geometry.
Computational mechanics is a theory of “state” for stochastic processes, built around a criterion for state that is not what I learned in my state estimation classes back in the day: state is, rather, whatever we need for optimal prediction from data. It formalizes what it means for a process to store information about its past that is relevant to its future, and it gives a canonical predictive model (the ε-machine) plus information-theoretic invariants (e.g. statistical complexity).
See Shalizi and Crutchfield (2000).
1 Intellectual precedents
Several lines of thought converge here.
1.1 Physics stuff
Especially Statistical mechanics and dynamical systems. In physics we often want a reduced description that is predictive for macroscopic observables. Markov partitions, symbolic dynamics, ergodic theory, and the general “coarse-grain but preserve what matters” instinct sit in the background of CM. CM explicitly links its invariants to ergodic/information-theoretic quantities (Shalizi and Crutchfield 2000).
1.2 Information theory stuff
Computational mechanics is natively expressed in conditional distributions, entropies, mutual informations, and minimal sufficient statistics. In practice, Shannon information is treated as the canonical quantification of these quantities (Shalizi and Crutchfield 2000).
1.3 Formal language stuff
Automata and grammar inference. The ε-machine is a probabilistic, minimal predictive automaton: states are equivalence classes of pasts, transitions are labeled by symbols.
2 Intellectual competitors
Competing modelling formalisms include:
A. History-based models. \(k\)-th order Markov, variable-order Markov, suffix trees/contexts. These approximate state by a bounded window of the past.
B. Latent-state generative models. Hidden Markov models (HMMs), state-space models, dynamic Bayes nets; with control: POMDPs. These posit hidden variables that generate observations.
C. Predictive/observable-state models. Predictive State Representations (PSRs), Observable Operator Models (OOMs), and spectral/subspace system identification methods. These also define “state” via predictions of future observables, as coordinates in a finite-dimensional feature space Boots and Gordon (2011).
CM’s distinguishing move is: define state as the minimal statistic of the past that preserves the entire conditional law of the future (not just finite horizons or chosen features), then study the structure and invariants of that object (Shalizi and Crutchfield 2000).
OK, to put that idea into action, let us roll forward.
3 Let’s start with stationary processes
Let \((X_t)_{t\in\mathbb Z}\) be a stationary discrete-time stochastic process over a finite alphabet \(\mathcal A\). Define the semi-infinite past and future at time \(t\): \[ \overleftarrow{X}_t := (\ldots, X_{t-2}, X_{t-1}),\qquad \overrightarrow{X}_t := (X_t, X_{t+1}, \ldots). \] Measure-theoretically, conditioning on a past \(\overleftarrow{x}\) gives a probability measure on the σ-algebra of futures.
The prediction problem is to characterize the conditional law \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}). \]
4 Causal states are predictive equivalence classes of histories
Define an equivalence relation on pasts: \[ \overleftarrow{x}\sim \overleftarrow{x}' \quad\Longleftrightarrow\quad \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}'). \] A causal state is an equivalence class under \(\sim\). Let \[ S_t := \epsilon(\overleftarrow{X}_t) \] be the random variable obtained by mapping a realized past to its class (the “causal-state map” \(\epsilon\)).
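The equivalence relation can be made concrete on a toy example. The sketch below assumes the Golden Mean Process (binary, the word 11 forbidden, a fair coin after each 0); for this process the next-symbol law already separates the equivalence classes, though in general the whole future law is required. The process, its parameters, and the helper below are our illustrative choices, not from the text.

```python
from itertools import product

def next_symbol_law(past):
    """Conditional next-symbol distribution (P(0), P(1)) given a past.

    The Golden Mean Process is order-1, so the last symbol determines
    the entire future law: a 1 forces a 0; after a 0 we flip a fair coin.
    """
    if past[-1] == 1:
        return (1.0, 0.0)
    return (0.5, 0.5)

# Group all admissible length-3 pasts by their predictive distribution.
classes = {}
for past in product([0, 1], repeat=3):
    if any(past[i] == past[i + 1] == 1 for i in range(2)):
        continue  # the word 11 never occurs in this process
    classes.setdefault(next_symbol_law(past), []).append(past)

# Five admissible pasts collapse into two predictive equivalence
# classes, i.e. two causal states.
assert len(classes) == 2
```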
Key consequences:
Firstly, Prescience (sufficiency for prediction). Conditioning on \(S_t\) is as informative for the entire future as conditioning on the full past: \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid S_t) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t), \] equivalently \(H(\overrightarrow{X}_t \mid S_t)=H(\overrightarrow{X}_t \mid \overleftarrow{X}_t)\).
Secondly, Minimality (no redundant memory). Among all statistics \(R_t=\eta(\overleftarrow{X}_t)\) that are equally prescient, \(S_t\) is a minimal sufficient statistic: any prescient \(R_t\) refines (maps to) \(S_t\) a.s. (Shalizi and Crutchfield 2000).
This is the sense in which CM defines the “right” state: the one forced by predictive equivalence.
Sufficient statistics here are defined in an information-theoretic way: \(R_t\) is sufficient for predicting \(\overrightarrow{X}_t\) if \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) (i.e. \(R_t\) is conditionally independent of the future given \(S_t\)).
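The condition \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) can be verified exactly on a small example. The sketch below again assumes the Golden Mean Process, taking \(R\) to be the last two symbols and \(S\) the causal state (determined by the last symbol alone); all distributions here are our toy computation.

```python
import numpy as np

# Stationary distribution over length-2 words (a, b) of the Golden
# Mean Process, and the next-symbol law given the last symbol b.
p_word = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 0.0}
p_next = lambda b: {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}[b]

# Joint over (S, R, X): the causal state S is b, the last symbol of R.
joint = {}
for (a, b), pr in p_word.items():
    for x, px in p_next(b).items():
        joint[(b, (a, b), x)] = pr * px

def marginal(keys):
    m = {}
    for (s, r, x), p in joint.items():
        k = tuple({'s': s, 'r': r, 'x': x}[c] for c in keys)
        m[k] = m.get(k, 0.0) + p
    return m

ps, psr, psx = marginal('s'), marginal('sr'), marginal('sx')

# Conditional mutual information I(X; R | S).
cmi = sum(p * np.log2(p * ps[(s,)] / (psr[(s, r)] * psx[(s, x)]))
          for (s, r, x), p in joint.items() if p > 0)
assert abs(cmi) < 1e-12   # R adds nothing beyond S
```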
These look a lot like the sufficient statistics of exponential families, which we presumably recover for sufficiently nice models.
Quibble: it feels off to me to frame this conditional-independence property first and foremost as informational. Sure, \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) is an information-theoretic condition, but there are many other ways of expressing conditional independence which do not incline us to estimate entropies and mutual informations; those estimates are notoriously cursed, and have all kinds of pathologies when the models are not perfect, which in practice we often wish to circumvent.
5 ε-machines
The ε-machine is the state-transition structure induced by the causal states; it is the canonical predictive model.
For each symbol \(a\in\mathcal A\), define labeled transition probabilities \[ T^{(a)}_{ij} := \mathbb P(X_t=a,\, S_{t+1}=s_j \mid S_t=s_i). \] You can view this as a labeled directed graph (a probabilistic automaton) whose nodes are causal states.
Term of art: unifilar.
A central structural property in CM is unifilarity: the next causal state is determined by the current causal state and the next observed symbol (i.e. state updates are deterministic given the observation symbol).
Let \((S_t, X_t)\) be an edge-emitting HMM with finite state set \(\mathcal S=\{1,\dots,n\}\) and output alphabet \(\mathcal A\). Define the symbol-labeled transition matrices \[ T^{(x)} \in [0,1]^{n\times n},\qquad T^{(x)}_{ij} := \mathbb P\!\left(X_t=x,\ S_{t+1}=j\ \middle|\ S_t=i\right), \quad x\in\mathcal A. \]
We write (right-resolving) unifilarity as follows: \[ \forall i\in\mathcal S,\ \forall x\in\mathcal A:\quad \left|\left\{j\in\mathcal S:\ T^{(x)}_{ij}>0\right\}\right|\le 1. \tag{U1} \] Equivalently, each row of each \(T^{(x)}\) has at most one nonzero entry.
Equivalent deterministic update map. There exists a (partial) function \(\delta:\mathcal S\times\mathcal A\to\mathcal S\) such that \[ T^{(x)}_{ij}=0 \quad\text{for all } j\neq \delta(i,x), \tag{U2} \] and whenever \(\mathbb P(X_t=x\mid S_t=i)>0\), \[ \mathbb P\!\left(S_{t+1}=\delta(i,x)\ \middle|\ S_t=i,\ X_t=x\right)=1. \tag{U3} \] (If \(\mathbb P(X_t=x\mid S_t=i)=0\), \(\delta(i,x)\) can be left undefined.)
Equivalent conditional-entropy statement (under the usual “ignore zero-probability conditioning” convention): \[ H(S_{t+1}\mid S_t, X_t)=0. \tag{U4} \]
Graph form. For each state \(i\) and symbol \(x\), there is at most one outgoing edge from \(i\) labeled \(x\); its destination (if it exists) is \(\delta(i,x)\).
This makes filtering straightforward: as we observe symbols we track a single causal-state trajectory rather than a distribution over hidden states. CM develops optimality/uniqueness results around this representation (Shalizi and Crutchfield 2000).
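The conditions (U1)–(U3) and the single-trajectory filtering can be sketched concretely. Below is our toy instantiation for the Golden Mean Process ε-machine (state A = 0 “just emitted a 0”, B = 1 “just emitted a 1”); the names and matrices are illustrative assumptions, not from the text.

```python
import numpy as np

# Symbol-labeled transition matrices T^(x) for the Golden Mean Process.
T = {0: np.array([[0.5, 0.0],     # from A, emit 0, stay in A
                  [1.0, 0.0]]),   # from B, emit 0, go to A (forced)
     1: np.array([[0.0, 0.5],     # from A, emit 1, go to B
                  [0.0, 0.0]])}   # from B, emitting 1 is forbidden

# Unifilarity (U1): every row of every T^(x) has at most one nonzero.
assert all(int((Tx > 0).sum(axis=1).max()) <= 1 for Tx in T.values())

def delta(i, x):
    """Deterministic update map (U2); None where P(x | state i) = 0."""
    js = np.nonzero(T[x][i])[0]
    return int(js[0]) if len(js) else None

# Filtering = tracking a single state trajectory, not a belief state.
s, traj = 0, []
for x in [0, 1, 0, 1]:
    s = delta(s, x)
    traj.append(s)
assert traj == [0, 1, 0, 1]   # A -> A -> B -> A -> B
```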
By contrast, HMM states are latent and generally non-identifiable without additional assumptions; causal states are defined from observable conditional laws. Also, a minimal HMM (fewest hidden states) need not be unifilar; CM’s ε-machine is the minimal unifilar optimal predictor (conceptually: minimal “predictive memory” with deterministic update).
6 Intrinsic information measures
Given the causal-state process \((S_t)\), CM defines invariants that separate randomness from structure.
Entropy rate \[ h_\mu := \lim_{L\to\infty} \frac{1}{L} H(X_{t:t+L-1}\mid \overleftarrow{X}_t) = H(X_t \mid \overleftarrow{X}_t) = H(X_t \mid S_t), \] where the last equality uses prescience (Shalizi and Crutchfield 2000).
Statistical complexity \[ C_\mu := H(S_t). \] Interpretation: the average amount of information the process stores about the past that is relevant to predicting the future (“predictive memory”) (Shalizi and Crutchfield 2000).
Excess entropy / predictive information \[ \mathbf E := I(\overleftarrow{X}_t;\overrightarrow{X}_t), \] the total past–future mutual information. In general \(\mathbf E \le C_\mu\); the gap is related to “hidden” internal organization not directly manifest as past–future mutual information (often discussed under labels like crypticity in the CM literature) (Shalizi and Crutchfield 2000).
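For small ε-machines these invariants can be computed in closed form. A sketch for the Golden Mean Process (our running toy example; the matrices and state names are assumptions, not from the text):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# State-to-state transition matrix (symbols marginalized out):
# from A emit 0 or 1 with prob 1/2 each; from B a 0 is forced.
P = np.array([[0.5, 0.5],
              [1.0, 0.0]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()                                  # pi = (2/3, 1/3)

C_mu = H(pi)                                        # ~0.918 bits
h_mu = pi[0] * H([0.5, 0.5]) + pi[1] * H([1.0])     # H(X_t | S_t) = 2/3 bit

assert abs(C_mu - 0.9183) < 1e-3
assert abs(h_mu - 2/3) < 1e-9
```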
7 Inference/learning in CM
In practice, we do not know the true conditional laws, so we estimate them and merge pasts whose estimated predictive distributions are close (with statistical tests, Bayesian nonparametrics, or algorithmic heuristics). CM emphasizes that the target is the causal-state partition and ε-machine, justified by the optimality/minimality results (Shalizi and Crutchfield 2000). The hard cases are exactly those where the ε-machine has many (possibly countably infinite) causal states, or where finite data makes near-equivalences ambiguous.
8 Predictive State Representations (PSRs)
“state as a vector of predictions”
PSRs generalize the same predictive instinct to (often controlled) dynamical systems.
Setup (controlled I/O): at each time you choose an action \(a_t\), then observe \(o_t\). A test is a finite action sequence paired with a finite observation sequence; its value is the probability that, if you execute those actions from now, you will see those observations. A history is the past action/observation stream.
A PSR defines the system’s state at time \(t\) as a vector of such test probabilities: \[ \mathbf s_t := \big(\mathbb P(\text{test } q_i \text{ succeeds} \mid \text{history } h_t)\big)_{i=1}^d, \] for a chosen set of “core tests” \(q_1,\dots,q_d\). Updates are performed by conditioning on the latest action/observation, typically via linear operators in the classical finite setting (Littman and Sutton 2001; Singh, James, and Rudary 2012; Boots and Gordon 2011).
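The linear update can be sketched with observable-operator style matrices. The instantiation below is our hypothetical two-dimensional example for an uncontrolled binary process (again the Golden Mean Process, so predictions can be checked by hand); the general linear PSR update has the same form, state \(\mapsto\) operator times state, renormalized.

```python
import numpy as np

# Observable operators for the two symbols (rows/cols index the two
# core predictive dimensions; values are our toy instantiation).
M = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 0.5], [0.0, 0.0]])}
ones = np.ones(2)

def prob_next(s, o):
    """P(next symbol = o | current state vector s)."""
    return float(s @ M[o] @ ones)

def update(s, o):
    """Condition the state on having observed o, then renormalize."""
    v = s @ M[o]
    return v / (v @ ones)

s = np.array([2/3, 1/3])                     # stationary state vector
assert abs(prob_next(s, 1) - 1/3) < 1e-12    # a 1 occurs w.p. 1/3
s = update(s, 0)                             # after a 0, state is pure A
assert np.allclose(s, [1.0, 0.0])
assert abs(prob_next(s, 1) - 0.5) < 1e-12    # fair coin from state A
```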
Two key claims in the PSR line:
PSRs are an alternative to latent-state POMDP belief states: they are grounded in observable predictions and can be at least as compact as minimal latent models in relevant senses (Littman and Sutton 2001; Boots and Gordon 2011).
There is a clean linear-algebraic object—the system-dynamics matrix (rows = histories, columns = tests, entries = conditional test probabilities)—whose rank (“linear dimension”) governs the minimal size of linear PSRs and yields a route to spectral learning (Singh, James, and Rudary 2012; Boots and Gordon 2011).
Learning: modern PSR/TPSR work leverages spectral/subspace algorithms (linear algebra on empirical moments) that are statistically consistent and avoid EM’s local-optimum pathology, at least under standard assumptions (Boots and Gordon 2011).
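The system-dynamics matrix idea can be illustrated with a small finite block whose rank gives the linear dimension. The entries below are our hand-computed conditional test probabilities for the Golden Mean Process, not values from the text.

```python
import numpy as np

# Rows = histories, columns = tests (future words), entries =
# P(test | history). For the Golden Mean Process a history matters only
# through its last symbol, so distinct rows come in just two kinds.
tests = ["0", "1", "00", "01", "10", "11"]
row_after_0 = [0.5, 0.5, 0.25, 0.25, 0.5, 0.0]   # history ending in 0
row_after_1 = [1.0, 0.0, 0.5, 0.5, 0.0, 0.0]     # history ending in 1
D = np.array([row_after_0,                       # history "0"
              row_after_1,                       # history "1"
              row_after_0])                      # history "00", etc.

# The linear dimension is the rank of the (infinite) matrix; here 2,
# matching the two causal states.
assert np.linalg.matrix_rank(D) == 2
```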
9 CM vs PSRs: how they line up
They are closer than the naming suggests:
CM’s causal state \(S_t\) is, in effect, the entire conditional law of the future given the past, modulo equality. That is an (often infinite-dimensional) object.
A PSR state \(\mathbf s_t\) is a coordinate chart on (part of) that conditional law: you pick finitely many tests/features of the future and track their conditional probabilities. If your chosen tests span the relevant space (in the linear PSR sense), you can update and predict consistently within that model class (Singh, James, and Rudary 2012; Boots and Gordon 2011).
A useful way to think about it:
- CM is canonical and exact (given the true process), but can be large/infinite and learning it is delicate.
- PSRs are explicitly model-class/feature dependent, but often yield finite-dimensional, learnable representations with efficient spectral methods, especially in controlled settings.
10 Minimal reading path
- CM foundations and optimality/uniqueness of causal states and ε-machines: (Shalizi and Crutchfield 2000).
- PSR foundations and the “two dominant approaches” (history vs generative) framing: (Littman and Sutton 2001).
- PSRs as a general theory with system-dynamics matrix and comparisons to Markov/HMM/POMDP: (Singh, James, and Rudary 2012).
- Spectral PSR/TPSR learning perspective and terminology (tests, histories) in a compact form: (Boots and Gordon 2011).
