Computational mechanics
the intersection somehow between macrostates, subjective updating, epistemic randomness, Szilard engines, Gibbs paradox…
2010-12-01 — 2026-02-14
Wherein computational mechanics is set forth: causal states are defined as pasts sharing a future law, an ε-machine is induced as a unifilar automaton, and Cμ is taken as stored predictive memory.
Related, somehow: algorithmic statistics, information geometry.
Computational mechanics is a theory of “state” for stochastic processes built around a criterion for state that wasn’t what I learned in my state-estimation classes back in the day: state is, rather, what we need for optimal prediction from data. It formalizes what it means for a process to store information about its past that is relevant for its future, and it gives a canonical predictive model (the ε-machine) plus information-theoretic invariants (e.g. statistical complexity).
See (Shalizi and Crutchfield 2000).
1 Intellectual precedents
Several lines of thought converge here.
1.1 Physics stuff
Especially statistical mechanics and dynamical systems. In physics, we often want a reduced description that is predictive of macroscopic observables. Markov partitions, symbolic dynamics, ergodic theory, and the general “coarse-grain but preserve what matters” instinct sit in the background of CM. CM explicitly links its invariants to ergodic/information-theoretic quantities (Shalizi and Crutchfield 2000).
1.2 Information theory stuff
Computational mechanics is natively expressed in conditional distributions, entropies, mutual information, and minimal sufficient statistics. In practice, the CM literature treats Shannon information as the canonical quantification of these quantities (Shalizi and Crutchfield 2000).
1.3 Formal language stuff
Automata and grammar inference. The ε-machine is a probabilistic, minimal predictive automaton: states are equivalence classes of pasts, transitions are labeled by symbols.
2 Intellectual competitors
Competing modelling formalisms include:
A. History-based models. \(k\)-th order Markov, variable-order Markov, suffix trees/contexts. These approximate state with a bounded window of the past.
B. Latent-state generative models. Hidden Markov models (HMMs), state-space models, dynamic Bayes nets; with control: POMDPs. These posit hidden variables that generate observations.
C. Predictive/observable-state models. Predictive State Representations (PSRs), Observable Operator Models (OOMs), and spectral/subspace system identification methods. These also define “state” via predictions of future observables, as coordinates in a finite-dimensional feature space (Boots and Gordon 2011).
CM’s distinguishing move is: define state as the minimal statistic of the past that preserves the full conditional law of the future (not just finite horizons or chosen features), then study the structure and invariants of that object (Shalizi and Crutchfield 2000).
OK—to put that idea into action, let’s roll forward.
3 Let’s start with stationary processes
Let \((X_t)_{t\in\mathbb Z}\) be a stationary discrete-time stochastic process over a finite alphabet \(\mathcal A\). Define the semi-infinite past and future at time \(t\): \[ \overleftarrow{X}_t := (\ldots, X_{t-2}, X_{t-1}),\qquad \overrightarrow{X}_t := (X_t, X_{t+1}, \ldots). \] Measure-theoretically, conditioning on a past \(\overleftarrow{x}\) gives a probability measure on the σ-algebra of futures.
The prediction problem is to characterize the conditional law \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}). \]
4 Causal states are predictive equivalence classes of histories
Let’s define an equivalence relation on pasts: \[ \overleftarrow{x}\sim \overleftarrow{x}' \quad\Longleftrightarrow\quad \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}'). \] A causal state is an equivalence class under the relation \(\sim\). Let \[ S_t := \epsilon(\overleftarrow{X}_t) \] be the random variable obtained by mapping a realized past to its class (the “causal-state map” \(\epsilon\)).
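To make the equivalence classes concrete, here is a minimal sketch that estimates causal states from data for the Golden Mean Process (binary, never two 1s in a row; my choice of example, not from the papers cited here). It groups length-\(k\) pasts whose empirical next-symbol distributions agree within a tolerance. The function names and the crude L1-style merge rule are hypothetical, stand-ins for a proper statistical test.

```python
import random
from collections import defaultdict, Counter

def golden_mean_sample(n, p=0.5, seed=0):
    # Golden Mean Process: binary, "11" never occurs.
    # State A emits 1 (prob p, -> B) or 0 (prob 1-p, -> A); state B must emit 0 (-> A).
    rng = random.Random(seed)
    state, out = "A", []
    for _ in range(n):
        if state == "A":
            if rng.random() < p:
                out.append(1); state = "B"
            else:
                out.append(0)
        else:
            out.append(0); state = "A"
    return out

def empirical_causal_states(x, k=2, tol=0.05):
    # Estimate P(next symbol | last k symbols), then merge pasts whose
    # estimated predictive distributions agree within tol (crude threshold rule).
    counts = defaultdict(Counter)
    for i in range(k, len(x)):
        counts[tuple(x[i - k:i])][x[i]] += 1
    preds = {h: c[1] / sum(c.values()) for h, c in counts.items()}
    states = []  # list of (representative P(1), member pasts)
    for h, p1 in sorted(preds.items()):
        for s in states:
            if abs(s[0] - p1) < tol:
                s[1].append(h)
                break
        else:
            states.append((p1, [h]))
    return states

for p1, members in empirical_causal_states(golden_mean_sample(100_000)):
    print(round(p1, 2), members)
```

With enough data this recovers the two causal states: pasts ending in 1 (which force a 0 next) and pasts ending in 0 (which emit 1 with probability \(p\)); the impossible past \((1,1)\) never appears.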
Key consequences:
First, Prescience (sufficiency for prediction). Conditioning on \(S_t\) is just as informative about the entire future as conditioning on the full past: \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid S_t) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t). \] Equivalently, \(H(\overrightarrow{X}_t \mid S_t)=H(\overrightarrow{X}_t \mid \overleftarrow{X}_t)\).
Secondly, Minimality (no redundant memory). Among all equally prescient statistics \(R_t=\eta(\overleftarrow{X}_t)\), \(S_t\) is a minimal sufficient statistic: any prescient \(R_t\) refines (maps to) \(S_t\) a.s. (Shalizi and Crutchfield 2000).
This is the sense in which CM defines the “right” state: the one forced by predictive equivalence.
Sufficient statistics here are defined in an information-theoretic way: \(R_t\) is sufficient for predicting \(\overrightarrow{X}_t\) if \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) (i.e. \(R_t\) is conditionally independent of the future given \(S_t\)).
These look a lot like the sufficient statistics of exponential families, which we presumably recover for sufficiently nice models.
Quibble: It feels off to me to talk about this, first and foremost, in terms of this conditional independence property as informational. Sure, \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) is an information-theoretic condition, but there are plenty of other ways of expressing conditional independence that don’t make us think about estimating entropies and mutual information, which are notoriously cursed and come with all kinds of annoying problems when our models aren’t perfect—problems we often want to circumvent in practice.
5 ε-machines
The ε-machine is the state-transition structure induced by causal states, giving us the canonical predictive model.
For each symbol \(a\in\mathcal A\), define labelled transition probabilities \[ T^{(a)}_{ij} := \mathbb P(X_t=a,\, S_{t+1}=s_j \mid S_t=s_i). \] We can view this as a labelled directed graph (a probabilistic automaton) whose nodes are causal states.
Term of art: unifilar.
A central structural property in CM is unifilarity: the next causal state is determined by the current causal state and the next observed symbol (i.e. state updates are deterministic given the observed symbol).
Let \((S_t, X_t)\) be an edge-emitting HMM with finite state set \(\mathcal S=\{1,\dots,n\}\) and output alphabet \(\mathcal A\). Define the symbol-labelled transition matrices \[ T^{(x)} \in [0,1]^{n\times n},\qquad T^{(x)}_{ij} := \mathbb P\!\left(X_t=x,\ S_{t+1}=j\ \middle|\ S_t=i\right), \quad x\in\mathcal A. \]
We write (right-resolving) unifilar as follows: \[ \forall i\in\mathcal S,\ \forall x\in\mathcal A:\quad \left|\left\{j\in\mathcal S:\ T^{(x)}_{ij}>0\right\}\right|\le 1. \tag{U1} \] Equivalently, each row of \(T^{(x)}\) has at most one nonzero entry.
Equivalent deterministic update map. There exists a (partial) function \(\delta:\mathcal S\times\mathcal A\to\mathcal S\) such that \[ T^{(x)}_{ij}=0 \quad\text{for all } j\neq \delta(i,x), \tag{U2} \] and whenever \(\mathbb P(X_t=x\mid S_t=i)>0\), \[ \mathbb P\!\left(S_{t+1}=\delta(i,x)\ \middle|\ S_t=i,\ X_t=x\right)=1. \tag{U3} \] Equivalent conditional entropy statement (under the usual “ignore conditioning on zero-probability events” convention): \[ H(S_{t+1}\mid S_t, X_t)=0. \tag{U4} \]
Graph form. For each state \(i\) and symbol \(x\), there is at most one outgoing edge from \(i\) labelled \(x\); its destination (if it exists) is \(\delta(i,x)\).
This makes filtering straightforward: as we observe symbols we track a single causal-state trajectory rather than a distribution over hidden states. CM develops optimality/uniqueness results around this representation (Shalizi and Crutchfield 2000).
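As a concrete illustration of (U1)–(U3) and unifilar filtering, here is a sketch using the ε-machine of the Golden Mean Process (binary, never two 1s in a row). The matrices and the `delta` helper are my assumed example, not taken from the cited papers.

```python
import numpy as np

# Golden Mean Process epsilon-machine: two causal states A=0, B=1, p = 1/2.
# T[x][i, j] = P(emit x, go to state j | in state i).
p = 0.5
T = {
    0: np.array([[1 - p, 0.0],   # A --0--> A
                 [1.0,   0.0]]), # B --0--> A
    1: np.array([[0.0, p],       # A --1--> B
                 [0.0, 0.0]]),   # B cannot emit 1
}

# Unifilarity check (U1): each row of each T[x] has at most one nonzero entry.
assert all((np.count_nonzero(T[x], axis=1) <= 1).all() for x in T)

# Deterministic (partial) update map delta(i, x) recovered from the matrices (U2).
def delta(i, x):
    js = np.nonzero(T[x][i])[0]
    return int(js[0]) if len(js) else None  # None: symbol impossible in state i

# Filtering a word tracks a single state trajectory, not a belief distribution.
state = 0
for x in [0, 1, 0, 0, 1]:
    state = delta(state, x)
print("final state:", state)  # prints "final state: 1"
```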
By contrast, HMM states are latent and generally non-identifiable without additional assumptions; causal states are defined from observable conditional laws. Also, a minimal HMM (fewest hidden states) need not be unifilar; CM’s ε-machine is the minimal unifilar optimal predictor (conceptually: minimal “predictive memory” with a deterministic update).
6 Intrinsic information measures
Given the causal-state process \((S_t)\), CM defines invariants that separate randomness from structure.
Entropy rate \[ h_\mu := \lim_{L\to\infty} \frac{1}{L} H(X_{t:t+L-1}) = H(X_t \mid \overleftarrow{X}_t) = H(X_t \mid S_t). \] The last equality uses prescience (Shalizi and Crutchfield 2000).
Statistical complexity \[ C_\mu := H(S_t). \] Interpretation: the average amount of information a process stores about the past that’s relevant for predicting the future (“predictive memory”) (Shalizi and Crutchfield 2000).
Excess entropy / predictive information \[ \mathbf E := I(\overleftarrow{X}_t;\overrightarrow{X}_t), \] the total past–future mutual information. Always \(\mathbf E \le C_\mu\); the gap relates to “hidden” internal organization that isn’t directly manifest as past–future mutual information (often discussed under labels like crypticity in the CM literature) (Shalizi and Crutchfield 2000).
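For a worked example, the Golden Mean Process (binary, never two 1s in a row; an assumed example, with \(p=1/2\)) has a two-state ε-machine whose invariants come out in closed form: \(\pi=(2/3,1/3)\), \(h_\mu = 2/3\) bits/symbol, \(C_\mu = H(2/3,1/3)\approx 0.918\) bits. The sketch below recovers these numerically; all names are mine.

```python
import numpy as np

def entropy(q):
    # Shannon entropy in bits, ignoring zero-probability entries.
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return float(-(q * np.log2(q)).sum())

# Golden Mean epsilon-machine: states A, B; p = 1/2.
p = 0.5
# State-transition matrix M[i, j] = sum_x T^{(x)}_{ij}
M = np.array([[1 - p, p],
              [1.0,   0.0]])

# Stationary distribution pi: left eigenvector of M for eigenvalue 1.
w, V = np.linalg.eig(M.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Entropy rate h_mu = sum_i pi_i * H(X_t | S_t = i), using prescience.
emit = np.array([[1 - p, p],   # P(x | state A) for x = 0, 1
                 [1.0,   0.0]])  # P(x | state B)
h_mu = sum(pi[i] * entropy(emit[i]) for i in range(2))

# Statistical complexity C_mu = H(S_t)
C_mu = entropy(pi)

print(f"h_mu = {h_mu:.4f} bits/symbol, C_mu = {C_mu:.4f} bits")
```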
7 Inference/learning in CM
In practice, we don’t know the true conditional laws, so we estimate them and merge pasts whose estimated predictive distributions are close (via statistical tests, Bayesian nonparametrics, or algorithmic heuristics). CM emphasizes that the target is the causal-state partition and ε-machine, justified by the optimality/minimality results (Shalizi and Crutchfield 2000). The hard cases are the ones where the ε-machine has many (possibly countably infinite) causal states, or where finite data makes near-equivalences ambiguous.
8 Predictive State Representations (PSRs)
“state as a vector of predictions”
PSRs take that same predictive instinct and apply it to (often controlled) dynamical systems.
Setup (controlled I/O): at each time step, we choose an action \(a_t\), then observe \(o_t\). A test is a finite action sequence paired with a finite observation sequence; its value is the probability that, if we execute those actions from now, we will see those observations. A history is the past action/observation stream.
A PSR defines the system’s state at time \(t\) as a vector of such test probabilities: \[ \mathbf s_t := \big(\mathbb P(\text{test } q_i \text{ succeeds} \mid \text{history } h_t)\big)_{i=1}^d \] for a chosen set of “core tests” \(q_1,\dots,q_d\). Updates are performed by conditioning on the latest action/observation, typically via linear operators in the classical finite setting (Littman and Sutton 2001; Singh, James, and Rudary 2012; Boots and Gordon 2011).
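A toy illustration of “state as a vector of predictions”, on the Golden Mean Process (binary, never two 1s in a row; an assumed uncontrolled example): take the core tests to be the one-step futures, so the PSR state is \([\mathbb P(0\mid h), \mathbb P(1\mid h)]\). For clarity this sketch derives the test probabilities from the known ε-machine rather than from learned linear operators, so it shows the state representation, not a learning algorithm; all names are mine.

```python
import numpy as np

# Golden Mean epsilon-machine, p = 1/2.
p = 0.5
T = {0: np.array([[1 - p, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, p], [0.0, 0.0]])}
ones = np.ones(2)

def psr_state(weights):
    # PSR state: one-step test probabilities [P(0 | history), P(1 | history)],
    # read off the (unnormalised) causal-state weights for the history so far.
    w = weights / weights.sum()
    return np.array([w @ T[0] @ ones, w @ T[1] @ ones])

weights = np.array([2 / 3, 1 / 3])  # stationary start
for o in [1, 0, 0]:
    print("s_t =", np.round(psr_state(weights), 3), "then observe", o)
    weights = weights @ T[o]  # condition on the latest observation
print("s_t =", np.round(psr_state(weights), 3))
```

After observing a 1, the state pins to \([1, 0]\) (a 0 is forced); after any 0 it returns to \([0.5, 0.5]\), so the prediction vector alone tracks the system.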
Two key claims in the PSR line of work:
PSRs are an alternative to latent-state POMDP belief states: they are grounded in observable predictions and can be at least as compact as minimal latent models in relevant senses (Littman and Sutton 2001; Boots and Gordon 2011).
There is a clean linear-algebraic object—the system-dynamics matrix (rows = histories, columns = tests, entries = conditional test probabilities)—whose rank (“linear dimension”) governs the minimal size of linear PSRs and yields a route to spectral learning (Singh, James, and Rudary 2012; Boots and Gordon 2011).
Learning: modern PSR/TPSR work leverages spectral/subspace algorithms (linear algebra on empirical moments) that are statistically consistent and avoid EM’s local-optimum pathology, at least under standard assumptions (Boots and Gordon 2011).
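The system-dynamics matrix is easy to exhibit for a small uncontrolled process. This sketch builds it exactly for the Golden Mean Process (binary, never two 1s in a row; an assumed example) over short histories and tests, and checks its rank, i.e. the “linear dimension”; all names are mine.

```python
import numpy as np
from itertools import product

# Golden Mean epsilon-machine, p = 1/2, with stationary state distribution pi.
p = 0.5
T = {0: np.array([[1 - p, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, p], [0.0, 0.0]])}
pi = np.array([2 / 3, 1 / 3])

def word_prob(w):
    # P(word w) = pi @ T[w_1] @ ... @ T[w_L] @ 1
    v = pi.copy()
    for x in w:
        v = v @ T[x]
    return v.sum()

# Histories and tests: all binary words of length 1 and 2.
words = [tuple(w) for L in (1, 2) for w in product([0, 1], repeat=L)]

# System-dynamics matrix: D[h, t] = P(observe t next | just observed h).
D = np.array([[word_prob(h + t) / word_prob(h) if word_prob(h) > 0 else 0.0
               for t in words] for h in words])

print("linear dimension (rank of D):", np.linalg.matrix_rank(D))
```

The rank is 2, matching the two causal states: every positive-probability history row equals one of two patterns (last symbol 0 vs last symbol 1), and the impossible history \((1,1)\) contributes a zero row.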
9 CM vs PSRs: how they line up
They’re closer than the naming suggests:
CM’s causal state \(S_t\) is, in effect, the entire conditional law of the future given the past, up to equality. That is an (often infinite-dimensional) object.
A PSR state \(\mathbf s_t\) is a coordinate chart on (part of) that conditional law: we pick finitely many tests/features of the future and track their conditional probabilities. If the chosen tests span the relevant space (in the linear PSR sense), we can update and predict consistently within that model class (Singh, James, and Rudary 2012; Boots and Gordon 2011).
A useful way to think about it:
- CM is canonical and exact (given the true process), but can be large or infinite, and learning it can be delicate.
- PSRs are explicitly model-class/feature-dependent, but often yield finite-dimensional, learnable representations with efficient spectral methods, especially in controlled settings.
10 Minimal reading path
- CM foundations and optimality and uniqueness of causal states and ε-machines: (Shalizi and Crutchfield 2000).
- PSR foundations and the “two dominant approaches” (history vs generative) framing: (Littman and Sutton 2001).
- PSRs as a general theory with system-dynamics matrix and comparisons to Markov/HMM/POMDP: (Singh, James, and Rudary 2012).
- Spectral PSR/TPSR learning perspective and terminology (tests, histories) in a compact form: (Boots and Gordon 2011).
