Entropy vs information
MaxEnt(?), macrostates, subjective updating, epistemic randomness, Szilard engines, Gibbs paradox…
2010-12-01 — 2026-02-14
Wherein the relation of entropy and information is examined, and macrostates are shown to be defined as the unique Markovian partitions of phase space, linking thermodynamic notions to predictive causal states.
Related somehow: algorithmic statistics, information geometry.
Computational mechanics is a theory of “state” for stochastic processes, built around a criterion for state that is not what I learned in my state estimation classes back in the day: state is, rather, whatever we need for optimal prediction from data. It formalizes what it means for a process to store information about its past that is relevant to its future, and it gives a canonical predictive model (the ε-machine) plus information-theoretic invariants (e.g. statistical complexity).
See Shalizi and Crutchfield (2000).
1 Intellectual precedents
Several lines of thought converge here.
1.1 Physics stuff
Especially Statistical mechanics and dynamical systems. In physics we often want a reduced description that is predictive for macroscopic observables. Markov partitions, symbolic dynamics, ergodic theory, and the general “coarse-grain but preserve what matters” instinct sit in the background of CM. CM explicitly links its invariants to ergodic/information-theoretic quantities (Shalizi and Crutchfield 2000).
1.2 Information theory stuff
Computational mechanics is natively expressed in conditional distributions, entropies, mutual informations, and minimal sufficient statistics. In practice, Shannon information is treated as the canonical quantification of these quantities (Shalizi and Crutchfield 2000).
1.3 Formal language stuff
Automata and grammar inference. The ε-machine is a probabilistic, minimal predictive automaton: states are equivalence classes of pasts, transitions are labeled by symbols.
2 Intellectual competitors
Competing modelling formalisms include:
A. History-based models. \(k\)-th order Markov, variable-order Markov, suffix trees/contexts. These approximate state by a bounded window of the past.
B. Latent-state generative models. Hidden Markov models (HMMs), state-space models, dynamic Bayes nets; with control: POMDPs. These posit hidden variables that generate observations.
C. Predictive/observable-state models. Predictive State Representations (PSRs), Observable Operator Models (OOMs), and spectral/subspace system identification methods. These also define “state” via predictions of future observables, as coordinates in a finite-dimensional feature space Boots and Gordon (2011).
CM’s distinguishing move is: define state as the minimal statistic of the past that preserves the entire conditional law of the future (not just finite horizons or chosen features), then study the structure and invariants of that object (Shalizi and Crutchfield 2000).
OK, to put that idea into action, let us roll forward.
3 Let’s start with stationary processes
Let \((X_t)_{t\in\mathbb Z}\) be a stationary discrete-time stochastic process over a finite alphabet \(\mathcal A\). Define the semi-infinite past and future at time \(t\): \[ \overleftarrow{X}_t := (\ldots, X_{t-2}, X_{t-1}),\qquad \overrightarrow{X}_t := (X_t, X_{t+1}, \ldots). \] Measure-theoretically, conditioning on a past \(\overleftarrow{x}\) gives a probability measure on the σ-algebra of futures.
The prediction problem is to characterize the conditional law \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}). \]
4 Causal states are predictive equivalence classes of histories
Define an equivalence relation on pasts: \[ \overleftarrow{x}\sim \overleftarrow{x}' \quad\Longleftrightarrow\quad \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t=\overleftarrow{x}'). \] A causal state is an equivalence class under \(\sim\). Let \[ S_t := \epsilon(\overleftarrow{X}_t) \] be the random variable obtained by mapping a realized past to its class (the “causal-state map” \(\epsilon\)).
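The equivalence relation can be made concrete on a toy example. The sketch below assumes the Golden Mean Process (binary, the word 11 forbidden, a fair coin after each 0); for this process the next-symbol law already separates the equivalence classes, though in general the whole future law is required. The process, its parameters, and the helper below are our illustrative choices, not from the text.

```python
from itertools import product

def next_symbol_law(past):
    """Conditional next-symbol distribution (P(0), P(1)) given a past.

    The Golden Mean Process is order-1, so the last symbol determines
    the entire future law: a 1 forces a 0; after a 0 we flip a fair coin.
    """
    if past[-1] == 1:
        return (1.0, 0.0)
    return (0.5, 0.5)

# Group all admissible length-3 pasts by their predictive distribution.
classes = {}
for past in product([0, 1], repeat=3):
    if any(past[i] == past[i + 1] == 1 for i in range(2)):
        continue  # the word 11 never occurs in this process
    classes.setdefault(next_symbol_law(past), []).append(past)

# Five admissible pasts collapse into two predictive equivalence
# classes, i.e. two causal states.
assert len(classes) == 2
```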
Key consequences:
Firstly, Prescience (sufficiency for prediction). Conditioning on \(S_t\) is as informative for the entire future as conditioning on the full past: \[ \mathbb P(\overrightarrow{X}_t \in \cdot \mid S_t) = \mathbb P(\overrightarrow{X}_t \in \cdot \mid \overleftarrow{X}_t), \] equivalently \(H(\overrightarrow{X}_t \mid S_t)=H(\overrightarrow{X}_t \mid \overleftarrow{X}_t)\).
Secondly, Minimality (no redundant memory). Among all statistics \(R_t=\eta(\overleftarrow{X}_t)\) that are equally prescient, \(S_t\) is a minimal sufficient statistic: any prescient \(R_t\) refines (maps to) \(S_t\) a.s. (Shalizi and Crutchfield 2000).
This is the sense in which CM defines the “right” state: the one forced by predictive equivalence.
Sufficient statistics here are defined in an information-theoretic way: \(R_t\) is sufficient for predicting \(\overrightarrow{X}_t\) if \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) (i.e. \(R_t\) is conditionally independent of the future given \(S_t\)).
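The condition \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) can be verified exactly on a small example. The sketch below again assumes the Golden Mean Process, taking \(R\) to be the last two symbols and \(S\) the causal state (determined by the last symbol alone); all distributions here are our toy computation.

```python
import numpy as np

# Stationary distribution over length-2 words (a, b) of the Golden
# Mean Process, and the next-symbol law given the last symbol b.
p_word = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 1/3, (1, 1): 0.0}
p_next = lambda b: {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}[b]

# Joint over (S, R, X): the causal state S is b, the last symbol of R.
joint = {}
for (a, b), pr in p_word.items():
    for x, px in p_next(b).items():
        joint[(b, (a, b), x)] = pr * px

def marginal(keys):
    m = {}
    for (s, r, x), p in joint.items():
        k = tuple({'s': s, 'r': r, 'x': x}[c] for c in keys)
        m[k] = m.get(k, 0.0) + p
    return m

ps, psr, psx = marginal('s'), marginal('sr'), marginal('sx')

# Conditional mutual information I(X; R | S).
cmi = sum(p * np.log2(p * ps[(s,)] / (psr[(s, r)] * psx[(s, x)]))
          for (s, r, x), p in joint.items() if p > 0)
assert abs(cmi) < 1e-12   # R adds nothing beyond S
```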
These look a lot like the sufficient statistics of exponential families, which we presumably recover for sufficiently nice models.
Quibble: it feels off to me to frame this conditional-independence property first and foremost as informational. Sure, \(I(\overrightarrow{X}_t; R_t \mid S_t)=0\) is an information-theoretic condition, but there are many other ways of expressing conditional independence which do not incline us to estimate entropies and mutual informations; those estimates are notoriously cursed, and have all kinds of pathologies when the models are not perfect, which in practice we often wish to circumvent.
5 ε-machines
The ε-machine is the state-transition structure induced by the causal states; it is the canonical predictive model.
For each symbol \(a\in\mathcal A\), define labeled transition probabilities \[ T^{(a)}_{ij} := \mathbb P(X_t=a,\, S_{t+1}=s_j \mid S_t=s_i). \] You can view this as a labeled directed graph (a probabilistic automaton) whose nodes are causal states.
Term of art: unifilar.
A central structural property in CM is unifilarity: the next causal state is determined by the current causal state and the next observed symbol (i.e. state updates are deterministic given the observation symbol).
Let \((S_t, X_t)\) be an edge-emitting HMM with finite state set \(\mathcal S=\{1,\dots,n\}\) and output alphabet \(\mathcal A\). Define the symbol-labeled transition matrices \[ T^{(x)} \in [0,1]^{n\times n},\qquad T^{(x)}_{ij} := \mathbb P\!\left(X_t=x,\ S_{t+1}=j\ \middle|\ S_t=i\right), \quad x\in\mathcal A. \]
We write (right-resolving) unifilarity as follows: \[ \forall i\in\mathcal S,\ \forall x\in\mathcal A:\quad \left|\left\{j\in\mathcal S:\ T^{(x)}_{ij}>0\right\}\right|\le 1. \tag{U1} \] Equivalently, each row of each \(T^{(x)}\) has at most one nonzero entry.
Equivalent deterministic update map. There exists a (partial) function \(\delta:\mathcal S\times\mathcal A\to\mathcal S\) such that \[ T^{(x)}_{ij}=0 \quad\text{for all } j\neq \delta(i,x), \tag{U2} \] and whenever \(\mathbb P(X_t=x\mid S_t=i)>0\), \[ \mathbb P\!\left(S_{t+1}=\delta(i,x)\ \middle|\ S_t=i,\ X_t=x\right)=1. \tag{U3} \] (If \(\mathbb P(X_t=x\mid S_t=i)=0\), \(\delta(i,x)\) can be left undefined.)
Equivalent conditional-entropy statement (under the usual “ignore zero-probability conditioning” convention): \[ H(S_{t+1}\mid S_t, X_t)=0. \tag{U4} \]
Graph form. For each state \(i\) and symbol \(x\), there is at most one outgoing edge from \(i\) labeled \(x\); its destination (if it exists) is \(\delta(i,x)\).
This makes filtering straightforward: as we observe symbols we track a single causal-state trajectory rather than a distribution over hidden states. CM develops optimality/uniqueness results around this representation (Shalizi and Crutchfield 2000).
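The conditions (U1)–(U3) and the single-trajectory filtering can be sketched concretely. Below is our toy instantiation for the Golden Mean Process ε-machine (state A = 0 “just emitted a 0”, B = 1 “just emitted a 1”); the names and matrices are illustrative assumptions, not from the text.

```python
import numpy as np

# Symbol-labeled transition matrices T^(x) for the Golden Mean Process.
T = {0: np.array([[0.5, 0.0],     # from A, emit 0, stay in A
                  [1.0, 0.0]]),   # from B, emit 0, go to A (forced)
     1: np.array([[0.0, 0.5],     # from A, emit 1, go to B
                  [0.0, 0.0]])}   # from B, emitting 1 is forbidden

# Unifilarity (U1): every row of every T^(x) has at most one nonzero.
assert all(int((Tx > 0).sum(axis=1).max()) <= 1 for Tx in T.values())

def delta(i, x):
    """Deterministic update map (U2); None where P(x | state i) = 0."""
    js = np.nonzero(T[x][i])[0]
    return int(js[0]) if len(js) else None

# Filtering = tracking a single state trajectory, not a belief state.
s, traj = 0, []
for x in [0, 1, 0, 1]:
    s = delta(s, x)
    traj.append(s)
assert traj == [0, 1, 0, 1]   # A -> A -> B -> A -> B
```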
By contrast, HMM states are latent and generally non-identifiable without additional assumptions; causal states are defined from observable conditional laws. Also, a minimal HMM (fewest hidden states) need not be unifilar; CM’s ε-machine is the minimal unifilar optimal predictor (conceptually: minimal “predictive memory” with deterministic update).
6 Intrinsic information measures
Given the causal-state process \((S_t)\), CM defines invariants that separate randomness from structure.
Entropy rate \[ h_\mu := \lim_{L\to\infty} \frac{1}{L} H(X_{t:t+L-1}\mid \overleftarrow{X}_t) = H(X_t \mid \overleftarrow{X}_t) = H(X_t \mid S_t), \] where the last equality uses prescience (Shalizi and Crutchfield 2000).
Statistical complexity \[ C_\mu := H(S_t). \] Interpretation: the average amount of information the process stores about the past that is relevant to predicting the future (“predictive memory”) (Shalizi and Crutchfield 2000).
Excess entropy / predictive information \[ \mathbf E := I(\overleftarrow{X}_t;\overrightarrow{X}_t), \] the total past–future mutual information. In general \(\mathbf E \le C_\mu\); the gap is related to “hidden” internal organization not directly manifest as past–future mutual information (often discussed under labels like crypticity in the CM literature) (Shalizi and Crutchfield 2000).
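For small ε-machines these invariants can be computed in closed form. A sketch for the Golden Mean Process (our running toy example; the matrices and state names are assumptions, not from the text):

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# State-to-state transition matrix (symbols marginalized out):
# from A emit 0 or 1 with prob 1/2 each; from B a 0 is forced.
P = np.array([[0.5, 0.5],
              [1.0, 0.0]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()                                  # pi = (2/3, 1/3)

C_mu = H(pi)                                        # ~0.918 bits
h_mu = pi[0] * H([0.5, 0.5]) + pi[1] * H([1.0])     # H(X_t | S_t) = 2/3 bit

assert abs(C_mu - 0.9183) < 1e-3
assert abs(h_mu - 2/3) < 1e-9
```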
7 Inference/learning in CM
In practice, we do not know the true conditional laws, so we estimate them and merge pasts whose estimated predictive distributions are close (with statistical tests, Bayesian nonparametrics, or algorithmic heuristics). CM emphasizes that the target is the causal-state partition and ε-machine, justified by the optimality/minimality results (Shalizi and Crutchfield 2000). The hard cases are exactly those where the ε-machine has many (possibly countably infinite) causal states, or where finite data makes near-equivalences ambiguous.
8 Predictive State Representations (PSRs)
“state as a vector of predictions”
PSRs generalize the same predictive instinct to (often controlled) dynamical systems.
Setup (controlled I/O): at each time you choose an action \(a_t\), then observe \(o_t\). A test is a finite action sequence paired with a finite observation sequence; its value is the probability that, if you execute those actions from now, you will see those observations. A history is the past action/observation stream.
A PSR defines the system’s state at time \(t\) as a vector of such test probabilities: \[ \mathbf s_t := \big(\mathbb P(\text{test } q_i \text{ succeeds} \mid \text{history } h_t)\big)_{i=1}^d, \] for a chosen set of “core tests” \(q_1,\dots,q_d\). Updates are performed by conditioning on the latest action/observation, typically via linear operators in the classical finite setting (Littman and Sutton 2001; Singh, James, and Rudary 2012; Boots and Gordon 2011).
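The linear update can be sketched with observable-operator style matrices. The instantiation below is our hypothetical two-dimensional example for an uncontrolled binary process (again the Golden Mean Process, so predictions can be checked by hand); the general linear PSR update has the same form, state \(\mapsto\) operator times state, renormalized.

```python
import numpy as np

# Observable operators for the two symbols (rows/cols index the two
# core predictive dimensions; values are our toy instantiation).
M = {0: np.array([[0.5, 0.0], [1.0, 0.0]]),
     1: np.array([[0.0, 0.5], [0.0, 0.0]])}
ones = np.ones(2)

def prob_next(s, o):
    """P(next symbol = o | current state vector s)."""
    return float(s @ M[o] @ ones)

def update(s, o):
    """Condition the state on having observed o, then renormalize."""
    v = s @ M[o]
    return v / (v @ ones)

s = np.array([2/3, 1/3])                     # stationary state vector
assert abs(prob_next(s, 1) - 1/3) < 1e-12    # a 1 occurs w.p. 1/3
s = update(s, 0)                             # after a 0, state is pure A
assert np.allclose(s, [1.0, 0.0])
assert abs(prob_next(s, 1) - 0.5) < 1e-12    # fair coin from state A
```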
Two key claims in the PSR line:
PSRs are an alternative to latent-state POMDP belief states: they are grounded in observable predictions and can be at least as compact as minimal latent models in relevant senses (Littman and Sutton 2001; Boots and Gordon 2011).
There is a clean linear-algebraic object—the system-dynamics matrix (rows = histories, columns = tests, entries = conditional test probabilities)—whose rank (“linear dimension”) governs the minimal size of linear PSRs and yields a route to spectral learning (Singh, James, and Rudary 2012; Boots and Gordon 2011).
Learning: modern PSR/TPSR work leverages spectral/subspace algorithms (linear algebra on empirical moments) that are statistically consistent and avoid EM’s local-optimum pathology, at least under standard assumptions (Boots and Gordon 2011).
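The system-dynamics matrix idea can be illustrated with a small finite block whose rank gives the linear dimension. The entries below are our hand-computed conditional test probabilities for the Golden Mean Process, not values from the text.

```python
import numpy as np

# Rows = histories, columns = tests (future words), entries =
# P(test | history). For the Golden Mean Process a history matters only
# through its last symbol, so distinct rows come in just two kinds.
tests = ["0", "1", "00", "01", "10", "11"]
row_after_0 = [0.5, 0.5, 0.25, 0.25, 0.5, 0.0]   # history ending in 0
row_after_1 = [1.0, 0.0, 0.5, 0.5, 0.0, 0.0]     # history ending in 1
D = np.array([row_after_0,                       # history "0"
              row_after_1,                       # history "1"
              row_after_0])                      # history "00", etc.

# The linear dimension is the rank of the (infinite) matrix; here 2,
# matching the two causal states.
assert np.linalg.matrix_rank(D) == 2
```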
9 CM vs PSRs: how they line up
They are closer than the naming suggests:
CM’s causal state \(S_t\) is, in effect, the entire conditional law of the future given the past, modulo equality. That is an (often infinite-dimensional) object.
A PSR state \(\mathbf s_t\) is a coordinate chart on (part of) that conditional law: you pick finitely many tests/features of the future and track their conditional probabilities. If your chosen tests span the relevant space (in the linear PSR sense), you can update and predict consistently within that model class (Singh, James, and Rudary 2012; Boots and Gordon 2011).
A useful way to think about it:
- CM is canonical and exact (given the true process), but can be large/infinite and learning it is delicate.
- PSRs are explicitly model-class/feature dependent, but often yield finite-dimensional, learnable representations with efficient spectral methods, especially in controlled settings.
10 Minimal reading path
- CM foundations and optimality/uniqueness of causal states and ε-machines: (Shalizi and Crutchfield 2000).
- PSR foundations and the “two dominant approaches” (history vs generative) framing: (Littman and Sutton 2001).
- PSRs as a general theory with system-dynamics matrix and comparisons to Markov/HMM/POMDP: (Singh, James, and Rudary 2012).
- Spectral PSR/TPSR learning perspective and terminology (tests, histories) in a compact form: (Boots and Gordon 2011).
