Homunculi all the way down
Formal models of minds that model themselves and each other
2026-04-15 — 2026-04-16
Wherein the formalisms by which a social agent may carry a compressed model of another agent — and of itself — are surveyed across three axes: recursion depth, representation rank, and self-referential reflectivity.
Research-background notes. I want to pin down what it would mean, formally, for a social entity to contain a reduced-rank model of another social entity — possibly even a reduced-rank model of itself. This is a literature scan of places where such formalisms already exist, leaning on LLM-assisted lit review, some PDFs I had in a folder, and some vibesy dot points I sketched out at the PIBBS x ILIAD residency.
1 A phenomenon of note
A mind modelling another mind is an agent embedded in an environment that contains agents with comparable representational capacity to itself. If the only faithful model of Alice is Alice, then Bob cannot fit one in his head. So Bob must carry a compressed Alice: fewer parameters, coarser predictions, maybe with a cartoon-level ontology. Call this a reduced-rank other-model.
Bob must also act, and acting well requires that Bob predict his own future behaviour. If the only faithful model of Bob is Bob, he cannot fit one either. So Bob carries a “reduced-rank” self-model. This self-model is what Metzinger calls the phenomenal self-model (Metzinger 2003), what Graziano’s Attention Schema Theory (Graziano 2013) makes into a neural control-theoretic object, and what Schmidhuber-flavoured AI calls a world-model-containing-self (Ha and Schmidhuber 2018).
The bicameral-mind literature (Jaynes 1976) gestures at a related phenomenology — the sense that “I” am addressed by a voice that is also “I” — but it is not formal enough to build on. I want formalisms that admit theorems or implementations.
Three axes of interest:
- Other-modelling. How do formalisms represent nested belief (“I think that you think that I think…”)?
- Self-modelling. How do formalisms represent an agent that contains a compressed simulacrum of itself?
- Reduced rank. How is the “reduction” made rigorous — rate-distortion, PAC bounds, etc.?
2 Other-models: the formal landscape
2.1 Interactive POMDPs
Gmytrasiewicz and Doshi’s interactive partially observable Markov decision processes (I-POMDPs) (Gmytrasiewicz and Doshi 2005) give a clean formulation of recursive belief. The state space is augmented with models of the other agents, which themselves include models of this agent, and so on. A finitely-nested I-POMDP truncates the recursion at level \(k\) — agents at level 0 treat others as noise; level 1 models level-0s; level 2 models level-1s; and so on. This is an operationalisation of “reduced rank”: the recursion is cut off, and the depth is a tunable resource.
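The truncation is small enough to write out. A sketch, assuming a symmetric 2×2 game with hypothetical prisoner's-dilemma payoffs (not from the cited paper): level 0 plays uniformly at random, and each level above it best-responds to a model of the other as one level shallower.

```python
import numpy as np

# A sketch of finitely-nested (level-k) reasoning in a symmetric 2x2 game.
# Payoffs are hypothetical prisoner's-dilemma numbers: A[a, b] is my payoff
# when I play a (0 = cooperate, 1 = defect) and the other plays b.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def level_k(payoff, k):
    """Level-k policy: level 0 is uniform noise; level k best-responds
    to a model of the other as a level-(k-1) agent."""
    if k == 0:
        return np.array([0.5, 0.5])
    opp = level_k(payoff, k - 1)          # the reduced-rank other-model
    dist = np.zeros(2)
    dist[int(np.argmax(payoff @ opp))] = 1.0
    return dist
```

Here the recursion depth \(k\) is literally the tunable resource: each extra level buys one more nesting of “I think that you think”.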
Related work in game theory:
- Level-k / cognitive hierarchy models in behavioural game theory (Camerer, Ho, and Chong 2004), where players assume others are reasoning at a lower level than themselves. Empirically, humans cluster at \(k \in \{0,1,2\}\).
- Quantal response equilibrium (McKelvey and Palfrey 1995), where bounded rationality is modelled by stochastic best-response rather than a deeper recursion.
- Epistemic game theory (Dekel and Siniscalchi 2015), which formalises common knowledge, common belief, and the belief hierarchies above.
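Quantal response makes a nice contrast in code: instead of truncating the recursion, replace the argmax with a softmax and look for a fixed point. A sketch with hypothetical symmetric prisoner's-dilemma payoffs; `lam` is the rationality parameter.

```python
import numpy as np

# A sketch of logit quantal response: bounded rationality as noisy
# best-response rather than deeper recursion. Payoffs are hypothetical;
# lam -> 0 gives uniform play, lam -> infinity recovers best response.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def logit_response(payoff, opp_dist, lam):
    u = lam * (payoff @ opp_dist)
    u -= u.max()                          # numerical stability
    e = np.exp(u)
    return e / e.sum()

def qre(payoff, lam, iters=500):
    """Damped fixed-point iteration toward a quantal response equilibrium."""
    p = q = np.array([0.5, 0.5])
    for _ in range(iters):
        p, q = (0.5 * (p + logit_response(payoff, q, lam)),
                0.5 * (q + logit_response(payoff, p, lam)))
    return p, q

p, q = qre(A, lam=1.0)
```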
2.2 Bayesian Theory of Mind
Baker, Saxe, and Tenenbaum (C. L. Baker et al. 2017; C. Baker, Saxe, and Tenenbaum 2011) formalise human social cognition as inverse planning: observers invert a generative model of rational action to infer the latent goals and beliefs of others. The other-model here is the generative model — typically a small MDP or POMDP parameterised by a utility and belief — and inference is Bayesian. This gives us a concrete posterior over other minds that one can compute with and prove things about.
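The inverse-planning move fits in a few lines. A toy sketch: a 1-D corridor, two candidate goals, and a softmax-rational policy, all invented for illustration rather than taken from the cited papers.

```python
import numpy as np

# Inverse planning in miniature: an observer infers an agent's goal from
# its actions by inverting a generative model of softmax-rational action.
states = [0, 1, 2, 3]           # a 1-D corridor
actions = [-1, +1]              # step left / right
goals = [0, 3]                  # candidate goal states

def policy(state, goal, beta=3.0):
    """P(action | state, goal): softmax over negative distance-to-goal."""
    utilities = np.array([-abs((state + a) - goal) for a in actions])
    e = np.exp(beta * utilities)
    return e / e.sum()

def posterior_over_goals(trajectory, prior=None):
    """Bayes rule over goals given (state, action) observations."""
    post = np.ones(len(goals)) if prior is None else np.array(prior, float)
    for state, a in trajectory:
        ai = actions.index(a)
        post *= np.array([policy(state, g)[ai] for g in goals])
    return post / post.sum()

# Watching two rightward steps makes goal 3 far more probable than goal 0.
post = posterior_over_goals([(1, +1), (2, +1)])
```

The other-model here is exactly the small generative model `policy`, and the posterior is the concrete object one can compute with.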
2.3 Machine Theory of Mind
Rabinowitz et al.’s ToMnet (Rabinowitz et al. 2018) is a deep-learning analogue: a meta-learning agent that, from a few observations of a target agent, infers an embedding which predicts the target’s future behaviour. The embedding is the reduced-rank other-model. ToMnet variants have been extended to false-belief tasks and inverse-RL settings (Oguntola et al. 2023).
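Not ToMnet itself (no meta-learning, no neural nets), but the shape of the idea can be sketched: compress a few observations of a target agent into a tiny "character embedding", here just a smoothed action histogram, and predict future behaviour from it.

```python
import numpy as np

# Toy stand-in for a ToMnet-style character embedding: a Dirichlet-smoothed
# action histogram as the reduced-rank other-model. All numbers invented.
n_actions = 4

def character_embedding(observed_actions, alpha=1.0):
    """Compress observed behaviour into a fixed-size vector."""
    counts = np.bincount(observed_actions, minlength=n_actions) + alpha
    return counts / counts.sum()

def predict_next_action(embedding):
    return int(np.argmax(embedding))

# A target that mostly takes action 2:
obs = np.array([2, 2, 1, 2, 2, 3, 2])
emb = character_embedding(obs)
```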
2.4 Opponent-modelling in multi-agent RL
- LOLA (Learning with Opponent-Learning Awareness) (Foerster et al. 2018) computes gradients through a model of the opponent’s learning dynamics. This is a differentiable other-model.
- COLA, POLA, M-FOS and successors refine LOLA with higher-order or policy-level models (Willi et al. 2022; Zhao et al. 2022; Lu et al. 2022).
- Opponent shaping more generally treats the other agent as a learnable dynamical system, which is a particular operationalisation of “the other is a reduced-rank version of me”.
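The LOLA update can be sketched with hand-coded gradients in a hypothetical differentiable game (quadratic payoffs chosen for legibility, not the paper's iterated matrix games): the agent adds to its naive gradient a shaping term that differentiates through the opponent's anticipated learning step.

```python
# A sketch of the LOLA idea: gradient ascent on V1 plus a correction from
# differentiating through the opponent's naive update y' = y + alpha*dV2/dy.
# Hypothetical payoffs: V1(x, y) = -x**2 + x*y, V2(x, y) = -y**2 + x*y.
alpha = 0.1    # opponent's assumed learning rate
eta = 0.05     # our learning rate

dV1_dx = lambda x, y: -2 * x + y
dV1_dy = lambda x, y: x
dV2_dy = lambda x, y: -2 * y + x
d2V2_dydx = lambda x, y: 1.0   # d/dx of dV2/dy

def lola_step(x, y):
    naive = dV1_dx(x, y)
    # Shaping term: how our parameter moves the opponent's next update.
    shaping = dV1_dy(x, y) * alpha * d2V2_dydx(x, y)
    return x + eta * (naive + shaping)

x, y = 1.0, -1.0
for _ in range(200):
    x_new = lola_step(x, y)
    y = y + alpha * dV2_dy(x, y)   # the opponent actually does naive ascent
    x = x_new
```

The other-model here is the assumed update rule `y + alpha * dV2_dy`: a one-step, differentiable, reduced-rank picture of the opponent's learning.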
2.5 LLMs and emergent theory of mind
The question of whether transformer language models contain an implicit theory of mind has generated a cottage industry (Kosinski 2023; Ullman 2023; Sclar et al. 2023; Gandhi et al. 2023). The cautious answer seems to be that they carry shallow heuristics that look like ToM on canonical tasks and break on adversarial ones. Whatever they do have is, almost by construction, a reduced-rank model: compressed into attention patterns and residual-stream features.
3 Self-models: the formal landscape
3.1 World models containing self
The cleanest ML instantiation is Ha & Schmidhuber’s World Models (Ha and Schmidhuber 2018), where a recurrent latent model predicts both environment dynamics and the consequences of the agent’s own actions. The agent’s policy is trained inside this compressed dreamscape. The self here is a reduced-rank conditional — “what would my controller do, given this latent” — rather than an introspectable entity, but the compression is real.
The lineage continues through Dreamer (Hafner et al. 2024), MuZero (Schrittwieser et al. 2020), and the larger world-model-RL programme. See also world models.
3.2 Active inference and the self as generative model
Active inference (Friston et al. 2017; Parr, Pezzulo, and Friston 2022) treats the agent as a generative model of its own sensorium, including its own actions. Free-energy minimisation forces the self-model to be as compressed as is consistent with prediction — a direct rate-distortion pressure. The self here is a probabilistic model with the agent’s own observation-action trajectory as a latent.
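The bookkeeping is small enough to write out. A two-state toy (numbers invented, not a full active-inference agent): the variational free energy \(F(q) = \mathbb{E}_q[\ln q(s) - \ln p(o, s)]\) is minimised exactly at the Bayesian posterior, where it equals the negative log evidence \(-\ln p(o)\).

```python
import numpy as np

# Variational free energy in its smallest discrete form. The generative
# model (prior and likelihood) is a hypothetical two-state toy.
prior = np.array([0.5, 0.5])              # p(s)
likelihood = np.array([[0.9, 0.2],        # p(o | s): rows o, columns s
                       [0.1, 0.8]])
o = 0                                      # the observed outcome

def free_energy(q):
    """F(q) = E_q[ln q(s) - ln p(o, s)] for the observed o."""
    joint = likelihood[o] * prior          # p(o, s)
    return float(np.sum(q * (np.log(q) - np.log(joint))))

posterior = likelihood[o] * prior
posterior /= posterior.sum()

# F at the exact posterior equals -ln p(o); any other q scores worse.
F_star = free_energy(posterior)
```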
3.3 Self-modelling robots
A beautifully concrete line: Bongard, Zykov, and Lipson’s Resilient Machines Through Continuous Self-Modeling (Bongard, Zykov, and Lipson 2006) — a quadruped robot that learns a forward model of its own body, then uses it to plan locomotion; when a limb is damaged, the model updates, and the robot recovers. The self-model is an explicit, parameterised, low-rank dynamical system. See the follow-up (Kwiatkowski and Lipson 2019) for differentiable variants.
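The loop can be abstracted to a sketch: fit a linear forward model of one's own (hypothetical) body dynamics from experience, then refit when the body changes. The dynamics, noise level, and "damage" here are all invented for illustration, not taken from the paper.

```python
import numpy as np

# Self-modelling in one dimension of abstraction: learn x' = A x + B u
# from (state, action, next-state) triples by least squares.
rng = np.random.default_rng(0)

def rollout(A, B, n=200):
    X, U, Xn = [], [], []
    x = rng.normal(size=2)
    for _ in range(n):
        u = rng.normal(size=1)
        x_next = A @ x + B @ u + 0.01 * rng.normal(size=2)
        X.append(x); U.append(u); Xn.append(x_next)
        x = x_next
    return np.array(X), np.array(U), np.array(Xn)

def fit_self_model(X, U, Xn):
    """Least-squares estimate of [A | B] from experience."""
    Z = np.hstack([X, U])
    W, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
    return W.T                             # shape (2, 3): [A | B]

A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])
W = fit_self_model(*rollout(A_true, B_true))

# "Damage": a limb weakens, so the old self-model is now wrong;
# refitting on fresh experience recovers the new dynamics.
A_damaged = np.array([[0.9, 0.1], [0.0, 0.3]])
W_new = fit_self_model(*rollout(A_damaged, B_true))
```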
3.4 Attention Schema Theory
Graziano (Graziano 2013; Graziano et al. 2019) argues that consciousness is the brain’s (incomplete, schematic) model of its own attention. This is explicitly a reduced-rank model: the schema is coarser than the machinery it represents, because representing attention in full would require as much machinery as attention itself. Wilterson and Graziano have begun to operationalise this in neural-network models (Wilterson and Graziano 2021).
3.5 Schmidhuber and reflective learners
Schmidhuber’s early work on self-referential neural networks (Schmidhuber 1993) and later Gödel machines (Schmidhuber 2003) formalises learners that inspect and modify their own code, subject to provability constraints. The Gödel-machine construction is where the proof-theoretic aspect of self-modelling seems to cause grief: self-modification is gated by a proof that the modification improves expected utility.
3.6 Predictive coding and hierarchical self
Hierarchical predictive coding architectures (Rao and Ballard 1999; Clark 2013) include top-down predictions that span the whole sensory hierarchy, including proprioceptive and interoceptive signals — i.e., representations of the organism. See predictive coding.
4 Reducing fidelity of representation
Several toolkits formalise “reduced rank”:
- Rate-distortion theory applied to cognition (Sims 2012; Zénon, Solopchuk, and Pezzulo 2019; Lai and Gershman 2021): the cost of mental representation is an information-theoretic rate, the benefit is task performance, and optimal bounded agents sit on the rate-distortion frontier.
- Information bottleneck (Tishby, Pereira, and Bialek 2000; Alemi et al. 2019): compress inputs to a latent that is maximally informative about a downstream variable. When the downstream variable is “the other agent’s next action”, the bottleneck is a reduced-rank other-model.
- Resource-rational analysis (Lieder and Griffiths 2019): agents are optimal given bounded compute; the bound is the reduction.
- Successor representations / features (Dayan 1993; Barreto et al. 2016): compressed future-prediction models that generalise well across reward functions. A kind of reduced-rank self-model of one’s own policy.
- Bounded rationality as a research programme (Simon 1955; S. J. Russell and Subramanian 1995; S. Russell 2016).
- Epsilon-machines / computational mechanics (Crutchfield and Young 1989; Shalizi and Crutchfield 2000): minimal-sufficient-statistic models of a process. The causal states are the provably minimum-rank predictor.
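Of these, the successor representation is the cheapest to exhibit: one line of linear algebra. A sketch on a toy three-state chain (transition matrix invented): \(M = (I - \gamma P)^{-1}\) holds discounted expected future state occupancies under a fixed policy, and values for any reward vector follow by a dot product.

```python
import numpy as np

# Successor representation for a fixed policy on a toy three-state chain.
# P is the policy-induced transition matrix; row i of M holds discounted
# expected future occupancies from state i: a compressed model of one's
# own future behaviour.
gamma = 0.9
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.0, 0.0, 1.0]])           # rightmost state is absorbing

M = np.linalg.inv(np.eye(3) - gamma * P)

# Values for any reward vector by a dot product: V = M @ r.
r = np.array([0.0, 0.0, 1.0])             # reward only at the right end
V = M @ r
```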
For multi-agent compressions specifically, see theory of mind as inverse reinforcement learning (Jara-Ettinger 2019) and the recent graph-theoretic accounts of social abstraction (Stolk, Verhagen, and Toni 2016).
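The rate-distortion frontier itself is computable by Blahut-Arimoto iteration. A sketch for a binary source under Hamming distortion (toy choices throughout), trading bits against distortion via the multiplier `beta`:

```python
import numpy as np

# Blahut-Arimoto iteration for one point on the rate-distortion frontier:
# the cheapest stochastic compression q(xhat | x) of a source at a given
# distortion trade-off. Source and distortion matrix are toy choices.
p_x = np.array([0.5, 0.5])                         # source distribution
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])                         # Hamming distortion
beta = 3.0                                         # trade-off multiplier

q_xhat = np.array([0.5, 0.5])                      # marginal over codewords
for _ in range(200):
    # Optimal channel given the marginal:
    w = q_xhat[None, :] * np.exp(-beta * d)        # shape (x, xhat)
    q_cond = w / w.sum(axis=1, keepdims=True)
    # Marginal given the channel:
    q_xhat = p_x @ q_cond

rate = float(np.sum(p_x[:, None] * q_cond *
                    np.log2(q_cond / q_xhat[None, :])))
distortion = float(np.sum(p_x[:, None] * q_cond * d))
```

Sweeping `beta` traces out the frontier on which an optimal bounded agent's other-models would sit.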
5 Modular / bicameral architectures
The bicameral intuition — “the mind is many minds talking” — has several formal incarnations, none of which cite Jaynes.
- Minsky’s Society of Mind (Minsky 1986): not formal, but programmatic. Each “agent” is a specialist; the society is the mind.
- Global Workspace Theory (Baars 1993; Dehaene 2014): many specialist modules compete for broadcast to a low-capacity global workspace. Formalised in neural-network terms by VanRullen & Kanai (VanRullen and Kanai 2021) and in RL by Goyal et al.’s Coordination Among Neural Modules Through a Shared Global Workspace (Goyal et al. 2021). The workspace is an explicit reduced-rank bottleneck through which modules exchange self-models and other-models.
- Mixture of experts (Jacobs et al. 1991; Shazeer et al. 2017): gating networks that route inputs to specialists. When the experts have self-models of their own confidence and competence, we recover a society.
- Dennett’s multiple drafts (Dennett 1991): philosophical, not formal, but the architectural proposal is that “the self” is a posteriorly constructed narrative over parallel processes — compatible with global-workspace formalisms.
- Attention Schema Theory (above) fits naturally here.
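The workspace-as-bottleneck claim can be sketched directly: modules emit messages, a softmax over salience scores decides who wins the competition, and the winner is compressed through a low-dimensional workspace before being broadcast back to everyone. Dimensions, weights, and salience scores below are all illustrative.

```python
import numpy as np

# A global-workspace bottleneck in miniature. Each module produces a
# message vector; salience-weighted competition selects what passes
# through the reduced-rank workspace and is broadcast back.
rng = np.random.default_rng(1)
d_module, d_ws, n_modules = 16, 4, 5

W_down = rng.normal(size=(d_ws, d_module)) / np.sqrt(d_module)  # compress
W_up = rng.normal(size=(d_module, d_ws)) / np.sqrt(d_ws)        # broadcast

outputs = rng.normal(size=(n_modules, d_module))   # each module's message
salience = np.array([0.1, 0.2, 3.0, 0.1, 0.2])     # module 2 shouts loudest

attn = np.exp(salience) / np.exp(salience).sum()   # competition
workspace = W_down @ (attn @ outputs)              # reduced-rank bottleneck
broadcast = W_up @ workspace                       # sent back to all modules
```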
6 Self-referential agents and the proof-theoretic frontier
If we want to prove things about minds that model themselves, we hit self-reference, and self-reference hits Löb’s theorem and friends.
- Reflective oracles (Fallenstein, Taylor, and Christiano 2015): a construction of probability distributions closed under self-reference, solving the naive inconsistency of an agent that reasons about its own beliefs.
- Logical induction (Garrabrant et al. 2020): a market-based learner whose beliefs about its own future beliefs converge, evading the diagonal pathologies.
- Löbian obstacle to self-trust (Yudkowsky and Herreshoff 2013): a self-modifying agent that is unwilling to endorse a successor with the same deductive power runs into Löb’s theorem.
- Modal combat agents and program equilibrium (Bárász et al. 2014; Critch 2019): agents that condition on source-code-level models of each other; admits bona fide equilibria in the one-shot prisoner’s dilemma.
- AIXI and approximations (Hutter 2005; Leike et al. 2016): a formally optimal agent whose self-model is implicit in the universal prior; computable approximations (e.g., AIXI-tl, MC-AIXI) buy tractability at the cost of reducing the rank of the prior.
MIRI’s Agent Foundations programme is the main hub for this line.
7 Let’s attempt synthesis!
There are three independent knobs:
1. Depth of recursion (levels of “I think that you think”): cut off cleanly by I-POMDPs and level-k.
2. Rank of representation (bits spent per other, per self): quantified by rate-distortion, information bottleneck, successor features.
3. Reflectivity (the model contains a model of itself): addressed by reflective oracles, logical induction, Gödel machines.
A minimally rich formalism of “a social entity containing reduced-rank models of other social entities and of itself” would specify all three. I don’t know of a paper that does this cleanly end-to-end. Candidates that come close:
- Active inference with hierarchical generative models covers 2 and partially 3.
- I-POMDPs with neural network model approximators (e.g., Han and Gmytrasiewicz 2019) cover 1 and 2.
- Modular world models with opponent-shaping (LOLA-style inside a Dreamer-style world model) cover 1, 2, and implicitly 3.
8 Proof-and-implement directions
If one wanted to prove things about such minds:
- Regret bounds for level-\(k\) agents against level-\(\ell\) opponents under rank-\(r\) representation. Does regret decompose into a recursion-depth term, a rank term, and an unavoidable opponent-class term?
- Rate-distortion frontier for social prediction: for a population with known dynamics, what is the minimum bits per agent to achieve a given prediction accuracy? There is a small literature on this (Bilodeau 2023) but nothing I find definitive.
- Fixed-point existence for mutually-modelling agents under neural network function approximation — when do recursive self-consistency constraints admit tractable solutions?
- Distillation as self-modelling: policy distillation from a capable teacher to a smaller student is a concrete reduction-of-rank operation. Does the student, in the limit, acquire a faithful-but-compressed self-model?
If one wanted to implement them:
- Drop-in: ToMnet + Dreamer + LOLA in a single loop.
- Ambitious: a global-workspace bottleneck shared across agent modules, where other-models and self-models live in the same reduced-rank latent space.
- Wilder: reflective oracles as a probabilistic programming primitive, exposed to a neural policy as a callable.
9 Incoming
- Dennett and Hofstadter’s The Mind’s I as a pre-formal reading list.
- Hofstadter’s strange loops (Hofstadter 2008): evocative, not formal, but a useful bridge.
- Tononi’s IIT as a measure on self-models, if one is feeling brave.
- Whether the “observer self” of contemplative traditions corresponds to an attention-schema-style reduced-rank self-model — this seems to be Metzinger’s read.
