Homunculi all the way down

Formal models of minds that model themselves and each other

2026-04-15 — 2026-04-16

Wherein the formalisms by which a social agent may carry a compressed model of another agent — and of itself — are surveyed across three axes: recursion depth, representation rank, and self-referential reflectivity.

agent foundations
bounded compute
cooperation
learning
mind
neural nets
probability
reinforcement learning
theory of mind
wonk

Research-background notes. I want to pin down what it would mean, formally, for a social entity to contain a reduced-rank model of another social entity — possibly even a reduced-rank model of itself. This is a literature scan of places where such formalisms already exist, leaning upon LLM lit review, some PDFs I had in a folder, and some vibesy dot points I sketched out at the PIBBS x ILIAD residency.

Figure 1

1 A phenomenon of note

A mind modelling another mind is an agent embedded in an environment that contains agents with comparable representational capacity to itself. If the only faithful model of Alice is Alice, then Bob cannot fit one in his head. So Bob must carry a compressed Alice: fewer parameters, coarser predictions, maybe with a cartoon-level ontology. Call this a reduced-rank other-model.

Bob must also act, and acting well requires that Bob predict his own future behaviour. If the only faithful model of Bob is Bob, he cannot fit one either. So Bob carries a “reduced-rank” self-model. This self-model is what Metzinger calls the phenomenal self-model (Metzinger 2003), what Graziano’s Attention Schema Theory (Graziano 2013) makes into a neural control-theoretic object, and what Schmidhuber-flavoured AI calls a world-model-containing-self (Ha and Schmidhuber 2018).

The bicameral-mind literature (Jaynes 1976) gestures at a related phenomenology — the sense that “I” am addressed by a voice that is also “I” — but it is not formal enough to build on. I want formalisms that admit theorems or implementations.

Three axes of interest:

  1. Other-modelling. How do formalisms represent nested belief (“I think that you think that I think…”)?
  2. Self-modelling. How do formalisms represent an agent that contains a compressed simulacrum of itself?
  3. Reduced rank. How is the “reduction” made rigorous — rate-distortion, PAC bounds, etc.?

2 Other-models: the formal landscape

2.1 Interactive POMDPs

Gmytrasiewicz and Doshi’s interactive partially observable Markov decision processes (I-POMDPs) (Gmytrasiewicz and Doshi 2005) give a clean formulation of recursive belief. The state space is augmented with models of the other agents, which themselves include models of this agent, and so on. A finitely-nested I-POMDP truncates the recursion at level \(k\) — agents at level 0 treat others as noise; level 1 models level-0s; level 2 models level-1s; and so on. This is an operationalisation of “reduced rank”: the recursion is cut off, and the depth is a tunable resource.
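The truncated recursion can be caricatured in a few lines. This is a minimal sketch, not the I-POMDP machinery itself: the payoff matrix is a hypothetical prisoner’s-dilemma-shaped game, and uniform randomness stands in for the level-0 “others are noise” base case.

```python
import numpy as np

# A symmetric 2x2 game with hypothetical payoffs (prisoner's-dilemma-shaped):
# entry [i, j] is a player's payoff for playing i against an opponent playing j.
A = np.array([[3.0, 0.0],
              [5.0, 1.0]])

def level_k_action(k, payoff):
    """Best response of a level-k agent in a symmetric game.

    Level 0 is modelled as uniformly random -- pure "noise", as in the base
    case of a finitely-nested I-POMDP; level k best-responds to a
    level-(k-1) model of the opponent. The recursion depth is the resource.
    """
    if k == 0:
        return None                          # sentinel: level 0 mixes uniformly
    opp = level_k_action(k - 1, payoff)
    if opp is None:
        expected = payoff.mean(axis=1)       # expected payoff vs uniform opponent
    else:
        expected = payoff[:, opp]            # payoff vs the opponent's pure action
    return int(np.argmax(expected))

# In this game, every level >= 1 defects (action 1).
print([level_k_action(k, A) for k in (1, 2, 3)])
```

The point of the sketch is only that the recursion bottoms out by fiat: depth \(k\) is a tunable, finite resource, exactly the knob I-POMDPs expose.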

Related work in game theory:

  • Level-k / cognitive hierarchy models in behavioural game theory (Camerer, Ho, and Chong 2004), where players assume others are reasoning at a lower level than themselves. Empirically, humans cluster at \(k \in \{0,1,2\}\).
  • Quantal response equilibrium (McKelvey and Palfrey 1995), where bounded rationality is modelled by stochastic best-response rather than a deeper recursion.
  • Epistemic game theory (Dekel and Siniscalchi 2015), which formalises common knowledge, common belief, and the belief hierarchies above.
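Quantal response is easy to simulate: replace argmax best response with a softmax and iterate to a fixed point. A toy sketch (the prisoner’s-dilemma payoffs and the rationality parameter are illustrative assumptions, and simple iteration stands in for proper equilibrium computation):

```python
import numpy as np

def softmax(u, lam):
    z = np.exp(lam * (u - u.max()))
    return z / z.sum()

def logit_qre(A, B, lam, iters=200):
    """Iterate the logit response map to approximate a quantal response
    equilibrium of a bimatrix game. A[i, j] is the row player's payoff,
    B[i, j] the column player's, for row action i and column action j."""
    p = np.ones(A.shape[0]) / A.shape[0]   # row player's mixed strategy
    q = np.ones(A.shape[1]) / A.shape[1]   # column player's mixed strategy
    for _ in range(iters):
        p = softmax(A @ q, lam)            # stochastic, not exact, best response
        q = softmax(B.T @ p, lam)
    return p, q

# Prisoner's dilemma with hypothetical payoffs; action 1 is "defect".
# With finite rationality lam, defection is likely but not certain.
A = np.array([[3.0, 0.0], [5.0, 1.0]])
p, q = logit_qre(A, A.T, lam=2.0)
print(np.round(p, 3), np.round(q, 3))
```

The contrast with level-k is visible in the output: bounded rationality here shows up as residual probability mass on the dominated action, not as a shallower recursion.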

2.2 Bayesian Theory of Mind

Baker, Saxe, and Tenenbaum (C. L. Baker et al. 2017; C. Baker, Saxe, and Tenenbaum 2011) formalise human social cognition as inverse planning: observers invert a generative model of rational action to infer the latent goals and beliefs of others. The other-model here is the generative model — typically a small MDP or POMDP parameterised by a utility and belief — and inference is Bayesian. This gives us a concrete posterior over other minds that one can compute with and prove things about.
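Inverse planning is concrete enough to fit in a toy script. A minimal sketch, with illustrative assumptions throughout: a 1-D corridor instead of a gridworld, candidate goals at the two ends, and a softmax-rational policy as the generative model the observer inverts.

```python
import numpy as np

goals = [0, 4]   # candidate goal positions on a 1-D corridor (hypothetical)
beta = 2.0       # rationality: higher = more reliably goal-directed

def policy(pos, goal):
    """Softmax-rational policy over moves {-1, +1}: the generative model
    of action, preferring the move that reduces distance to the goal."""
    utils = np.array([-abs((pos - 1) - goal), -abs((pos + 1) - goal)])
    z = np.exp(beta * (utils - utils.max()))
    return z / z.sum()               # P(move | pos, goal)

def posterior_over_goals(trajectory, prior=(0.5, 0.5)):
    """Bayesian observer: invert the generative model of rational action
    to infer which goal the agent is pursuing."""
    log_post = np.log(np.array(prior))
    for pos, nxt in zip(trajectory, trajectory[1:]):
        move_idx = 0 if nxt < pos else 1
        for gi, g in enumerate(goals):
            log_post[gi] += np.log(policy(pos, g)[move_idx])
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Two rightward steps from the middle: strong evidence for the goal at 4.
print(np.round(posterior_over_goals([2, 3, 4]), 3))
```

The other-model here is exactly the small parameterised policy, and the posterior over goals is the computable object one can prove things about.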

2.3 Machine Theory of Mind

Rabinowitz et al.’s ToMnet (Rabinowitz et al. 2018) is a deep-learning analogue: a meta-learning agent that, from a few observations of a target agent, infers an embedding which predicts the target’s future behaviour. The embedding is the reduced-rank other-model. ToMnet variants have been extended to false-belief tasks and inverse-RL settings (Oguntola et al. 2023).

2.4 Opponent-modelling in multi-agent RL

  • LOLA (Learning with Opponent-Learning Awareness) (Foerster et al. 2018) computes gradients through a model of the opponent’s learning dynamics. This is a differentiable other-model.
  • COLA, POLA, M-FOS and successors refine LOLA with higher-order or policy-level models (Willi et al. 2022; Zhao et al. 2022; Lu et al. 2022).
  • Opponent shaping more generally treats the other agent as a learnable dynamical system, which is a particular operationalisation of “the other is a reduced-rank version of me”.
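The LOLA idea — differentiate your own value through the opponent’s learning step — can be shown in a two-parameter toy game. A sketch under loud assumptions: the quadratic payoffs are invented for illustration, finite differences stand in for autodiff, and only player 1’s first-order LOLA correction is computed.

```python
import numpy as np

# A two-player differentiable game (hypothetical payoffs): each player
# controls one scalar parameter.
def V1(x, y): return x * y - 0.5 * x**2     # player 1's objective
def V2(x, y): return x * y - 0.5 * y**2     # player 2's objective

def grad(f, x, y, wrt, eps=1e-5):
    """Central finite difference, standing in for autodiff."""
    if wrt == "x":
        return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

def lola_grad_x(x, y, alpha=0.1):
    """LOLA gradient for player 1: differentiate V1 through the opponent's
    anticipated naive learning step y' = y + alpha * dV2/dy."""
    def V1_after_opp_update(x_, y_):
        y_next = y_ + alpha * grad(V2, x_, y_, "y")
        return V1(x_, y_next)
    return grad(V1_after_opp_update, x, y, "x")

# At (0.5, 0.5) the naive gradient y - x vanishes, but the shaping term
# through the opponent's update does not.
x, y = 0.5, 0.5
naive = grad(V1, x, y, "x")
shaped = lola_grad_x(x, y)
print(round(naive, 4), round(shaped, 4))
```

The shaped gradient is nonzero precisely because player 1’s action moves the opponent’s next parameters — the opponent is being carried around as a differentiable, reduced-rank dynamical system.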

2.5 LLMs and emergent theory of mind

The question of whether transformer language models contain an implicit theory of mind has generated a cottage industry (Kosinski 2023; Ullman 2023; Sclar et al. 2023; Gandhi et al. 2023). The cautious answer seems to be that they carry shallow heuristics that look like ToM on canonical tasks and break on adversarial ones. Whatever they do have is, almost by construction, a reduced-rank model: compressed into attention patterns and residual-stream features.

3 Self-models: the formal landscape

3.1 World models containing self

The cleanest ML instantiation is Ha & Schmidhuber’s World Models (Ha and Schmidhuber 2018), where a recurrent latent model predicts both environment dynamics and the consequences of the agent’s own actions. The agent’s policy is trained inside this compressed dreamscape. The self here is a reduced-rank conditional — “what would my controller do, given this latent” — rather than an introspectable entity, but the compression is real.
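The learn-a-compressed-model-then-act-inside-it loop can be caricatured in a few lines. A toy sketch, not the paper’s VAE-RNN architecture: a 1-D linear system with invented constants stands in for the environment, least squares for the learned dynamics, and exhaustive one-step search for the controller trained “in the dream”.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (unknown-to-the-agent) 1-D dynamics; the agent learns a compressed
# model of the consequences of its own actions, then plans inside it.
def env_step(s, u):
    return 0.9 * s + 0.5 * u

# 1. Collect rollouts with random actions.
S, U, S_next = [], [], []
s = 1.0
for _ in range(200):
    u = rng.uniform(-1, 1)
    s2 = env_step(s, u)
    S.append(s); U.append(u); S_next.append(s2)
    s = s2

# 2. Fit the world model s' ~ a*s + b*u by least squares.
X = np.column_stack([S, U])
a_hat, b_hat = np.linalg.lstsq(X, np.array(S_next), rcond=None)[0]

# 3. "Dream": choose the action whose *predicted* next state is closest
#    to a target, without touching the real environment again.
target, s0 = 0.0, 2.0
candidates = np.linspace(-1, 1, 201)
u_star = candidates[np.argmin(np.abs(a_hat * s0 + b_hat * candidates - target))]
print(round(a_hat, 3), round(b_hat, 3), round(u_star, 2))
```

Step 3 is the part that matters for this note: the policy consults the fitted \((\hat a, \hat b)\), a reduced-rank conditional over the agent’s own action-consequences, rather than the world.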

The lineage continues through Dreamer (Hafner et al. 2024), MuZero (Schrittwieser et al. 2020), and the larger world-model-RL programme. See also world models.

3.2 Active inference and the self as generative model

Active inference (Friston et al. 2017; Parr, Pezzulo, and Friston 2022) treats the agent as a generative model of its own sensorium, including its own actions. Free-energy minimisation forces the self-model to be as compressed as is consistent with prediction — a direct rate-distortion pressure. The self here is a probabilistic model with the agent’s own observation-action trajectory as a latent.
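The variational free energy at the heart of this is a one-screen computation. A minimal sketch, not the full active-inference loop: two hidden states, one observation, and invented probabilities; it just exhibits that \(F = \mathbb{E}_q[\ln q(s) - \ln p(o, s)]\) is minimised by the exact posterior, at which point \(F = -\ln p(o)\).

```python
import numpy as np

p_s = np.array([0.5, 0.5])           # prior over hidden states (hypothetical)
p_o_given_s = np.array([0.9, 0.2])   # likelihood of the observed datum

def free_energy(q):
    """F = E_q[ln q(s) - ln p(o, s)]; an upper bound on surprisal -ln p(o),
    tight exactly when q is the posterior."""
    joint = p_o_given_s * p_s
    return float(np.sum(q * (np.log(q) - np.log(joint))))

# Exact posterior vs. a miscalibrated self-model:
posterior = p_o_given_s * p_s / np.sum(p_o_given_s * p_s)
print(round(free_energy(posterior), 4))              # = -ln p(o) = -ln 0.55
print(round(free_energy(np.array([0.5, 0.5])), 4))   # strictly larger
```

The rate-distortion pressure mentioned above is the gap between the two printed numbers: any extra bits in \(q\) beyond what prediction requires show up as excess free energy.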

3.3 Self-modelling robots

A beautifully concrete line: Bongard, Zykov, and Lipson’s Resilient Machines Through Continuous Self-Modeling (Bongard, Zykov, and Lipson 2006) — a quadruped robot that learns a forward model of its own body, then uses it to plan locomotion; when a limb is damaged, the model updates, and the robot recovers. The self-model is an explicit, parameterised, low-rank dynamical system. See the follow-up (Kwiatkowski and Lipson 2019) for differentiable variants.
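The recover-from-damage loop is simple enough to sketch with linear regression. All numbers here are hypothetical stand-ins for the robot’s physics: the “body” maps two motor commands to a displacement, and re-fitting on fresh data plays the role of continuous self-modelling.

```python
import numpy as np

rng = np.random.default_rng(1)

def body(u, damaged=False):
    """The true body: motor commands -> displacement. Damage disables limb 2."""
    gains = np.array([1.0, 0.0 if damaged else 2.0])
    return float(gains @ u)

def fit_self_model(damaged):
    """Regress displacement on motor commands: an explicit, low-rank,
    parameterised forward model of the agent's own body."""
    U = rng.uniform(-1, 1, size=(100, 2))
    y = np.array([body(u, damaged) for u in U])
    w, *_ = np.linalg.lstsq(U, y, rcond=None)
    return w

u_test = np.array([0.3, 0.4])
w = fit_self_model(damaged=False)
err_before = abs(body(u_test, damaged=True) - w @ u_test)  # stale self-model
w = fit_self_model(damaged=True)                           # continuous re-learning
err_after = abs(body(u_test, damaged=True) - w @ u_test)
print(round(err_before, 3), round(err_after, 3))
```

The large-then-small prediction error is the whole story: damage is detected as self-model surprise, and recovery is just refitting the compressed body model.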

3.4 Attention Schema Theory

Graziano (Graziano 2013; Graziano et al. 2019) argues that consciousness is the brain’s (incomplete, schematic) model of its own attention. This is explicitly a reduced-rank model: the schema is coarser than the machinery it represents, because representing attention in full would require as much machinery as attention itself. Wilterson and Graziano have begun to operationalise this in neural-network agents (Wilterson and Graziano 2021).

3.5 Schmidhuber and reflective learners

Schmidhuber’s early work on self-referential neural networks (Schmidhuber 1993) and later Gödel machines (Schmidhuber 2003) formalises learners that inspect and modify their own code, subject to provability constraints. The Gödel-machine construction is where the proof-theoretic aspect of self-modelling seems to cause grief: self-modification is gated by a proof that the modification improves expected utility.

3.6 Predictive coding and hierarchical self

Hierarchical predictive coding architectures (Rao and Ballard 1999; Clark 2013) include top-down predictions that span the whole sensory hierarchy, including proprioceptive and interoceptive signals — i.e., representations of the organism. See predictive coding.
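The core inference step of predictive coding fits in a loop. A one-layer sketch in the Rao-Ballard spirit (the weights and input are random illustrative values, and the generative weights are held fixed rather than learned): top-down predictions \(W r\) are compared with the input, and the latent estimate descends the prediction error.

```python
import numpy as np

rng = np.random.default_rng(2)

W = rng.normal(size=(8, 3))             # generative (top-down) weights, fixed
r_true = np.array([1.0, -0.5, 0.25])
x = W @ r_true                          # an input the model can explain exactly

r = np.zeros(3)                         # latent representation to infer
lr = 0.02
for _ in range(4000):
    error = x - W @ r                   # bottom-up prediction error (residual)
    r += lr * W.T @ error               # latent inference by error descent

print(np.round(r, 3))
```

Stacking such layers, with proprioceptive and interoceptive channels among the inputs, is what makes the top of the hierarchy a de facto model of the organism itself.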

4 Reducing fidelity of representation

Several toolkits formalise “reduced rank”:

  • Rate-distortion theory and the information bottleneck (Tishby, Pereira, and Bialek 2000; Alemi et al. 2019): prediction accuracy traded explicitly against bits of representation.
  • Successor representations and successor features (Dayan 1993; Barreto et al. 2017): compressed summaries of long-run behaviour under a policy.
  • Resource rationality and bounded optimality (Simon 1955; Russell and Subramanian 1995; Russell 2016; Lieder and Griffiths 2020): optimality subject to an explicit computational budget.
  • Policy compression and the information costs of cognition (Lai and Gershman 2021; Zénon, Solopchuk, and Pezzulo 2019).
  • Computational mechanics and ε-machines (Crutchfield and Young 1989; Shalizi and Crutchfield 2000): minimal sufficient statistics for prediction.
  • Capacity-limited ideal observers (Sims, Jacobs, and Knill 2012).

For multi-agent compressions specifically, see theory of mind as inverse reinforcement learning (Jara-Ettinger 2019) and accounts of conceptual alignment between interacting brains (Stolk, Verhagen, and Toni 2016).
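The rate-distortion knob can be made concrete with a few lines of Blahut-Arimoto. A toy sketch under stated assumptions: the four-state “other agent”, the uniform source, and Hamming distortion are all illustrative; as the multiplier \(\beta\) grows, the code spends more bits per agent and achieves lower distortion.

```python
import numpy as np

def blahut_arimoto(p_x, d, beta, iters=500):
    """Blahut-Arimoto for rate-distortion: compress source X into code Y.
    p_x is the source distribution, d[x, y] the distortion matrix, beta the
    Lagrange multiplier trading rate against distortion."""
    n = len(p_x)
    q_y = np.ones(n) / n                          # marginal of the code
    for _ in range(iters):
        # Optimal channel given the current code marginal:
        p_y_given_x = q_y * np.exp(-beta * d)     # shape (n_x, n_y)
        p_y_given_x /= p_y_given_x.sum(axis=1, keepdims=True)
        q_y = p_x @ p_y_given_x                   # re-estimate the marginal
    rate = float(np.sum(p_x[:, None] * p_y_given_x *
                        np.log2(p_y_given_x / q_y)))
    distortion = float(np.sum(p_x[:, None] * p_y_given_x * d))
    return rate, distortion

p_x = np.ones(4) / 4                              # four equiprobable "agents"
d = 1.0 - np.eye(4)                               # Hamming distortion
for beta in (0.5, 2.0, 8.0):
    r, dist = blahut_arimoto(p_x, d, beta)
    print(beta, round(r, 3), round(dist, 3))
```

Read the three printed rows as points on the rate-distortion frontier for social prediction: bits per other-model against prediction error.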

5 Modular / bicameral architectures

The bicameral intuition — “the mind is many minds talking” — has several formal incarnations, none of which cite Jaynes.

  • Minsky’s Society of Mind (Minsky 1986): not formal, but programmatic. Each “agent” is a specialist; the society is the mind.
  • Global Workspace Theory (Baars 1993; Dehaene 2014): many specialist modules compete for broadcast to a low-capacity global workspace. Formalised in neural-network terms by VanRullen & Kanai (VanRullen and Kanai 2021) and in RL by Goyal et al.’s Coordination Among Neural Modules Through a Shared Global Workspace (Goyal et al. 2021). The workspace is an explicit reduced-rank bottleneck through which modules exchange self-models and other-models.
  • Mixture of experts (Jacobs et al. 1991; Shazeer et al. 2017): gating networks that route inputs to specialists. When the experts have self-models of their own confidence and competence, we recover a society.
  • Dennett’s multiple drafts (Dennett 1991): philosophical, not formal, but the architectural proposal is that “the self” is a retrospectively constructed narrative over parallel processes — compatible with global-workspace formalisms.
  • Attention Schema Theory (above) fits naturally here.
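The bottleneck itself is easy to sketch. This is a toy caricature, not Goyal et al.’s architecture (which uses attention-based competition and learned write/read steps): here a salience argmax stands in for the competition, and concatenation for the broadcast.

```python
import numpy as np

rng = np.random.default_rng(3)

# Specialist modules each propose a message vector with a scalar salience;
# only the winner is broadcast, so every module conditions on the same
# low-bandwidth summary of the rest of the mind.
n_modules, d_message = 5, 4
messages = rng.normal(size=(n_modules, d_message))   # illustrative contents
salience = rng.normal(size=n_modules)

def workspace_broadcast(messages, salience):
    winner = int(np.argmax(salience))     # competition for workspace access
    return winner, messages[winner]       # the single broadcast message

winner, broadcast = workspace_broadcast(messages, salience)
# Each module now sees its own state plus the shared broadcast:
module_inputs = [np.concatenate([messages[i], broadcast])
                 for i in range(n_modules)]
print(winner, module_inputs[0].shape)
```

The reduced-rank character is structural: whatever the modules know about themselves or each other must squeeze through the \(d\)-dimensional broadcast.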

6 Self-referential agents and the proof-theoretic frontier

If we want to prove things about minds that model themselves, we hit self-reference, and self-reference hits Löb’s theorem and friends.

  • Reflective oracles (Fallenstein, Taylor, and Christiano 2015): a construction of probability distributions closed under self-reference, solving the naive inconsistency of an agent that reasons about its own beliefs.
  • Logical induction (Garrabrant et al. 2020): a market-based learner whose beliefs about its own future beliefs converge, evading the diagonal pathologies.
  • Löbian obstacle to self-trust (Yudkowsky and Herreshoff 2013): a self-modifying agent that is unwilling to endorse a successor with the same deductive power runs into Löb’s theorem.
  • Modal combat agents and program equilibrium (Bárász et al. 2014; Critch 2019): agents that condition on source-code-level models of each other; this admits bona fide equilibria in the one-shot prisoner’s dilemma.
  • AIXI and approximations (Hutter 2005; Leike et al. 2016): a formally optimal agent whose self-model is implicit in the universal prior; computable approximations (e.g., AIXI-tl, MC-AIXI) buy tractability at the cost of reducing the rank of the prior.

MIRI’s Agent Foundations programme is the main hub for this line.

7 Let’s attempt synthesis!

There are three independent knobs:

  1. Depth of recursion (levels of “I think that you think”): cut off cleanly by I-POMDPs and level-k.
  2. Rank of representation (bits spent per other, per self): quantified by rate-distortion, information bottleneck, successor features.
  3. Reflectivity (the model contains a model of itself): addressed by reflective oracles, logical induction, Gödel machines.

A minimally rich formalism of “a social entity containing reduced-rank models of other social entities and of itself” would specify all three. I don’t know of a paper that does this cleanly end-to-end. Candidates that come close:

  • Active inference with hierarchical generative models covers 2 and partially 3.
  • I-POMDPs with neural network model approximators (e.g., Han and Gmytrasiewicz 2019) covers 1 and 2.
  • Modular world models with opponent-shaping (LOLA-style inside a Dreamer-style world model) covers 1, 2, and implicitly 3.

8 Proof-and-implement directions

If one wanted to prove things about such minds:

  • Regret bounds for level-\(k\) agents against level-\(\ell\) opponents under rank-\(r\) representation. Does regret decompose into a recursion-depth term, a rank term, and an unavoidable opponent-class term?
  • Rate-distortion frontier for social prediction: for a population with known dynamics, what is the minimum bits per agent to achieve a given prediction accuracy? There is a small literature on this (Bilodeau, Foster, and Roy 2023) but nothing I find definitive.
  • Fixed-point existence for mutually-modelling agents under neural network function approximation — when do recursive self-consistency constraints admit tractable solutions?
  • Distillation as self-modelling: policy distillation from a capable teacher to a smaller student is a concrete reduction-of-rank operation. Does the student, in the limit, acquire a faithful-but-compressed self-model?

If one wanted to implement them:

  • Drop-in: ToMnet + Dreamer + LOLA in a single loop.
  • Ambitious: a global-workspace bottleneck shared across agent modules, where other-models and self-models live in the same reduced-rank latent space.
  • Wilder: reflective oracles as a probabilistic programming primitive, exposed to a neural policy as a callable.

9 Incoming

  • Hofstadter and Dennett’s The Mind’s I as a pre-formal reading list.
  • Hofstadter’s strange loops (Hofstadter 2008): evocative, not formal, but a useful bridge.
  • Tononi’s IIT as a measure on self-models, if one is feeling brave.
  • Whether the “observer self” of contemplative traditions corresponds to an attention-schema-style reduced-rank self-model — this seems to be Metzinger’s read.

10 References

Alemi, Fischer, Dillon, et al. 2019. Deep Variational Information Bottleneck.”
Baars. 1993. A Cognitive Theory of Consciousness.
Baker, Chris L., Jara-Ettinger, Saxe, et al. 2017. Rational Quantitative Attribution of Beliefs, Desires and Percepts in Human Mentalizing.” Nature Human Behaviour.
Baker, Chris, Saxe, and Tenenbaum. 2011. Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution.” Proceedings of the Annual Meeting of the Cognitive Science Society.
Bárász, Christiano, Fallenstein, et al. 2014. Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic.” arXiv.org.
Barreto, Dabney, Munos, et al. 2017. Successor Features for Transfer in Reinforcement Learning.” In Advances in Neural Information Processing Systems.
Bilodeau, Foster, and Roy. 2023. Minimax Rates for Conditional Density Estimation via Empirical Entropy.” The Annals of Statistics.
Bongard, Zykov, and Lipson. 2006. Resilient Machines Through Continuous Self-Modeling.” Science.
Camerer, Ho, and Chong. 2004. A Cognitive Hierarchy Model of Games.” The Quarterly Journal of Economics.
Clark. 2013. Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science.” Behavioral and Brain Sciences.
Critch. 2019. A Parametric, Resource-Bounded Generalization of Löb’s Theorem, and a Robust Cooperation Criterion for Open-Source Game Theory.” The Journal of Symbolic Logic.
Crutchfield, and Young. 1989. Inferring Statistical Complexity.” Physical Review Letters.
Dayan. 1993. Improving Generalization for Temporal Difference Learning: The Successor Representation.” Neural Computation.
Dekel, and Siniscalchi. 2015. Epistemic Game Theory.”
Fallenstein, Taylor, and Christiano. 2015. Reflective Oracles: A Foundation for Classical Game Theory.”
Feng, and Kirkley. 2020. Online Geolocalized Emotion Across US Cities During the COVID Crisis: Universality, Policy Response, and Connection with Local Mobility.” arXiv.org.
Foerster, Chen, Al-Shedivat, et al. 2018. Learning with Opponent-Learning Awareness.” In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. AAMAS ’18.
Friston, FitzGerald, Rigoli, et al. 2017. Active Inference: A Process Theory.” Neural Computation.
Gandhi, Franken, Gerstenberg, et al. 2023. Understanding Social Reasoning in Language Models with Language Models.” Neural Information Processing Systems.
Garrabrant, Benson-Tilsen, Critch, et al. 2020. Logical Induction.”
Gmytrasiewicz, and Doshi. 2005. A Framework for Sequential Planning in Multi-Agent Settings.” Journal of Artificial Intelligence Research.
Goyal, Didolkar, Lamb, et al. 2021. Coordination Among Neural Modules Through a Shared Global Workspace.” International Conference on Learning Representations.
Graziano. 2013. Consciousness and the Social Brain.
Graziano, Guterstam, Bio, et al. 2019. Toward a Standard Model of Consciousness: Reconciling the Attention Schema, Global Workspace, Higher-Order Thought, and Illusionist Theories.” Cognitive Neuropsychology.
Hafner, Pasukonis, Ba, et al. 2024. Mastering Diverse Domains Through World Models.”
Han, and Gmytrasiewicz. 2019. IPOMDP-Net: A Deep Neural Network for Partially Observable Multi-Agent Planning Using Interactive POMDPs.” In AAAI Conference on Artificial Intelligence.
Ha, and Schmidhuber. 2018. World Models.” arXiv.org.
Hofstadter. 2008. I Am a Strange Loop.
Hutter. 2005. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Texts in Theoretical Computer Science.
Jacobs, Jordan, Nowlan, et al. 1991. Adaptive Mixtures of Local Experts.” Neural Computation.
Jara-Ettinger. 2019. Theory of Mind as Inverse Reinforcement Learning.” Current Opinion in Behavioral Sciences.
Kosinski. 2023. Theory of Mind May Have Spontaneously Emerged in Large Language Models.”
———. 2024. Evaluating Large Language Models in Theory of Mind Tasks.” Proceedings of the National Academy of Sciences.
Kwiatkowski, and Lipson. 2019. Task-Agnostic Self-Modeling Machines.” Science Robotics.
Lai, and Gershman. 2021. Policy Compression: An Information Bottleneck in Action Selection.” In Psychology of Learning and Motivation.
Leike, Lattimore, Orseau, et al. 2016. “Thompson Sampling Is Asymptotically Optimal in General Environments.” In Conference on Uncertainty in Artificial Intelligence.
Lieder, and Griffiths. 2020. Resource-Rational Analysis: Understanding Human Cognition as the Optimal Use of Limited Computational Resources.” Behavioral and Brain Sciences.
Lu, Willi, Witt, et al. 2022. Model-Free Opponent Shaping.” In Proceedings of the 39th International Conference on Machine Learning.
Mao, Liu, Ni, et al. 2024. A Review on Machine Theory of Mind.” IEEE Transactions on Computational Social Systems.
McKelvey, and Palfrey. 1995. Quantal Response Equilibria for Normal Form Games.” Games and Economic Behavior.
Metzinger. 2003. Being No One: The Self-Model Theory of Subjectivity.
Minsky. 1986. The Society of Mind.
Oguntola, Campbell, Stepputtis, et al. 2023. Theory of Mind as Intrinsic Motivation for Multi-Agent Reinforcement Learning.” arXiv.org.
Parr, Pezzulo, and Friston. 2022. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior.
Rabinowitz, Perbet, Song, et al. 2018. “Machine Theory of Mind.” In International Conference on Machine Learning.
Rao, and Ballard. 1999. Predictive Coding in the Visual Cortex: A Functional Interpretation of Some Extra-Classical Receptive-Field Effects.” Nature Neuroscience.
Russell, Stuart. 2016. Rationality and Intelligence: A Brief Update.”
Russell, S. J., and Subramanian. 1995. Provably Bounded-Optimal Agents.” Journal of Artificial Intelligence Research.
Schmidhuber. 1993. A ‘Self-Referential’ Weight Matrix.”
———. 2003. Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements.” arXiv.org.
Schrittwieser, Antonoglou, Hubert, et al. 2020. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.” Nature.
Sclar, Kumar, West, et al. 2023. Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker.” In Annual Meeting of the Association for Computational Linguistics.
Shalizi, and Crutchfield. 2000. Computational Mechanics: Pattern and Prediction, Structure and Simplicity.”
Shazeer, Mirhoseini, Maziarz, et al. 2017. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.”
Simon. 1955. A Behavioral Model of Rational Choice.” The Quarterly Journal of Economics.
Sims, Jacobs, and Knill. 2012. An Ideal Observer Analysis of Visual Working Memory.” Psychological Review.
Stolk, Verhagen, and Toni. 2016. Conceptual Alignment: How Brains Achieve Mutual Understanding.” Trends in Cognitive Sciences.
Tishby, Pereira, and Bialek. 2000. The Information Bottleneck Method.”
Ullman. 2023. Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks.”
VanRullen, and Kanai. 2021. Deep Learning and the Global Workspace Theory.” Trends in Neurosciences.
Willi, Letcher, Treutlein, et al. 2022. COLA: Consistent Learning with Opponent-Learning Awareness.” In Proceedings of the 39th International Conference on Machine Learning.
Wilterson, and Graziano. 2021. The Attention Schema Theory in a Neural Network Agent: Controlling Visuospatial Attention Using a Descriptive Model of Attention.” Proceedings of the National Academy of Sciences.
Yudkowsky, and Herreshoff. 2013. Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.” Early Draft MIRI.
Zénon, Solopchuk, and Pezzulo. 2019. An Information-Theoretic Perspective on the Costs of Cognition.” Neuropsychologia.
Zhao, Lu, Grosse, et al. 2022. Proximal Learning With Opponent-Learning Awareness.” Neural Information Processing Systems.