Homunculi all the way down

Formal models of minds that model themselves and each other

2026-04-15 — 2026-04-21

Wherein a Budget of Finite Compute Is Found to Be Divided Among Recursion Depth, Model Fidelity, and Reflective Bookkeeping, With Existing Formalisms Revealed as Corner Solutions Spending on One Axis Alone.

agent foundations

bounded compute

cooperation

learning

mind

neural nets

probability

reinforcement learning

quagmire

theory of mind

when to compute

wonk

Research-background notes. I want to pin down what it would mean, formally, for a social entity to contain a reduced-rank model of another social entity, possibly even a reduced-rank model of itself. Here are the formalisms I’m aware of, drawing on LLM lit review, some PDFs I had in a folder, and some vibes-y dot points I sketched out at the PIBBSS x ILIAD residency.

Adjacent run-ups

One of several notebooks I started on the same underlying problem. The others are agency_bounded_compute (the foundational why-must-the-agent-compress angle) and ai_economics_of_cognition (compute as a substitutable factor of production). I think that I have run into a dead end with this one. My current working hypothesis is that I will have more success starting from mechanised causal graphs.

1 A phenomenon of note

A mind modelling another mind is an agent embedded in an environment that contains agents with comparable representational capacity to itself. If the only faithful model of Alice is Alice, then Bob cannot fit one in his head. A practical Bob must carry a compressed Alice: fewer parameters, coarser predictions, maybe with a cartoon-level ontology. Call this a reduced-rank other-model.¹

Bob must also act, and acting well might require that Bob predict his own future behaviour. If the only faithful model of Bob is Bob, he cannot fit one of those in his head either. So Bob carries a “reduced-rank” self-model. This self-model is what Metzinger calls the phenomenal self-model (Metzinger 2003), what Graziano’s Attention Schema Theory (Graziano 2013) makes into a neural control-theoretic object, and what Schmidhuber-flavoured AI calls a world-model-containing-self (Ha and Schmidhuber 2018).

The bicameral-mind literature (Jaynes 1976) sounds like it’s somewhat related — the sense that “I” am addressed by a voice that is also “I” — but it doesn’t seem formal enough to build on, so I will basically ignore it here.

I can think of three axes along which theories might vary:

Other-modelling. How do formalisms represent nested belief (“I think that you think that I think…”)?
Self-modelling. How do formalisms represent an agent that contains a compressed simulacrum of itself?
Reduced rank. By rank I mean the computational fidelity of a sub-model — the bits or parameters devoted to it, equivalently the resolution at which it can discriminate the situations it cares about. How is this reduction made rigorous — rate-distortion, PAC bounds, etc.?

2 Other-models

2.1 Interactive POMDPs

Gmytrasiewicz and Doshi’s interactive partially observable Markov decision processes (I-POMDPs) (Gmytrasiewicz and Doshi 2005) provide one formulation of recursive belief. The state space of any one agent is augmented with models of the other agents, which themselves include models of this agent, and so on. A finitely-nested I-POMDP truncates the recursion at level \(k\) — agents at level 0 treat others as noise; level 1 models level-0s; level 2 models level-1s; and so on. This is an operationalisation of “reduced rank”: the recursion is cut off, and the depth is tunable.

Suggestively related work in game theory:

Level-k / cognitive hierarchy models in behavioural game theory (Camerer, Ho, and Chong 2004), where players assume others are reasoning at a lower level than themselves. Humans reportedly cluster at \(k \in \{0,1,2\}\).
Quantal response equilibrium (McKelvey and Palfrey 1995), where bounded rationality is modelled by stochastic best-response rather than a deeper recursion.
Epistemic game theory (Dekel and Siniscalchi 2015), which formalises common knowledge, common belief, and the belief hierarchies above.
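To make the truncated recursion concrete, here is a minimal sketch of level-k and cognitive-hierarchy reasoning in the p-beauty contest ("guess p times the average guess"). The game, the level-0 anchor of 50, the value of `tau`, and all function names are my illustrative assumptions rather than anything taken from the cited papers. The best response is approximated as p times the believed mean, which ignores a player's own effect on the average.

```python
import math


def level_k_guess(k, p=2 / 3, level0_guess=50.0):
    """Pure level-k: each level best-responds to the level just below it."""
    guess = level0_guess
    for _ in range(k):
        # Approximate best response to a population of level-(k-1) players.
        guess = p * guess
    return guess


def cognitive_hierarchy_guesses(max_level, tau=1.5, p=2 / 3, level0_guess=50.0):
    """Cognitive hierarchy: level k best-responds to a Poisson(tau) mixture
    over levels 0..k-1, renormalised over those lower levels."""
    poisson = [
        math.exp(-tau) * tau**j / math.factorial(j) for j in range(max_level + 1)
    ]
    guesses = [level0_guess]
    for k in range(1, max_level + 1):
        weights = poisson[:k]
        believed_mean = sum(w * g for w, g in zip(weights, guesses)) / sum(weights)
        guesses.append(p * believed_mean)
    return guesses


for k, g in enumerate(cognitive_hierarchy_guesses(3)):
    print(f"level {k}: level-k guess {level_k_guess(k):.2f}, CH guess {g:.2f}")
```

In the pure level-k chain each extra level shrinks the guess by a factor of p. The cognitive-hierarchy player keeps some weight on level-0 opponents, so its guesses shrink more slowly with depth.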

2.2 Bayesian Theory of Mind

Baker, Saxe, and Tenenbaum (C. L. Baker et al. 2017; C. Baker, Saxe, and Tenenbaum 2011) formalise human social cognition as inverse planning: observers invert a generative model of rational action to infer the latent goals and beliefs of others. The other-model here is the generative model — typically a small MDP or POMDP parameterised by a utility and belief — and inference is Bayesian. This gives us a concrete posterior over other minds’ rationality and actions that one can compute with and prove things about.
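A toy version of inverse planning, as a sketch under loud assumptions: the world is a line of integer positions, the target agent is Boltzmann-rational with inverse temperature `beta`, and its only latent is which goal position it wants. None of these specifics come from the cited papers. The sketch only shows the shape of the inference: a forward model of noisily rational action, inverted by Bayes' rule.

```python
import math

ACTIONS = (-1, +1)  # step left or step right on the line


def action_likelihood(state, action, goal, beta=2.0):
    """P(action | state, goal) for a Boltzmann-rational agent that prefers
    actions which end closer to its goal."""
    scores = {a: -beta * abs((state + a) - goal) for a in ACTIONS}
    normaliser = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[action]) / normaliser


def goal_posterior(start, actions, goals, prior=None, beta=2.0):
    """Posterior over candidate goals after watching a sequence of actions."""
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    log_post = {g: math.log(prior[g]) for g in goals}
    state = start
    for a in actions:
        for g in goals:
            log_post[g] += math.log(action_likelihood(state, a, g, beta))
        state += a
    # Normalise in log space for numerical stability.
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}


posterior = goal_posterior(start=5, actions=[+1, +1, +1], goals=[0, 10])
print(posterior)
```

The other-model here is two numbers (a goal and a rationality parameter), which is about as reduced-rank as an intentional model can get.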

2.3 Machine Theory of Mind

Rabinowitz et al.’s ToMnet (Rabinowitz et al. 2018) is a deep-learning analogue: a meta-learning agent that, from a few observations of a target agent, infers an embedding which predicts the target’s future behaviour. The embedding is the reduced-rank other-model. ToMnet variants have been extended to false-belief tasks and inverse-RL settings (Oguntola et al. 2023).

Oguntola and colleagues push ToMnet toward interpretability with Concept Bottleneck Models (Oguntola, Hughes, and Sycara 2021): the network is forced to predict named mental-state concepts (the opponent believes the door is locked, wants the key) before producing an action, so the recursive belief-state is in principle inspectable by a human overseer. The catch, as ever, is concept leakage — the net routes information around the bottleneck through residual pathways and re-hides the mental states the bottleneck was meant to expose (Margeloiu et al. 2021). The same pathology is plausibly what any \(\Phi\)-probe on a residual stream will face (below).

2.4 Opponent-modelling in multi-agent RL

This also has a game-theoretic shape if you squint at it, but with learning theory sprinkled on top.

LOLA (Learning with Opponent-Learning Awareness) (Foerster et al. 2018) computes gradients through a model of the opponent’s learning dynamics. This is a differentiable other-model.
COLA, POLA, M-FOS and successors refine LOLA with higher-order or policy-level models (Willi et al. 2022; Zhao et al. 2022; Lu et al. 2022).
Opponent shaping more generally treats the other agent as a learnable dynamical system, which is a particular operationalisation of “the other is a reduced-rank version of me”.
Self Other-Modelling (SOM) (Raileanu et al. 2018) takes the last bullet literally: the agent uses its own policy to predict the other agent’s actions, and online-updates a belief over the other’s hidden goal. A single generative model is reconfigured between egocentric and allocentric readings of the same observation. This is computational simulation theory — model the other by running your own machinery with the other’s sensors wired in — and it induces cooperation in imperfect-information resource-gathering without reward shaping or explicit communication. In the budget frame below it is the corner case where the self-model is the other-model and pays for both with the same bits.

2.5 LLMs and emergent theory of mind

The question of whether transformer language models contain an implicit theory of mind has generated a cottage industry (Kosinski 2023; Ullman 2023; Sclar et al. 2023; Gandhi et al. 2023). The answer seems to be that they carry shallow heuristics that look like ToM on canonical tasks and break on adversarial ones. Whatever they do have is, almost by construction, a reduced-rank model: compressed into attention patterns and residual-stream features.

2.6 Mechanised causal graphs

Causal games (Hammond et al. 2023) and mechanised causal Bayesian networks (MacDermott, Everitt, and Belardinelli 2023) supply the graph-theoretic vocabulary for the formalisms above. Each object variable \(V\) acquires a mechanism parent \(\tilde{V}\): if \(V\) is an agent’s decision \(D\), then \(\tilde{D}\) is its decision rule. The strategic structure — who observes what, who chooses what — lives at the mechanism layer, and reasoning about an agent becomes reasoning over a graph of mechanism variables connected by edges of strategic relevance.

Two things matter for us.

One, what Bob carries of Alice has to be a model of Alice’s mechanism, not of Alice’s output. A pure action-predictor short-circuits the recursion: any Alice who outsmarts the predictor falsifies it, and the graph picks up a cycle (MacDermott, Everitt, and Belardinelli 2023). This is a graphical criterion for what counts as a model-of-another-agent in the first place, and it rhymes with the observer-relative construction in Virgo et al. (2025) at internal models. Different formalisms, same shape of constraint — what Bob represents of Alice is a map to mechanism-space, not to action-space.

Two, one rung of mechanism-level modelling is native to the framework, but deeper recursion is not — there is no \(\tilde{\tilde{D}}\). Hammond et al. recover depth by unrolling a mechanised MAID into an extensive-form game, with the familiar exponential blow-up in tree size. In the budget frame below that exponential is the cost of depth in this particular vocabulary — nodes rather than, say, bits.

3 Self-models: the formal landscape

3.1 World models containing self

Figure 2: Knowing thyself through reflection

A direct ML instantiation is Ha & Schmidhuber’s World Models (Ha and Schmidhuber 2018), where a recurrent latent model predicts both environment dynamics and the consequences of the agent’s own actions. The agent’s policy is trained inside this compressed dreamscape. The self here is a reduced-rank conditional — “what would my controller do, given this latent” — rather than an introspectable entity, but the “compression” part sounds well-posed.

The lineage continues through Dreamer (Hafner et al. 2024), MuZero (Schrittwieser et al. 2020), and the larger world-model-RL programme.

3.2 Active inference and the self as generative model

Active inference (Friston et al. 2017; Parr, Pezzulo, and Friston 2022) treats the agent as a generative model of its own sensorium, including its own actions. Free-energy minimisation forces the self-model to be as compressed as is consistent with prediction — a direct rate-distortion pressure. The self here is a probabilistic model with the agent’s own observation-action trajectory as a latent.

3.3 Self-modelling robots

A concrete line: Bongard, Zykov, and Lipson’s Resilient Machines Through Continuous Self-Modeling (Bongard, Zykov, and Lipson 2006) — a quadruped robot that learns a forward model of its own body, then uses it to plan locomotion; when a limb is damaged, the model updates, and the robot recovers. The self-model is an explicit, parameterised, low-rank dynamical system. See the follow-up (Kwiatkowski and Lipson 2019) for differentiable variants.

3.4 Attention Schema Theory

Graziano (Graziano 2013; Graziano et al. 2019) argues that consciousness is the brain’s (incomplete, schematic) model of its own attention. This is explicitly a reduced-rank model: the schema is coarser than the machinery it represents, because representing attention in full would require as much machinery as attention itself. Wilterson and Graziano have attempted to operationalise this in neural-network models (Wilterson and Graziano 2021).

3.5 Schmidhuber and reflective learners

Schmidhuber’s early work on self-referential neural networks (Schmidhuber 1993) and later Gödel machines (Schmidhuber 2003) formalise learners that inspect and modify their own code, subject to provability constraints. The Gödel-machine construction is where the proof-theoretic aspect of self-modelling seems to cause grief: self-modification is gated by a proof that the modification improves expected utility.

3.6 Predictive coding and hierarchical self

Hierarchical predictive coding architectures (Rao and Ballard 1999; Clark 2013) include top-down predictions that span the whole sensory hierarchy, including proprioceptive and interoceptive signals — i.e., representations of the organism. See predictive coding, again. And harder.

3.7 Reflective LLM agents

A lighter ML lineage treats self-reflection as a control loop over the decoder rather than as a learnt world-model. Reflexion (Shinn et al. 2023) keeps a log of past behaviour, self-critiques, and revised plans in the context window, so the agent adjusts across episodes without any gradient update — verbal reinforcement learning rather than the usual kind. Language Agent Tree Search (Zhou et al. 2024) extends this to a Monte-Carlo tree search over action paths, with an LLM-powered value function and self-reflection pruning unpromising branches. The self-model here is an in-context transcript and the reflection is a prompt — coarse compared with active inference or Dreamer, and with no mechanism for the self-model to outlive the context. I include them as an existence proof for a reduced-rank self-model whose entire cost lives at inference time rather than in training, which is a distinct point on the budget frontier from everything above.

4 Reducing fidelity of representation

Several toolkits formalise “reduced rank”:

Rate-distortion theory applied to cognition (Sims, Jacobs, and Knill 2012; Zénon, Solopchuk, and Pezzulo 2019; Lai and Gershman 2021): the cost of mental representation is an information-theoretic rate, the benefit is task performance, and optimal bounded agents sit on the rate-distortion frontier.
Information bottleneck (Tishby, Pereira, and Bialek 2000; Alemi et al. 2019): compress inputs to a latent that is maximally informative about a downstream variable. When the downstream variable is “the other agent’s next action”, the bottleneck induces a reduced-rank other-model.
Resource-rational analysis (Lieder and Griffiths 2020): agents are optimal given bounded compute; the bound is the reduction.
Successor representations / features (Dayan 1993; Barreto et al. 2017): compressed future-prediction models that generalise well across reward functions. A kind of reduced-rank self-model of one’s own policy.
Bounded rationality as a research programme (Simon 1955; S. J. Russell and Subramanian 1995; S. Russell 2016).
Epsilon-machines / computational mechanics (Crutchfield and Young 1989; Shalizi and Crutchfield 2000): minimal-sufficient-statistic models of a process. The causal states are the minimum-rank predictor.
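For the rate-distortion item in the list above, a minimal sketch of the standard Blahut-Arimoto iteration, which traces points on the rate-distortion frontier for a discrete source. The function name, the binary source, and the Hamming distortion are illustrative choices of mine. The parameter `beta` is the trade-off slope: larger values buy lower distortion at a higher rate.

```python
import numpy as np


def blahut_arimoto(p_x, distortion, beta, n_iter=200):
    """Return (rate in bits, expected distortion) at trade-off slope beta.

    p_x:        source distribution over symbols x, shape (n_x,)
    distortion: matrix d(x, x_hat), shape (n_x, n_hat)
    """
    p_x = np.asarray(p_x, dtype=float)
    distortion = np.asarray(distortion, dtype=float)
    n_hat = distortion.shape[1]
    q_hat = np.full(n_hat, 1.0 / n_hat)  # marginal over reconstructions
    kernel = np.exp(-beta * distortion)
    for _ in range(n_iter):
        # Optimal channel given the current marginal, then the marginal it induces.
        cond = q_hat[None, :] * kernel
        cond /= cond.sum(axis=1, keepdims=True)
        q_hat = p_x @ cond
    joint = p_x[:, None] * cond
    marginal = joint.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(joint > 0, np.log2(cond / marginal[None, :]), 0.0)
    rate_bits = float(np.sum(joint * log_ratio))  # mutual information I(X; X_hat)
    avg_distortion = float(np.sum(joint * distortion))
    return rate_bits, avg_distortion


source = np.array([0.5, 0.5])
hamming = 1.0 - np.eye(2)
for beta in (0.5, 1.0, 2.0, 4.0):
    rate, dist = blahut_arimoto(source, hamming, beta)
    print(f"beta={beta}: rate={rate:.3f} bits, distortion={dist:.3f}")
```

Sweeping `beta` gives one frontier point per value. For a reduced-rank other-model one would replace the Hamming matrix with a task loss, for example the cost of mispredicting the other agent's next action.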

My research agent further recommends we look at theory of mind as inverse reinforcement learning (Jara-Ettinger 2019) and the recent graph-theoretic accounts of social abstraction (Stolk, Verhagen, and Toni 2016).

5 Multi-agent self

The bicameral intuition — “the mind is many minds talking” — turns up across Minsky’s Society of Mind (Minsky 1986), Global Workspace Theory (Baars 1993; Dehaene 2014) and its neural-network and RL formalisations (VanRullen and Kanai 2021; Goyal et al. 2021), mixture of experts (Jacobs et al. 1991; Shazeer et al. 2017), and Dennett’s multiple drafts (Dennett 1993). They share an architectural claim: the “self” is what falls out of specialist modules competing for a low-capacity shared bottleneck, and that bottleneck is where the reduced-rank self- and other-models have to sit. Attention Schema Theory (above) fits here too. Fuller treatment at multi-agent self.

6 Self-referential agents and the proof-theoretic frontier

This is where reflectivity — the pressure for an agent’s beliefs about its own future beliefs to be consistent — becomes a first-class concern. If we want to prove things about minds that model themselves, we hit self-reference, and self-reference hits Löb’s theorem and friends if we’re not careful. As presaged, I am not super excited about the parts of this line of work that edge into unbounded compute.

Reflectivity is operationalised differently by each of the constructions below, and each buys self-consistency in a different coin — fixed-point iteration, market mixing, proof search — so its “cost” lives on a different axis depending on who you ask. I’ll treat its cost as real but construction-dependent.

The self-consistency cluster. Reflective oracles (Fallenstein, Taylor, and Christiano 2015) (a fixed-point construction of probability distributions closed under self-reference), logical induction (Garrabrant et al. 2020) (a market-based learner whose beliefs about its own future beliefs converge), and the Löbian obstacle (Yudkowsky and Herreshoff 2013) (the negative result that, by Löb’s theorem, an agent which demands proof that its successor is safe cannot endorse a successor reasoning in a proof system as strong as its own): three constructions attacking the same self-consistency problem in three different coins.
Modal combat agents and program equilibrium (Bárász et al. 2014; Critch 2019): agents that condition on source-code-level models of each other. The setting admits bona fide cooperative equilibria in the one-shot prisoner’s dilemma.
AIXI and approximations (Hutter 2005; Leike et al. 2016): a formally optimal agent whose self-model is implicit in the universal prior; computable approximations (e.g., AIXI-tl, MC-AIXI) buy tractability at the cost of reducing the rank of the prior.
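To illustrate the program-equilibrium item above, here is a sketch of the crudest source-code-conditioning agent: a "CliqueBot" that cooperates only with exact copies of itself. This is plain syntactic matching, not the provability-logic FairBot construction of Bárász et al., which cooperates with a much wider class of opponents. The payoff numbers and all names are illustrative assumptions.

```python
# Programs are Python source strings defining act(my_source, their_source).
CLIQUE_BOT = (
    "def act(my_source, their_source):\n"
    "    return 'C' if their_source == my_source else 'D'\n"
)
DEFECT_BOT = (
    "def act(my_source, their_source):\n"
    "    return 'D'\n"
)

# A standard prisoner's dilemma payoff table (row player, column player).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def run(program_source, opponent_source):
    """Execute a program, handing it its own source and its opponent's."""
    namespace = {}
    exec(program_source, namespace)
    return namespace["act"](program_source, opponent_source)


def play(source_a, source_b):
    """One-shot game between two programs that can read each other's source."""
    moves = (run(source_a, source_b), run(source_b, source_a))
    return moves, PAYOFF[moves]


print(play(CLIQUE_BOT, CLIQUE_BOT))
print(play(DEFECT_BOT, CLIQUE_BOT))
```

Two CliqueBots cooperate, and any program that deviates from CliqueBot is met with defection, so it cannot do better than the mutual-defection payoff. The cost of this construction is brittleness: a semantically identical program with different whitespace is treated as a stranger, which is the problem the modal agents address.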

MIRI’s Agent Foundations programme is the main hub for this line of work.

7 Let’s attempt synthesis!

The three axes above — other-modelling, self-modelling, reduced-rank — are not independent knobs an architect sets arbitrarily. I’d argue they are three demands on the compute embedded in a single joint representation carried by a social agent — depth of recursion, rank of each sub-model, and reflectivity — competing for the same finite resources. This is of a piece with my general heuristic that we should always think about where to spend our compute to make sense of the AI landscape as it exists.

“Compute” here is a cover term: it lumps together bits of representation at rest, operations per decision, and data to fit the representation in the first place. These don’t fully interchange, and nobody defines a fungible unit that resolves the trade-offs cleanly AFAIK. For the argument that follows the weaker claim is enough: the three axes all draw on a shared pool, and each spends it through its own, largely unknown, return on investment. Rate-distortion theory gives a convex, non-increasing \(R(D)\) for rank, which is to say diminishing returns to each extra bit (Sims, Jacobs, and Knill 2012); the analogous curves for recursion depth and reflectivity are open problems.

On this reading the existing literature is a tour of corner solutions: each framework spends its compute on one or two axes and lets the rest go to zero.

I-POMDPs and level-\(k\) / cognitive hierarchy: spend on depth; each level is a cartoon of the next. Humans clustering at \(k \in \{0,1,2\}\) is consistent with a small budget.
Rate-distortion cognition, information bottleneck, successor features: spend on rank at fixed depth (\(k=1\)); reflectivity ignored.
Reflective oracles, logical induction, Gödel machines: spend on reflectivity; other-structure kept simple enough that the construction survives.
Active inference: allocation across self- and world-model; depth and reflectivity implicit.
LOLA / opponent shaping: \(k=2\); rank inherited from the opponent’s parameter count.
World models (Ha/Schmidhuber, Dreamer, MuZero): storage in the latent dimension; \(k=1\); reflectivity absent.
ToMnet and IPOMDP-Net (Han and Gmytrasiewicz 2019): learned reduced-rank other-models; \(k \le 2\); reflectivity absent.
SOM (Raileanu et al. 2018): \(k=1\) with self-model and other-model sharing bits; reflectivity absent. A case of the axes interacting rather than competing.
Reflexion / LATS (Shinn et al. 2023; Zhou et al. 2024): inference-time reflectivity over an in-context self-model; no learnt other-model; rank bounded by the context window.

7.1 Depth as a profile

The “depth” axis above reads as a single number \(k\) in level-\(k\) and I-POMDP treatments, but that collapses some structure worth pulling apart. Bob regulates a scene that contains Alice; there is a natural partial ordering on what kind of thing Alice can be inside Bob’s world-model.

Type 0: Alice as noise, a draw from a distribution over behaviours. No agent-shaped variables in Bob’s ontology.
Type 1: Alice as a stateful process, with memory, non-Markovian dynamics, possibly a “type” Bob is inferring. Not intentional.
Type 2: Alice as a belief-carrier whose model of Bob is type-0. She has goals, she is optimising against a world-model, but that world-model contains Bob only as noise.
Type 3: Alice as a belief-carrier whose Bob-model is type-1. She conditions on Bob’s type but does not treat him as intentional.
Type 4: Alice as a belief-carrier whose Bob-model is itself belief-carrying. Recursion; the same classification applies at the next rung.

Scalar \(k\) is the uniform case: every rung up to depth \(k\) is type-4, and the chain terminates in type-0 at the bottom. What the scalar flattens is the profile — a sequence of (ontology type, rank) pairs along the chain, which in general need not be uniform. In particular it throws away asymmetric drop-off: “Alice thinks I am nearly noise” (type-2 terminal) and “Alice thinks I am stateful” (type-3 terminal) are both depth-2 in scalar terms, but play differently. In the first Bob has slack to act unpredictably; in the second Alice is already conditioning on his type, and only type-misrepresentation is exploitable. Budget-constrained agents probably allocate non-uniformly — most of their rank on the first rung or two, terminating early into type 0 or 1. The human plateau at \(k \in \{0,1,2\}\) is at least as consistent with this as with “humans can’t recurse further”, and the shallow-but-lopsided flavour of LLM-ToM failure (Ullman 2023) looks similar.

Two scalars suggest themselves in place of \(k\) and they answer different questions: chain length (the traditional quantity), and total rank spent on reflection, \(\sum_i R_i\) along the chain. The budget frame of this post is more naturally the second; the level-\(k\) literature has mostly used the first.²
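A sketch of the profile as a data structure, to make the two scalars concrete. Everything here is my own bookkeeping convention rather than part of any cited formalism: one rung per nested model (Bob's model of Alice, then Alice's model of Bob inside it), rank measured in hypothetical "bits", and made-up rank values.

```python
from dataclasses import dataclass
from enum import IntEnum


class Ontology(IntEnum):
    """The five ontology types from the list above, numbered from zero."""
    NOISE = 0                 # a draw from a distribution over behaviours
    STATEFUL = 1              # memory or a latent type, not intentional
    BELIEF_OVER_NOISE = 2     # goal-directed, models the other as type 0
    BELIEF_OVER_STATEFUL = 3  # goal-directed, models the other as type 1
    BELIEF_OVER_BELIEF = 4    # models the other as belief-carrying


@dataclass(frozen=True)
class Rung:
    ontology: Ontology
    rank_bits: float  # hypothetical unit for the fidelity spent on this rung


@dataclass(frozen=True)
class Profile:
    rungs: tuple  # outermost model first

    def chain_length(self):
        """The traditional scalar: how many nested models there are."""
        return len(self.rungs)

    def total_rank(self):
        """The budget scalar: the sum of R_i along the chain."""
        return sum(r.rank_bits for r in self.rungs)


# "Alice thinks I am nearly noise" versus "Alice thinks I am stateful".
nearly_noise = Profile((
    Rung(Ontology.BELIEF_OVER_NOISE, 6.0),
    Rung(Ontology.NOISE, 1.0),
))
stateful = Profile((
    Rung(Ontology.BELIEF_OVER_STATEFUL, 6.0),
    Rung(Ontology.STATEFUL, 3.0),
))

for name, profile in (("nearly_noise", nearly_noise), ("stateful", stateful)):
    print(name, profile.chain_length(), profile.total_rank())
```

The two profiles have the same chain length but different total rank and different bottom rungs, which is the information a scalar depth throws away.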

Humans, LLMs, and the agents we would actually like to build do not sit at any corner. They are interior points, and the interior is where unwritten papers live.

8 Angles of attack

I got an LLM to ideate the ideas below for me. Lightly edited for baseline sanity. The budget view suggests several angles of attack on what we actually care about: understanding how real cognition, human or machine, allocates its compute across world, self, and other. Two concrete ones follow, plus a brief aside on scaling the same calculus to collectives.

8.1 Finding \(\Phi\) in a transformer

The synthesis above says a bounded social agent carries a structured latent, call it \(\Phi\), partitioned into world, self, each other agent, and reflective bookkeeping. The ToM-in-LLMs literature (Kosinski 2023; Ullman 2023) presently seems to treat these pieces as present-or-absent behavioural properties. If a frontier model has acquired any social competence under finite pretraining compute, there should be detectable structure in \(\Phi\) tracking these pieces: not necessarily a neat partition, because superposition lets a learner share parameters across sparse features, but some signature visible to the right probe.

That is a mechanistic-interpretability problem. Run sparse-autoencoder or dictionary-learning analysis on the residual stream during ToM-style tasks, and look for:

a low-rank subspace whose ablation selectively breaks co-agent prediction without breaking world-prediction;
a disjoint self-model subspace whose ablation breaks in-character behaviour but not factual recall;
approximately no reflective-bookkeeping subspace in base models, and a small one in post-RLHF models — which are rewarded for consistency with a remembered “I”, and so should be under pressure to allocate bits to reflective bookkeeping that a base model is not.
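The logic of the selective-ablation test in the first bullet can be sketched on synthetic data. The "activations" below are fabricated so that a two-dimensional "social" subspace and a two-dimensional "world" subspace are orthogonal by construction. In a real residual stream the candidate directions would have to come from a sparse autoencoder or probe, and nothing guarantees they separate this cleanly. All names are hypothetical.

```python
import numpy as np


def ablate_subspace(acts, basis):
    """Project activations onto the orthogonal complement of span(basis)."""
    q, _ = np.linalg.qr(basis)  # orthonormalise the candidate directions
    return acts - (acts @ q) @ q.T


def readout_r2(acts, directions, target):
    """Variance in the target explained by a fixed linear readout of the activations."""
    residual = target - acts @ directions
    return 1.0 - residual.var() / target.var()


rng = np.random.default_rng(0)
d_model, n_samples = 16, 500

# Synthetic ground truth: two orthogonal 2-d subspaces of a 16-d activation space.
full_basis, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
social_dirs, world_dirs = full_basis[:, :2], full_basis[:, 2:4]
social_latent = rng.normal(size=(n_samples, 2))
world_latent = rng.normal(size=(n_samples, 2))
acts = social_latent @ social_dirs.T + world_latent @ world_dirs.T

ablated = ablate_subspace(acts, social_dirs)
print("social readout before/after:",
      readout_r2(acts, social_dirs, social_latent),
      readout_r2(ablated, social_dirs, social_latent))
print("world readout before/after:",
      readout_r2(acts, world_dirs, world_latent),
      readout_r2(ablated, world_dirs, world_latent))
```

The signature to look for in a real model is the same asymmetry: a low-rank projection that destroys co-agent prediction while leaving world prediction essentially intact. With superposed features the separation would be partial at best.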

If the structure is not there, one of three things is wrong: the framework, our assumption that representational storage is actually binding for this class of model, or our belief that the pretraining signal actually rewards social prediction rather than a confound. Each possibility is informative. If the structure is there, we have an interpretability handle on social cognition specifically, rather than features in general.

8.2 A value-of-reflectivity calculation

The MIRI tradition treats reflectivity as a correctness property: a reasoner that cannot consistently model its own reasoning has a bug, and the whole agent foundations edifice is a search for bug-free constructions. The budget view recasts it as an economic question: when is it worth spending bits on reflective machinery, and when is it better to skip it?

An idea in this space is that reflectivity pays in proportion to the reactivity of the environment — the degree to which other agents, or future versions of oneself, condition on your internal commitments rather than just your past actions. Non-reactive one-shot stateless POMDPs: reflectivity is dispensable. Program-equilibrium and modal-combat settings (Bárász et al. 2014; Critch 2019): reflectivity must be high, because opponents are literally reading your source. Open-ended self-modification (Gödel-machine style): Löbian obstacles arise exactly because reflectivity has to be preserved while the substrate is mutating underneath it.

A testable version of this, without the formalism: build toy environments in each regime — a non-reactive bandit, a modal-combat tournament, a self-modification sandbox — and train the same architecture across all three. The conjecture predicts that reflective structure (by whatever detector the \(\Phi\) angle supplies) should emerge spontaneously in the second and third, and not the first. That is a scaling claim about spontaneous situational awareness under training pressure, and it falsifies in either direction.

Premakumar et al. (Premakumar et al. 2024) train networks with an auxiliary task of predicting their own internal activations and observe narrower weight distributions and reduced effective complexity — a “self-prediction-as-regularizer” effect. As a bonus the regularised networks are easier for other networks to model. If the effect replicates beyond their setting then reflectivity is paying for itself partly in rank elsewhere, which is the shape of trade-off the budget frame predicts. It also gives an interpretability-friendly story about alignment: agents incentivised to self-predict become more predictable to overseers, for free. The LessWrong writeup makes the alignment spin explicit.

8.3 Aside: collectives are instrumentable

The same budget calculus should scale past individual minds. A firm, team, or agency is a social agent carrying compressed models of itself and its counterparties — self-model embodied in policy and narrative, other-models in competitor dossiers and customer segments, reflectivity in governance and audit. This complements the transformer angle above: organisations are, if anything, easier to instrument than LLMs — meeting notes, decision logs, communication graphs, and promotion criteria are legible in a way residual-stream activations are not. Mapping those proxies to rank, depth, and reflectivity is a separate project, but the framework commits to specific organisational over-allocations (governance theatre, regulatory capture, siloed product teams, academic insularity) being allocation failures of the same kind as the cognitive ones, which gives two independent scales at which to play with it.

10 Incoming

Dennett and Hofstadter’s The Mind’s I as a pre-formal reading list.
Hofstadter’s strange loops (Hofstadter 2008): evocative, not formal, but a useful bridge.
Tononi’s IIT as a measure on self-models, if one is feeling brave.
Whether the “observer self” of contemplative traditions corresponds to an attention-schema-style reduced-rank self-model — this seems to be Metzinger’s read.

11 References

Alemi, Fischer, Dillon, et al. 2019. “Deep Variational Information Bottleneck.”

Baars. 1993. A Cognitive Theory of Consciousness.

Baker, Chris L., Jara-Ettinger, Saxe, et al. 2017. “Rational Quantitative Attribution of Beliefs, Desires and Percepts in Human Mentalizing.” Nature Human Behaviour.

Baker, Chris, Saxe, and Tenenbaum. 2011. “Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution.” Proceedings of the Annual Meeting of the Cognitive Science Society.

Bárász, Christiano, Fallenstein, et al. 2014. “Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic.” arXiv.org.

Barreto, Dabney, Munos, et al. 2017. “Successor Features for Transfer in Reinforcement Learning.” In Advances in Neural Information Processing Systems.

Berg, Plumb, and Ganin. 2024. “Self-Prediction Acts as an Emergent Regularizer.” LessWrong.

Bilodeau, Foster, and Roy. 2023. “Minimax Rates for Conditional Density Estimation via Empirical Entropy.” The Annals of Statistics.

Bongard, Zykov, and Lipson. 2006. “Resilient Machines Through Continuous Self-Modeling.” Science.

Camerer, Ho, and Chong. 2004. “A Cognitive Hierarchy Model of Games.” The Quarterly Journal of Economics.

Clark. 2013. “Whatever Next? Predictive Brains, Situated Agents, and the Future of Cognitive Science.” Behavioral and Brain Sciences.

Critch. 2019. “A Parametric, Resource-Bounded Generalization of Löb’s Theorem, and a Robust Cooperation Criterion for Open-Source Game Theory.” The Journal of Symbolic Logic.

Crutchfield, and Young. 1989. “Inferring Statistical Complexity.” Physical Review Letters.

Dayan. 1993. “Improving Generalization for Temporal Difference Learning: The Successor Representation.” Neural Computation.

Dehaene. 2014. Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts.

Dekel, and Siniscalchi. 2015. “Epistemic Game Theory.”

Dennett. 1993. Consciousness Explained.

Fallenstein, Taylor, and Christiano. 2015. “Reflective Oracles: A Foundation for Classical Game Theory.”

Feng, and Kirkley. 2020. “Online Geolocalized Emotion Across US Cities During the COVID Crisis: Universality, Policy Response, and Connection with Local Mobility.” arXiv.org.

Foerster, Chen, Al-Shedivat, et al. 2018. “Learning with Opponent-Learning Awareness.” In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. AAMAS ’18.

Friston, FitzGerald, Rigoli, et al. 2017. “Active Inference: A Process Theory.” Neural Computation.

Gandhi, Franken, Gerstenberg, et al. 2023. “Understanding Social Reasoning in Language Models with Language Models.” Neural Information Processing Systems.

Garrabrant, Benson-Tilsen, Critch, et al. 2020. “Logical Induction.”

Gmytrasiewicz, and Doshi. 2005. “A Framework for Sequential Planning in Multi-Agent Settings.” Journal of Artificial Intelligence Research.

Goyal, Didolkar, Lamb, et al. 2021. “Coordination Among Neural Modules Through a Shared Global Workspace.” International Conference on Learning Representations.

Graziano. 2013. Consciousness and the Social Brain.

Graziano, Guterstam, Bio, et al. 2019. “Toward a Standard Model of Consciousness: Reconciling the Attention Schema, Global Workspace, Higher-Order Thought, and Illusionist Theories.” Cognitive Neuropsychology.

Hafner, Pasukonis, Ba, et al. 2024. “Mastering Diverse Domains Through World Models.”

Hammond, Fox, Everitt, et al. 2023. “Reasoning about Causality in Games.” Artificial Intelligence.

Han, and Gmytrasiewicz. 2019. “IPOMDP-Net: A Deep Neural Network for Partially Observable Multi-Agent Planning Using Interactive POMDPs.” In AAAI Conference on Artificial Intelligence.

Ha, and Schmidhuber. 2018. “World Models.” arXiv.org.

Hofstadter. 2008. I Am a Strange Loop.

Hutter. 2005. Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Texts in Theoretical Computer Science.

Jacobs, Jordan, Nowlan, et al. 1991. “Adaptive Mixtures of Local Experts.” Neural Computation.

Jara-Ettinger. 2019. “Theory of Mind as Inverse Reinforcement Learning.” Current Opinion in Behavioral Sciences.

Jaynes. 1976. The Origin of Consciousness in the Breakdown of the Bicameral Mind.

Kosinski. 2023. “Theory of Mind May Have Spontaneously Emerged in Large Language Models.” Research Papers.

———. 2024. “Evaluating Large Language Models in Theory of Mind Tasks.” Proceedings of the National Academy of Sciences.

Kwiatkowski, and Lipson. 2019. “Task-Agnostic Self-Modeling Machines.” Science Robotics.

Lai, and Gershman. 2021. “Policy Compression: An Information Bottleneck in Action Selection.” In Psychology of Learning and Motivation.

Leike, Lattimore, Orseau, et al. 2016. “Thompson Sampling Is Asymptotically Optimal in General Environments.” In Conference on Uncertainty in Artificial Intelligence.

Lieder, and Griffiths. 2020. “Resource-Rational Analysis: Understanding Human Cognition as the Optimal Use of Limited Computational Resources.” Behavioral and Brain Sciences.

Lu, Willi, Witt, et al. 2022. “Model-Free Opponent Shaping.” In Proceedings of the 39th International Conference on Machine Learning.

MacDermott, Everitt, and Belardinelli. 2023. “Characterising Decision Theories with Mechanised Causal Graphs.”

Mao, Liu, Ni, et al. 2024. “A Review on Machine Theory of Mind.” IEEE Transactions on Computational Social Systems.

Margeloiu, Ashman, Bhatt, et al. 2021. “Do Concept Bottleneck Models Learn as Intended?” In ICLR 2021 Workshop on Responsible AI.

McKelvey, and Palfrey. 1995. “Quantal Response Equilibria for Normal Form Games.” Games and Economic Behavior.

Metzinger. 2003. Being No One: The Self-Model Theory of Subjectivity.

Minsky. 1986. The Society of Mind.

Oguntola, Campbell, Stepputtis, et al. 2023. “Theory of Mind as Intrinsic Motivation for Multi-Agent Reinforcement Learning.” arXiv.org.

Oguntola, Hughes, and Sycara. 2021. “Deep Interpretable Models of Theory of Mind.” In IEEE International Symposium on Robot and Human Interactive Communication.

Parr, Pezzulo, and Friston. 2022. Active Inference: The Free Energy Principle in Mind, Brain, and Behavior.

Premakumar, Vaiana, Pop, et al. 2024. “Unexpected Benefits of Self-Modeling in Neural Systems.”

Rabinowitz, Perbet, Song, et al. 2018. “Machine Theory of Mind.” In International Conference on Machine Learning.

Raileanu, Denton, Szlam, et al. 2018. “Modeling Others Using Oneself in Multi-Agent Reinforcement Learning.” In Proceedings of the 35th International Conference on Machine Learning.

Rao, and Ballard. 1999. “Predictive Coding in the Visual Cortex: A Functional Interpretation of Some Extra-Classical Receptive-Field Effects.” Nature Neuroscience.

Russell, Stuart. 2016. “Rationality and Intelligence: A Brief Update.”

Russell, S. J., and Subramanian. 1995. “Provably Bounded-Optimal Agents.” Journal of Artificial Intelligence Research.

Schmidhuber. 1993. “A ‘Self-Referential’ Weight Matrix.”

———. 2003. “Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements.” arXiv.org.

Schrittwieser, Antonoglou, Hubert, et al. 2020. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.” Nature.

Sclar, Kumar, West, et al. 2023. “Minding Language Models’ (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker.” In Annual Meeting of the Association for Computational Linguistics.

Shalizi, and Crutchfield. 2000. “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” Journal of Statistical Physics.

Shazeer, Mirhoseini, Maziarz, et al. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.”

Shinn, Cassano, Berman, et al. 2023. “Reflexion: Language Agents with Verbal Reinforcement Learning.” In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).

Simon. 1955. “A Behavioral Model of Rational Choice.” The Quarterly Journal of Economics.

Sims, Jacobs, and Knill. 2012. “An Ideal Observer Analysis of Visual Working Memory.” Psychological Review.

Stolk, Verhagen, and Toni. 2016. “Conceptual Alignment: How Brains Achieve Mutual Understanding.” Trends in Cognitive Sciences.

Tishby, Pereira, and Bialek. 2000. “The Information Bottleneck Method.”

Ullman. 2023. “Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks.”

VanRullen, and Kanai. 2021. “Deep Learning and the Global Workspace Theory.” Trends in Neurosciences.

Virgo, Biehl, Baltieri, et al. 2025. “A ‘Good Regulator Theorem’ for Embodied Agents.”

Willi, Letcher, Treutlein, et al. 2022. “COLA: Consistent Learning with Opponent-Learning Awareness.” In Proceedings of the 39th International Conference on Machine Learning.

Wilterson, and Graziano. 2021. “The Attention Schema Theory in a Neural Network Agent: Controlling Visuospatial Attention Using a Descriptive Model of Attention.” Proceedings of the National Academy of Sciences.

Yudkowsky, and Herreshoff. 2013. “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.” Early Draft MIRI.

Zénon, Solopchuk, and Pezzulo. 2019. “An Information-Theoretic Perspective on the Costs of Cognition.” Neuropsychologia, Cognitive Effort.

Zhao, Lu, Grosse, et al. 2022. “Proximal Learning With Opponent-Learning Awareness.” Neural Information Processing Systems.

Zhou, Yan, Shlapentokh-Rothman, et al. 2024. “Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models.” In Proceedings of the 41st International Conference on Machine Learning (ICML 2024).

Footnotes

1. Alternatively it could be a full-rank model, which gets very weird and makes people worry about Löb’s theorem. That is not the main focus here.
2. An observer-relative formulation of what it means for Bob to have a model at all is in Virgo et al. (2025), sitting alongside the internal-model-principle thread at internal models. Their construction gives a floor on what counts as modelling: one among several plausible candidates, with differing divergence and ontology commitments.