Homunculi all the way down
Formal models of minds that model themselves and each other
2026-04-15 — 2026-04-21
Wherein a Budget of Finite Compute Is Found to Be Divided Among Recursion Depth, Model Fidelity, and Reflective Bookkeeping, With Existing Formalisms Revealed as Corner Solutions Spending on One Axis Alone.
Research-background notes. I want to pin down what it would mean, formally, for a social entity to contain a reduced-rank model of another social entity, or even a reduced-rank model of itself. Here are the formalisms I’m aware of, drawing on an LLM lit review, some PDFs I had in a folder, and some vibes-y dot points I sketched out at the PIBBSS x ILIAD residency.
One of several notebooks I started on the same underlying problem. The others are agency under bounded compute (the foundational why-must-the-agent-compress angle) and Economics of Cognition (compute as a substitutable factor of production). I think that I have run into a dead end with this one. My current working hypothesis is that I will have more success starting from mechanized causal graphs.
1 A phenomenon of note
A mind modelling another mind is an agent embedded in an environment that contains agents with comparable representational capacity to itself. If the only faithful model of Alice is Alice, then Bob cannot fit one in his head. A practical Bob must carry a compressed Alice: fewer parameters, coarser predictions, maybe with a cartoon-level ontology. Call this a reduced-rank other-model.1
Bob must also act, and acting well might require that Bob predict his own future behaviour. If the only faithful model of Bob is Bob, he cannot fit one of those in his head either. So Bob carries a “reduced-rank” self-model. This self-model is what Metzinger calls the phenomenal self-model (Metzinger 2003), what Graziano’s Attention Schema Theory (Graziano 2013) makes into a neural control-theoretic object, what Schmidhuber-flavoured AI calls a world-model-containing-self (Ha and Schmidhuber 2018), and what Hofstadter had been calling a strange loop since before any of them (Hofstadter 2008) — of which more below.
The bicameral-mind literature (Jaynes 1976) sounds like it’s somewhat related — the sense that “I” am addressed by a voice that is also “I” — but it doesn’t seem formal enough to build on, so I will basically ignore it here.
I can think of three axes along which theories might vary:
- Other-modelling. How do formalisms represent nested belief (“I think that you think that I think…”)?
- Self-modelling. How do formalisms represent an agent that contains a compressed simulacrum of itself?
- Reduced rank. By rank I mean the computational fidelity of a sub-model — the bits or parameters devoted to it, equivalently the resolution at which it can discriminate the situations it cares about. How is this reduction made rigorous — rate-distortion, PAC bounds, etc.?
2 Other-models
2.1 Interactive POMDPs
Gmytrasiewicz and Doshi’s interactive partially observable Markov decision processes (I-POMDPs) (Gmytrasiewicz and Doshi 2005) provide one formulation of recursive belief. The state space of any one agent is augmented with models of the other agents, which themselves include models of this agent, and so on. A finitely-nested I-POMDP truncates the recursion at level \(k\) — agents at level 0 treat others as noise; level 1 models level-0s; level 2 models level-1s; and so on. This is an operationalization of “reduced rank”: the recursion is cut off, and the depth is tunable.
Suggestively related work in game theory:
- Level-k / cognitive hierarchy models in behavioural game theory (Camerer, Ho, and Chong 2004), where players assume others are reasoning at a lower level than themselves. Humans reportedly cluster at \(k \in \{0,1,2\}\).
- Quantal response equilibrium (McKelvey and Palfrey 1995), where bounded rationality is modelled by stochastic best-response rather than a deeper recursion.
- Epistemic game theory (Dekel and Siniscalchi 2015), which formalizes common knowledge, common belief, and the belief hierarchies above.
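To make the level-\(k\) truncation concrete, here is a minimal sketch. It assumes a p-beauty contest (my choice of test bed, not something the sources above prescribe): every player names a number in [0, 100] and the winner is whoever is closest to \(p\) times the mean. Level 0 is treated as noise with mean guess 50, and a level-\(k\) player best-responds to a population of level-\((k-1)\) players. The function name and defaults are illustrative assumptions.

```python
def level_k_guess(k: int, p: float = 2 / 3, level0_mean: float = 50.0) -> float:
    """Guess of a level-k player in a p-beauty contest.

    Assumption: level 0 is modelled as noise with mean `level0_mean`, and a
    level-k player best-responds to a population made up entirely of
    level-(k-1) players, ignoring its own small effect on the mean.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    guess = level0_mean
    for _ in range(k):
        guess = p * guess  # best response to the level below
    return guess


if __name__ == "__main__":
    for k in range(4):
        print(k, round(level_k_guess(k), 2))
```

Depth is the only knob in this sketch: each rung is an exact best response, so it spends on recursion and nothing on the fidelity of the rung below.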
2.2 Bayesian Theory of Mind
Baker, Saxe, and Tenenbaum (C. L. Baker et al. 2017; C. Baker, Saxe, and Tenenbaum 2011) formalize human social cognition as inverse planning: observers invert a generative model of rational action to infer the latent goals and beliefs of others. The other-model here is the generative model (typically a small MDP or POMDP parameterized by a utility and belief) and inference is Bayesian. This gives us a concrete posterior over other minds’ goals and beliefs, and hence a predictive distribution over their actions, that one can compute with and prove things about.
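A toy version of inverse planning, as a sketch rather than the cited models: an agent walks on the integer line toward one of several candidate goals, chooses steps Boltzmann-rationally, and an observer applies Bayes' rule to the observed steps. The walker, the utility (minus distance to goal after the move), and the rationality parameter `beta` are all assumptions made for illustration.

```python
import math


def action_likelihood(state: int, action: int, goal: int, beta: float = 2.0) -> float:
    """P(action | state, goal) for a Boltzmann-rational walker on a line.

    Assumption: actions are -1 or +1 and utility is minus the distance to the
    goal after the move; beta is a hypothetical rationality parameter.
    """
    utils = {a: -abs(state + a - goal) for a in (-1, 1)}
    z = sum(math.exp(beta * u) for u in utils.values())
    return math.exp(beta * utils[action]) / z


def goal_posterior(start, actions, goals, prior=None, beta=2.0):
    """Posterior over candidate goals after watching a sequence of actions."""
    prior = prior or {g: 1.0 / len(goals) for g in goals}
    weights = {}
    for g in goals:
        w, state = prior[g], start
        for a in actions:
            w *= action_likelihood(state, a, g, beta)
            state += a
        weights[g] = w
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


if __name__ == "__main__":
    print(goal_posterior(start=0, actions=[1, 1, 1], goals=[-5, 5]))
```

The other-model here is two numbers per candidate goal, which is about as reduced-rank as a generative model of Alice can get while still supporting inference.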
2.3 Machine Theory of Mind
Rabinowitz et al.’s ToMnet (Rabinowitz et al. 2018) is a deep-learning analogue: a meta-learning agent that, from a few observations of a target agent, infers an embedding which predicts the target’s future behaviour. The embedding is the reduced-rank other-model. ToMnet variants have been extended to false-belief tasks and inverse-RL settings (Oguntola et al. 2023).
Oguntola and colleagues push ToMnet toward interpretability with Concept Bottleneck Models (Oguntola, Hughes, and Sycara 2021): the network is forced to predict named mental-state concepts (the opponent believes the door is locked, wants the key) before producing an action, so the recursive belief-state is in principle inspectable by a human overseer. The catch, as ever, is concept leakage — the net routes information around the bottleneck through residual pathways and re-hides the mental states the bottleneck was meant to expose (Margeloiu et al. 2021). The same pathology is plausibly what any \(\Phi\)-probe on a residual stream will face (below).
2.4 Opponent-modelling in multi-agent RL
This also has a game-theoretic shape if we squint at it, but with learning theory sprinkled on top.
- LOLA (Learning with Opponent-Learning Awareness) (Foerster et al. 2018) computes gradients through a model of the opponent’s learning dynamics. This is a differentiable other-model.
- COLA, POLA, M-FOS and successors refine LOLA with higher-order or policy-level models (Willi et al. 2022; Zhao et al. 2022; Lu et al. 2022).
- Opponent shaping more generally treats the other agent as a learnable dynamical system, which is a particular operationalization of “the other is a reduced-rank version of me”.
- Self Other-Modelling (SOM) (Raileanu et al. 2018) takes the last bullet literally: the agent uses its own policy to predict the other agent’s actions, and online-updates a belief over the other’s hidden goal. A single generative model is reconfigured between egocentric and allocentric readings of the same observation. This is computational simulation theory — model the other by running our own machinery with the other’s sensors wired in — and it induces cooperation in imperfect-information resource-gathering without reward shaping or explicit communication. In the budget frame below it is the corner case where the self-model is the other-model and pays for both with the same bits.
2.5 LLMs and emergent theory of mind
The question of whether transformer language models contain an implicit theory of mind has generated a cottage industry (Kosinski 2023; Ullman 2023; Sclar et al. 2023; Gandhi et al. 2023). The answer seems to be that they carry shallow heuristics that look like ToM on canonical tasks and break on adversarial ones. Whatever they do have is, almost by construction, a reduced-rank model: compressed into attention patterns and residual-stream features.
2.6 Mechanized causal graphs
Causal games (Hammond et al. 2023) and mechanized causal Bayesian networks (MacDermott, Everitt, and Belardinelli 2023) supply the graph-theoretic vocabulary for the formalisms above. Each object variable \(V\) acquires a mechanism parent \(\tilde{V}\): if \(V\) is an agent’s decision \(D\), then \(\tilde{D}\) is its decision rule. The strategic structure — who observes what, who chooses what — lives at the mechanism layer, and reasoning about an agent becomes reasoning over a graph of mechanism variables connected by edges of strategic relevance.
Two things matter for us.
One, what Bob carries of Alice has to be a model of Alice’s mechanism, not of Alice’s output. A pure action-predictor short-circuits the recursion: any Alice who outsmarts the predictor falsifies it, and the graph picks up a cycle (MacDermott, Everitt, and Belardinelli 2023). This is a graphical criterion for what counts as a model-of-another-agent in the first place, and it rhymes with the observer-relative construction in Virgo et al. (2025) at internal models. Different formalisms, same shape of constraint — what Bob represents of Alice is a map to mechanism-space, not to action-space.
Two, one rung of mechanism-level modelling is native to the framework, but deeper recursion is not — there is no \(\tilde{\tilde{D}}\). Hammond et al. recover depth by unrolling a mechanized MAID into an extensive-form game, with the familiar exponential blow-up in tree size. In the budget frame below that exponential is the cost of depth in this particular vocabulary — nodes rather than, say, bits.
3 Self-models
3.1 World models containing self
A direct ML instantiation is Ha & Schmidhuber’s World Models (Ha and Schmidhuber 2018), where a recurrent latent model predicts both environment dynamics and the consequences of the agent’s own actions. The agent’s policy is trained inside this compressed dreamscape. The self here is a reduced-rank conditional — “what would my controller do, given this latent” — rather than an introspectable entity, but the “compression” part sounds well-posed.
The lineage continues through Dreamer (Hafner et al. 2024), MuZero (Schrittwieser et al. 2020), and the larger world-model-RL programme.
See also world models.
3.2 Active inference and the self as generative model
Active inference (Friston et al. 2017; Parr, Pezzulo, and Friston 2022) treats the agent as a generative model of its own sensorium, including its own actions. Free-energy minimization forces the self-model to be as compressed as is consistent with prediction — a direct rate-distortion pressure. The self here is a probabilistic model with the agent’s own observation-action trajectory as a latent.
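One standard way to see the compression pressure is the complexity-accuracy decomposition of variational free energy. The notation is an assumption of this sketch rather than the cited papers' own: \(o\) for observations, \(s\) for latent states, \(q(s)\) for the agent's approximate posterior, and \(p(o, s) = p(o \mid s)\,p(s)\) for its generative model.

\[
F[q] = \mathbb{E}_{q(s)}\big[\log q(s) - \log p(o, s)\big]
     = \underbrace{D_{\mathrm{KL}}\big(q(s)\,\|\,p(s)\big)}_{\text{complexity}}
     - \underbrace{\mathbb{E}_{q(s)}\big[\log p(o \mid s)\big]}_{\text{accuracy}}
\]

The KL term plays the role of a rate and the negative accuracy term that of a distortion, which is the sense in which minimizing \(F\) trades bits against prediction error. I read this as an analogy of functional form, not a claim that the two frameworks coincide.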
3.3 Self-modelling robots
A concrete line: Bongard, Zykov, and Lipson’s Resilient Machines Through Continuous Self-Modeling (Bongard, Zykov, and Lipson 2006) — a quadruped robot that learns a forward model of its own body, then uses it to plan locomotion; when a limb is damaged, the model updates, and the robot recovers. The self-model is an explicit, parameterized, low-rank dynamical system. See the follow-up (Kwiatkowski and Lipson 2019) for differentiable variants.
3.4 Attention Schema Theory
Graziano (Graziano 2013; Graziano et al. 2019) argues that consciousness is the brain’s (incomplete, schematic) model of its own attention. This is explicitly a reduced-rank model: the schema is coarser than the machinery it represents, because representing attention in full would require as much machinery as attention itself. Wilterson and Graziano have attempted to operationalize this in a neural-network agent (Wilterson and Graziano 2021).
3.5 Schmidhuber and reflective learners
Schmidhuber’s early work on self-referential neural networks (Schmidhuber 1993) and later Gödel machines (Schmidhuber 2003) formalizes learners that inspect and modify their own code, subject to provability constraints. The Gödel-machine construction is where the proof-theoretic aspect of self-modelling seems to cause grief: self-modification is gated by a proof that the modification improves expected utility.
3.6 Predictive coding and hierarchical self
Hierarchical predictive coding architectures (Rao and Ballard 1999; Clark 2013) include top-down predictions that span the whole sensory hierarchy, including proprioceptive and interoceptive signals — i.e., representations of the organism. See predictive coding, again. And harder.
3.7 Reflective LLM agents
A lighter ML lineage treats self-reflection as a control loop over the decoder rather than as a learnt world-model. Reflexion (Shinn et al. 2023) keeps a log of past behaviour, self-critiques, and revised plans in the context window, so the agent adjusts across episodes without any gradient update — verbal reinforcement learning rather than the usual kind. Language Agent Tree Search (Zhou et al. 2024) extends this to a Monte-Carlo tree search over action paths, with an LLM-powered value function and self-reflection pruning unpromising branches. The self-model here is an in-context transcript and the reflection is a prompt — coarse compared with active inference or Dreamer, and with no mechanism for the self-model to outlive the context. I include them as an existence proof for a reduced-rank self-model whose entire cost lives at inference time rather than in training, which is a distinct point on the budget frontier from everything above.
4 Reducing fidelity of representation
Several toolkits formalize “reduced rank”:
- Rate-distortion theory applied to cognition (Sims, Jacobs, and Knill 2012; Zénon, Solopchuk, and Pezzulo 2019; Lai and Gershman 2021): the cost of mental representation is an information-theoretic rate, the benefit is task performance, and optimal bounded agents sit on the rate-distortion frontier.
- Information bottleneck (Tishby, Pereira, and Bialek 2000; Alemi et al. 2019): compress inputs to a latent that is maximally informative about a downstream variable. When the downstream variable is “the other agent’s next action”, the bottleneck induces a reduced-rank other-model.
- Resource-rational analysis (Lieder and Griffiths 2020): agents are optimal given bounded compute; the bound is the reduction.
- Successor representations / features (Dayan 1993; Barreto et al. 2017): compressed future-prediction models that generalize well across reward functions. A kind of reduced-rank self-model of one’s own policy.
- Bounded rationality as a research programme (Simon 1955; S. J. Russell and Subramanian 1995; S. Russell 2016).
- Epsilon-machines / computational mechanics (Crutchfield and Young 1989; Shalizi and Crutchfield 2000): minimal-sufficient-statistic models of a process. The causal states are the minimum-rank predictor.
My research agent further recommends we look at theory of mind as inverse reinforcement learning (Jara-Ettinger 2019) and the recent graph-theoretic accounts of social abstraction (Stolk, Verhagen, and Toni 2016).
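For the rate-distortion item in the list above, a minimal sketch of the Blahut-Arimoto iteration, which computes one point on the rate-distortion frontier for a discrete source. The function name, argument layout, and fixed iteration count are my assumptions; `beta` is the usual trade-off slope, and sweeping it traces out the frontier on which the optimal bounded agent is supposed to sit.

```python
import math


def blahut_arimoto(p_x, dist, beta, iters=200):
    """One point on the rate-distortion curve via Blahut-Arimoto iteration.

    p_x  : source probabilities, length n (assumed strictly positive)
    dist : dist[x][y] is the distortion of reproducing x as y
    beta : trade-off slope (larger beta means lower distortion, higher rate)
    iters: number of iterations, at least 1
    Returns (rate_in_bits, expected_distortion).
    """
    n, m = len(p_x), len(dist[0])
    q_y = [1.0 / m] * m  # marginal over reproductions, initialised uniform
    for _ in range(iters):
        # Optimal channel for the current marginal: q(y|x) ∝ q(y) exp(-beta d(x, y))
        q_yx = []
        for x in range(n):
            row = [q_y[y] * math.exp(-beta * dist[x][y]) for y in range(m)]
            z = sum(row)
            q_yx.append([r / z for r in row])
        # Marginal induced by that channel
        q_y = [sum(p_x[x] * q_yx[x][y] for x in range(n)) for y in range(m)]
    # Mutual information between source and reproduction, in bits
    rate = sum(
        p_x[x] * q_yx[x][y] * math.log2(q_yx[x][y] / q_y[y])
        for x in range(n) for y in range(m) if q_yx[x][y] > 0
    )
    distortion = sum(
        p_x[x] * q_yx[x][y] * dist[x][y] for x in range(n) for y in range(m)
    )
    return rate, distortion
```

For a fair binary source under Hamming distortion the result can be checked against the textbook closed form \(R(D) = 1 - H_b(D)\), which is a cheap sanity test before pointing the same routine at anything resembling an other-model.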
5 Multi-agent self
The bicameral intuition — “the mind is many minds talking” — turns up across Minsky’s Society of Mind (Minsky 1986), Global Workspace Theory (Baars 1993; Dehaene 2014) and its neural-network and RL formalizations (VanRullen and Kanai 2021; Goyal et al. 2021), mixture of experts (Jacobs et al. 1991; Shazeer et al. 2017), and Dennett’s multiple drafts (Dennett 1993). They share an architectural claim: the “self” is what falls out of specialist modules competing for a low-capacity shared bottleneck, and that bottleneck is where the reduced-rank self- and other-models have to sit. Attention Schema Theory (above) fits here too. Fuller treatment at multi-agent self.
6 Strange loops
I Am a Strange Loop (Hofstadter 2008) is Hofstadter restating the central argument of Gödel, Escher, Bach decades later because, by his own account, everyone read the first book for the fugues and missed it. The claim: the “I” is not the thing that has a self-model; the “I” is the self-model. A brain past a threshold of representational capacity cannot avoid coining a symbol for the system it finds itself to be, for the same reason Gödel could make Principia Mathematica talk about Principia Mathematica — past that threshold, self-reference stops being optional. The symbol is coarse, necessarily; this is the opening observation again, the only faithful model of Bob being Bob. And it is causally load-bearing: actions get computed through the self-symbol, which must then re-represent the behaviour it just helped produce. A representation steering the thing it represents, around and around, is the loop; the level-crossing — a coarse abstraction pushing on the micro-machinery that implements it — is what earns the adjective.2
The relevance to this notebook’s title is that a homunculus theory owes us an account of who watches the watcher, and there are two ways to pay. The chain can terminate: each inner model is coarser than its owner, and a few rungs down the innermost agent has decayed into noise. Every construction in the other-modelling section above takes this exit, and the depth profiles below formalize it. Or the chain can close: followed far enough, the watcher turns out to be the watched, met again after a trip around the loop. Hofstadter’s bet is that selves take the second exit, and that closure is a fixed point rather than a paradox.
His standing physical model is video feedback: point a camera at its own monitor and the screen fills with nested corridors, spirals, reverberating structure not present in the room. What I had not previously registered is where the stability of those patterns comes from. Each pass through the loop re-represents the whole scene at the camera’s resolution, so the regress of frames-within-frames does not run to infinity; it runs to the grain of the sensor and ends in blur. The lossiness is not a defect of the loop. It is why the iteration settles at all.
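A toy 1-D version of the video-feedback loop, to show the regress bottoming out at the pixel grain. Everything about the setup is an assumption for illustration: the frame is a list of pixels, the monitor fills the middle half of the frame, and the camera renders the previous frame onto the monitor at half resolution by averaging pixel pairs.

```python
def camera_pass(prev, room):
    """One trip around a toy 1-D video-feedback loop.

    Assumptions (all mine, for illustration): the frame has len(room) pixels,
    with len(room) divisible by 4; the monitor fills the middle half of the
    frame; the camera shows the previous frame on the monitor at half
    resolution by averaging adjacent pixel pairs.
    """
    n = len(room)
    frame = list(room)  # everything outside the monitor is just the room
    for j in range(n // 2):
        frame[n // 4 + j] = 0.5 * (prev[2 * j] + prev[2 * j + 1])
    return frame


def settle(start, room, passes=60):
    """Iterate the loop a fixed number of times and return the last frame."""
    frame = list(start)
    for _ in range(passes):
        frame = camera_pass(frame, room)
    return frame


if __name__ == "__main__":
    room = [float(i % 7) for i in range(64)]
    a = settle([0.0] * 64, room)
    b = settle([100.0] * 64, room)
    # Largest pixel-wise gap between runs started from very different frames
    print(max(abs(x - y) for x, y in zip(a, b)))
```

Two runs started from very different frames end up at the same picture, because whatever the starting frame contributed is squeezed into the central pixels and then averaged away. The toy settles because shrinking plus averaging happens to be contractive in this setup; it is not evidence that lossy self-models converge in general.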
In this post’s vocabulary that inverts something I have been assuming. A full-rank self-model is precisely the case where closure goes wrong: a lossless self-representation supports diagonalization — the agent can always compute the predictor-defeating act, the same pathology that puts cycles into a mechanized causal graph when what gets modelled is output rather than mechanism (MacDermott, Everitt, and Belardinelli 2023) — and Löb’s theorem is the toll on demanding that the loop close provably (Yudkowsky and Herreshoff 2013). A reduced-rank self-model coarse-grains on every pass; distinctions wash out under iteration; the tower of self-in-self-in-self converges the way the video corridor converges, into blur. I have no theorem. Compression and contraction are different properties that happen to rhyme, and the rhyme may be all there is. But the budget frame below prices rank-reduction as a cost extracted from reflectivity, and Hofstadter’s picture runs the dependency the other way: reduced rank is what makes reflectivity converge. The homunculus can regulate the head only because it is smaller than the head.
There is also an accounting reading. A loop is a tower rolled into a ring: nominal depth infinite, but the same model is reused on every pass, so its rank is paid for once. Contrast unrolling a mechanized MAID into an extensive-form game, which pays for depth in nodes, exponentially (Hammond et al. 2023); the kin are SOM, which pays for an other-model by reusing the agent’s own policy (Raileanu et al. 2018), and the quine, which pays for self-reproduction by reusing code as data. Rolling the tower into a ring moves the cost from storage to convergence — how many passes, settling to what error — which is at least a quantity that rate-distortion-style tools could conceivably price.
The book also contains, almost in passing, the most direct pre-formal statement of the reduced-rank other-model I have encountered: knowing a person is hosting a low-resolution copy of their loop, each of us running the self-symbols of everyone we are close to at whatever fidelity contact has paid for, the copy differing from the original in rank rather than in kind. (The book is organized around grief — the surviving copy of his late wife’s loop is the worked example.) This adds a wrinkle to the Bob-and-Alice setup at the top: when the thing being compressed is itself a self, what Bob carries is a compressed loop — Alice’s self-symbol, not merely her policy — and two people modelling each other host a pair of mutually embedded loops. Program equilibrium makes the same shape exact: modal agents conditioning on one another’s source code close a two-agent loop by Löbian handshake (Bárász et al. 2014; Critch 2019) — a mutual strange loop with the poetry stripped out.
Where does this leave the formal programme? The self-consistency cluster below reads naturally as the project of closing Hofstadter’s loop exactly — fixed points over randomized answers, beliefs converging on themselves in the limit, proof-gated self-modification — with the Löbian obstacle as the price of exactness. And the mechanized-causal-graph pivot flagged in the callout up top inherits a sharp question. In that formalism a cycle is a symptom: it appears when Bob models Alice’s output instead of her mechanism. On Hofstadter’s account a cycle at the right level just is the self. Whether the mechanism layer can host a benign cycle — and what “benign” costs in rank — looks like the version of this post’s question that survives the pivot.
7 Self-referential agents and the proof-theoretic frontier
This is where reflectivity — the pressure for an agent’s beliefs about its own future beliefs to be consistent — becomes a first-class concern. If we want to prove things about minds that model themselves, we hit self-reference, and self-reference hits Löb’s theorem and friends if we’re not careful. As presaged, I am not super excited about the parts of this line of work that edge into unbounded compute.
Reflectivity is operationalized differently by each of the constructions below, and each buys self-consistency in a different coin — fixed-point iteration, market mixing, proof search — so its “cost” lives on a different axis depending on who we ask. I’ll treat its cost as real but construction-dependent.
- The self-consistency cluster. Reflective oracles (Fallenstein, Taylor, and Christiano 2015) (a fixed-point construction of probability distributions closed under self-reference), logical induction (Garrabrant et al. 2020) (a market-based learner whose beliefs about its own future beliefs converge), and the Löbian obstacle (Yudkowsky and Herreshoff 2013) (the negative result that an agent which licenses a successor only on a proof of the successor’s soundness cannot, by Löb’s theorem, endorse a successor whose proof system is as strong as its own): two constructions and one impossibility result attacking the same self-consistency problem in three different coins.
- Infra-Bayesianism (Kosoy and Appel 2020) replaces the prior — a single measure over environments — with a convex set of sub-probability measures (a credal set, in the imprecise-probability sense), and decides by minimax over the set. Two things this buys at once: non-realizability (the true environment need not be in the support of any single measure, so the Big-World problem from agency under bounded compute is no longer an outright contradiction), and a way to dodge some self-reference paradoxes by letting the agent’s beliefs about its own future actions be imprecise rather than committed to a specific distribution. Infra-Bayesian RL has regret bounds against non-realizable environments, and the infra-Bayesian physicalism extension (Kosoy 2021) targets embeddedness directly. In the budget frame it spends on reflectivity via imprecision, in a different coin from fixed-point iteration, market mixing, or proof search — and like the others, it spends nothing on bounded compute.
- Modal combat agents and program equilibrium (Bárász et al. 2014; Critch 2019): agents that condition on source-code-level models of each other; the setting admits bona fide cooperative equilibria in the one-shot prisoner’s dilemma.
- AIXI and approximations (Hutter 2005; Leike et al. 2016): a formally optimal agent whose self-model is implicit in the universal prior; computable approximations (e.g., AIXI-tl, MC-AIXI) buy tractability at the cost of reducing the rank of the prior.
MIRI’s Agent Foundations programme is the main hub for this line of work.
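A sketch of the simplest source-reading agent, for orientation. This is the syntactic-equality version of program equilibrium (cooperate only with an exact copy of oneself), not the Löbian proof-search agents of the modal-combat papers, and I am swapping it in because it fits in a few lines. The names `CLIQUE_BOT`, `DEFECT_BOT`, `load`, and `play` are hypothetical.

```python
CLIQUE_BOT = '''
def act(my_src, opp_src):
    # Cooperate only with an exact copy of myself.
    return "C" if opp_src == my_src else "D"
'''

DEFECT_BOT = '''
def act(my_src, opp_src):
    return "D"
'''


def load(src):
    """Turn a source string into its `act` function."""
    namespace = {}
    exec(src, namespace)
    return namespace["act"]


def play(src_a, src_b):
    """One-shot prisoner's dilemma in which each program reads the other's source."""
    return load(src_a)(src_a, src_b), load(src_b)(src_b, src_a)


if __name__ == "__main__":
    print(play(CLIQUE_BOT, CLIQUE_BOT))
    print(play(CLIQUE_BOT, DEFECT_BOT))
```

String equality is brittle: a stray space in the opponent's source turns cooperation into defection. As I understand it, that brittleness is what the modal agents address, by conditioning on what can be proved about the opponent rather than on how its source is spelled.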
8 Let’s attempt synthesis!
The three axes above — other-modelling, self-modelling, reduced-rank — are not independent knobs an architect sets arbitrarily. I’d argue they are three demands on the compute embedded in a single joint representation carried by a social agent — depth of recursion, rank of each sub-model, and reflectivity — competing for the same finite resources. This is of a piece with my general heuristic that we should always think about where to spend our compute to make sense of the AI landscape as it exists.
“Compute” here is a cover term: it lumps together bits of representation at rest, operations per decision, and data to fit the representation in the first place. These don’t fully interchange, and nobody defines a fungible unit that resolves the trade-offs cleanly AFAIK. For the argument that follows the weaker claim is enough: the three axes all draw on a shared pool, and each spends it through its own, largely unknown, return on investment. Rate-distortion theory gives a convex, non-increasing \(R(D)\) for rank (Sims, Jacobs, and Knill 2012), so each extra bit buys less fidelity than the last; the analogous curves for recursion depth and reflectivity are open problems.
On this reading the existing literature is a tour of corner solutions: each framework spends its compute on one or two axes and lets the rest go to zero.
- I-POMDPs and level-\(k\) / cognitive hierarchy: spend on depth; each level is a cartoon of the next. Humans clustering at \(k \in \{0,1,2\}\) is consistent with a small budget.
- Rate-distortion cognition, information bottleneck, successor features: spend on rank at fixed depth (\(k=1\)); reflectivity ignored.
- Reflective oracles, logical induction, Gödel machines, infra-Bayesianism: spend on reflectivity; other-structure kept simple enough that the construction survives.
- Strange loops (Hofstadter 2008): all-in on closure; depth amortized into the cycle; rank acknowledged — the self-symbol is coarse by construction — but never priced.
- Active inference: allocation across self- and world-model; depth and reflectivity implicit.
- LOLA / opponent shaping: \(k=2\); rank inherited from the opponent’s parameter count.
- World models (Ha/Schmidhuber, Dreamer, MuZero): storage in the latent dimension; \(k=1\); reflectivity absent.
- ToMnet and I-POMDP-Net (Han and Gmytrasiewicz 2019): learned reduced-rank other-models; \(k \le 2\); reflectivity absent.
- SOM (Raileanu et al. 2018): \(k=1\) with self-model and other-model sharing bits; reflectivity absent. A case of the axes interacting rather than competing.
- Reflexion / LATS (Shinn et al. 2023; Zhou et al. 2024): inference-time reflectivity over an in-context self-model; no learnt other-model; rank bounded by the context window.
8.1 Depth as a profile
The “depth” axis above reads as a single number \(k\) in level-\(k\) and I-POMDP treatments, but that collapses some structure worth pulling apart. Bob regulates a scene that contains Alice; there is a natural partial ordering on what kind of thing Alice can be inside Bob’s world-model.
- Type 0. Alice as noise: a draw from a distribution over behaviours. No agent-shaped variables in Bob’s ontology.
- Type 1. Alice as a stateful process: memory, non-Markovian dynamics, possibly a “type” Bob is inferring. Not intentional.
- Type 2. Alice as a belief-carrier whose model of Bob is type-0. She has goals, she is optimizing against a world-model, but that world-model contains Bob only as noise.
- Type 3. Alice as a belief-carrier whose Bob-model is type-1. She conditions on Bob’s type but does not treat him as intentional.
- Type 4. Alice as a belief-carrier whose Bob-model is itself belief-carrying. Recursion; the same classification applies at the next rung.
Scalar \(k\) is the uniform case: every rung up to depth \(k\) is type-4, and the chain terminates in type-0 at the bottom. What the scalar flattens is the profile — a sequence of (ontology type, rank) pairs along the chain, which in general need not be uniform. In particular it throws away asymmetric drop-off: “Alice thinks I am nearly noise” (type-2 terminal) and “Alice thinks I am stateful” (type-3 terminal) are both depth-2 in scalar terms, but play differently. In the first Bob has slack to act unpredictably; in the second Alice is already conditioning on his type, and only type-misrepresentation is exploitable. Budget-constrained agents probably allocate non-uniformly — most of their rank on the first rung or two, terminating early into type 0 or 1. The human plateau at \(k \in \{0,1,2\}\) is at least as consistent with this as with “humans can’t recurse further”, and the shallow-but-lopsided flavour of LLM-ToM failure (Ullman 2023) looks similar.
Two scalars suggest themselves in place of \(k\) and they answer different questions: chain length (the traditional quantity), and total rank spent on reflection, \(\sum_i R_i\) along the chain. The budget frame of this post is more naturally the second; the level-\(k\) literature has mostly used the first.3
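As a data-structure sketch of the profile and the two scalars: a profile is a list of rungs from the outermost model inward, each carrying an ontology type (0 to 4, indexing the five cases listed above from zero) and a rank. The class and function names are hypothetical, and the rank numbers are made up purely to show the two scalars coming apart.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rung:
    """One level of the chain: how the modelled agent is typed, and at what cost."""
    ontology_type: int  # 0..4, indexing the five cases from zero
    rank_bits: float    # hypothetical budget spent on this rung


def chain_length(profile: List[Rung]) -> int:
    """The traditional scalar: number of rungs in the chain."""
    return len(profile)


def total_rank(profile: List[Rung]) -> float:
    """The budget scalar: the sum of R_i along the chain."""
    return sum(r.rank_bits for r in profile)


# Two profiles that scalar depth cannot tell apart; the rank numbers are made up.
alice_sees_noise = [Rung(2, 12.0), Rung(0, 1.0)]  # "Alice thinks I am nearly noise"
alice_sees_state = [Rung(3, 12.0), Rung(1, 4.0)]  # "Alice thinks I am stateful"

if __name__ == "__main__":
    for profile in (alice_sees_noise, alice_sees_state):
        print(chain_length(profile), total_rank(profile))
```

Both example profiles have the same chain length, but they differ in terminal type and in total rank, which is the distinction the scalar \(k\) flattens.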
Both scalars, and the profile itself, assume the chain is a chain. A self-model is the case where the chain closes instead of terminating, and the profile picks up a cycle — which scalar \(k\) cannot write down at all. The strange-loop section above argues the cycle is not a degenerate case; it is where the “self” part of the programme lives.
Humans, LLMs, and the agents we would actually like to build do not sit at any corner. They are interior points, and the interior is where unwritten papers live.
9 Angles of attack
I got an LLM to ideate the ideas below for me, then lightly edited them for baseline sanity. The budget view suggests several angles of attack on what we actually care about, namely understanding how real cognition, human or machine, allocates its compute across world, self, and other. Two concrete ones, plus a brief aside on scaling the same calculus to collectives.
9.1 Finding \(\Phi\) in a transformer
The synthesis above says a bounded social agent carries a structured latent — call it \(\Phi\) — partitioned into world, self, each other, and reflective bookkeeping. The ToM-in-LLMs literature (Kosinski 2023; Ullman 2023) presently seems to treat these pieces as present-or-absent behavioural properties. If a frontier model has acquired any social competence under finite pretraining compute, there should be detectable structure in \(\Phi\) tracking these pieces — not necessarily a neat partition, because superposition lets a learner share parameters across sparse features, but some signature visible to the right probe.
That is a mechanistic-interpretability problem. Run sparse-autoencoder or dictionary-learning analysis on the residual stream during ToM-style tasks, and look for:
- a low-rank subspace whose ablation selectively breaks co-agent prediction without breaking world-prediction;
- a disjoint self-model subspace whose ablation breaks in-character behaviour but not factual recall;
- approximately no reflective-bookkeeping subspace in base models, and a small one in post-RLHF models — which are rewarded for consistency with a remembered “I”, and so should be under pressure to allocate bits to reflective bookkeeping that a base model is not.
It is also the careenium question (Hofstadter 2008) with an effect size attached — Hofstadter asks whether the causal story of a system is better told over coarse symbols (his simmballs, groan) than over the micro-dynamics implementing them; ablation asks the residual stream the same thing.
If the structure is not there, one of three things is wrong: the framework, our assumption that representational storage is actually binding for this class of model, or our belief that the pretraining signal actually rewards social prediction rather than a confound. Each possibility is informative. If the structure is there, we have an interpretability handle on social cognition specifically, rather than features in general.
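The ablation step in the probe list above can be sketched as a projection: remove from an activation vector its component in a candidate subspace and leave the rest untouched. This covers only that one step. The directions would have to come from a sparse autoencoder or probe that is not shown, and a real experiment would patch the result back into a forward pass and compare task metrics. The helper names are hypothetical and plain lists stand in for tensors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, tol=1e-12):
    """Gram-Schmidt: an orthonormal basis for the span of `directions`."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            c = dot(v, b)
            v = [vi - c * bi for vi, bi in zip(v, b)]
        norm = dot(v, v) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([vi / norm for vi in v])
    return basis


def ablate(activation, directions):
    """Project `activation` onto the orthogonal complement of the subspace."""
    out = list(activation)
    for b in orthonormalize(directions):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    candidate_subspace = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
    print(ablate(x, candidate_subspace))
```

A selective break (co-agent prediction degrades, world-prediction does not) is then a statement about downstream metrics after this projection, not about the projection itself.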
9.2 A value-of-reflectivity calculation
The MIRI tradition treats reflectivity as a correctness property: a reasoner that cannot consistently model its own reasoning has a bug, and the whole agent foundations edifice is a search for bug-free constructions. The budget view recasts it as an economic question: when is it worth spending bits on reflective machinery, and when is it better to skip it?
An idea in this space is that reflectivity pays in proportion to the reactivity of the environment — the degree to which other agents, or future versions of ourselves, condition on our internal commitments rather than just our past actions. Non-reactive one-shot stateless POMDPs: reflectivity is dispensable. Program-equilibrium and modal-combat settings (Bárász et al. 2014; Critch 2019): reflectivity must be high, because opponents are literally reading our source. Open-ended self-modification (Gödel-machine style): Löbian obstacles arise exactly because reflectivity has to be preserved while the substrate is mutating underneath it.
A testable version of this, without the formalism: build toy environments in each regime — a non-reactive bandit, a modal-combat tournament, a self-modification sandbox — and train the same architecture across all three. The conjecture predicts that reflective structure (by whatever detector the \(\Phi\) angle supplies) should emerge spontaneously in the second and third, and not the first. That is a scaling claim about spontaneous situational awareness under training pressure, and it falsifies in either direction.
Premakumar et al. (Premakumar et al. 2024) train networks with an auxiliary task of predicting their own internal activations and observe narrower weight distributions and reduced effective complexity — a “self-prediction-as-regularizer” effect. As a bonus the regularized networks are easier for other networks to model. If the effect replicates beyond their setting then reflectivity is paying for itself partly in rank elsewhere, which is the shape of trade-off the budget frame predicts. It also gives an interpretability-friendly story about alignment: agents incentivised to self-predict become more predictable to overseers, for free. The LessWrong writeup makes the alignment spin explicit.
9.3 Collectives are instrumentable
The same budget calculus should scale past individual minds. A firm, team, or agency is a social agent carrying compressed models of itself and its counterparties — self-model embodied in policy and narrative, other-models in competitor dossiers and customer segments, reflectivity in governance and audit. This complements the transformer angle above: organizations are, if anything, easier to instrument than LLMs — meeting notes, decision logs, communication graphs, and promotion criteria are legible in a way residual-stream activations are not. Mapping those proxies to rank, depth, and reflectivity is a separate project, but the framework commits to specific organizational over-allocations (governance theatre, regulatory capture, siloed product teams, academic insularity) being allocation failures of the same kind as the cognitive ones, which gives two independent scales at which to play with it.
11 Incoming
- Hofstadter and Dennett’s The Mind’s I as a pre-formal reading list.
- Tononi’s IIT as a measure on self-models, if one is feeling brave.
- Whether the “observer self” of contemplative traditions corresponds to an attention-schema-style reduced-rank self-model — this seems to be Metzinger’s read.
12 References
Footnotes
1. Alternatively it could be a full-rank model, which gets very weird and makes people worry about Löb’s theorem. That is not the main focus here.
2. Selfhood comes in sizes on this account, measured in hunekers, after the music critic who warned that small-souled men should not attempt a particular Chopin étude.
3. An observer-relative formulation of what it means for Bob to have a model at all is in Virgo et al. (2025), sitting alongside the internal-model-principle thread at internal models. Their construction gives a floor on what counts as modelling: one among several plausible candidates, with differing divergence and ontology commitments.

