Learning with theory of mind
What collective learning looks like from the individual agent’s perspective
2025-05-03 — 2025-05-05
Wherein agents are taught to model and influence other agents’ learning, and opponent‑shaping techniques in reinforcement learning are examined as means to induce targeted updates in an opponent’s policy.
This note is about learning agents in a multi-agent system that account for, and possibly exploit, the fact that other agents are learning too. This is one way of formalising theory of mind — modelling not just what another agent believes, but how it updates those beliefs.
This framing connects to opponent shaping (the RL formalisation), AI agency and AI alignment, institutional alignment, collective action problems, incentive mechanisms, iterated game theory, and even what causes a “self” to be a meaningful unit of analysis.
I do not think this is likely to be a sufficient explanation of agentic cognition. It seems more useful for formalising local dynamics in a system with a regular configuration, such as a market or a personal relationship. Does it help us formalise the dynamics of open systems with fuzzy boundaries?
1 Asymmetric: learning to make your opponent learn
I was first switched on to this idea in the asymmetric form by Dezfouli, Nock, and Dayan (2020), which describes a way to learn to make your opponent learn — actively choosing actions to induce specific updates in the other agent’s policy.
The key asymmetry: one agent has a model of the other’s learning rule and optimises over it, while the other agent is “naive” (or at least unaware of being shaped). This is distinct from opponent shaping (below), where both agents may be aware of the shaping dynamic.
The symmetric form, in which both agents sit in the same learning loop and model each other, is also interesting but has received less attention within this asymmetric-influence framing.
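To make the asymmetric setup concrete, here is a minimal sketch (my own toy construction, not the setup from the paper): a naive opponent runs gradient ascent on its own payoff, while the shaper simulates that update rule and picks the action mixture that maximises its *post-update* payoff. The payoff matrices and learning rate are illustrative.

```python
import numpy as np

# 2x2 game payoffs (rows = shaper's action, cols = opponent's action).
# These particular numbers are illustrative, not from Dezfouli et al.
A = np.array([[3.0, 0.0], [5.0, 1.0]])  # shaper's payoffs
B = np.array([[3.0, 5.0], [0.0, 1.0]])  # opponent's payoffs

def opponent_payoff(p, q):
    """Opponent's expected payoff when the shaper plays action 0 w.p. p
    and the opponent plays action 0 w.p. q."""
    ps = np.array([p, 1 - p])
    qs = np.array([q, 1 - q])
    return ps @ B @ qs

def naive_update(p, q, lr=0.5, eps=1e-5):
    """One gradient-ascent step by the naive opponent on its own payoff
    (finite-difference gradient, projected back onto [0, 1])."""
    g = (opponent_payoff(p, q + eps) - opponent_payoff(p, q - eps)) / (2 * eps)
    return float(np.clip(q + lr * g, 0.0, 1.0))

def shaper_best_action(q, candidates=np.linspace(0, 1, 101)):
    """The shaper optimises over the opponent's known update rule: pick the
    mixture p that maximises the shaper's payoff *after* the opponent's
    anticipated learning step."""
    def post_update_value(p):
        q_next = naive_update(p, q)
        ps = np.array([p, 1 - p])
        qs = np.array([q_next, 1 - q_next])
        return ps @ A @ qs
    return float(max(candidates, key=post_update_value))

q = 0.5  # opponent starts with a uniform policy
for _ in range(20):
    p = shaper_best_action(q)
    q = naive_update(p, q)
```

The key move is in `shaper_best_action`: the shaper evaluates candidates against the opponent's *next* policy, not its current one, which is exactly the asymmetry described above.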
2 Opponent shaping
Opponent shaping is a formalism at the intersection of reinforcement learning and iterated game theory, in which agents influence each other's learning by using models of the other agents' update rules. The formalism has evolved from computationally expensive higher-order gradient methods (LOLA) to tractable first-order approaches (Advantage Alignment).
This has its own notebook with a full treatment.
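The flavour of the LOLA-style look-ahead can be caricatured in a few lines. This is my own first-order, finite-difference sketch on a one-shot matrix game (the original results use iterated games with memory-one policies), so the numbers and simplifications are not LOLA proper: each agent differentiates its value through the opponent's anticipated naive gradient step.

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

# One-shot prisoner's-dilemma-style payoffs (illustrative numbers).
# theta parametrises each agent's probability of cooperating via a sigmoid.
R1 = np.array([[-1.0, -3.0], [0.0, -2.0]])  # agent 1's payoffs
R2 = R1.T                                    # symmetric game

def value(t1, t2, R):
    p1 = np.array([sig(t1), 1 - sig(t1)])
    p2 = np.array([sig(t2), 1 - sig(t2)])
    return p1 @ R @ p2

def grad(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def lola_style_step(theta1, theta2, lr=1.0):
    """Each agent ascends the value of the game *after* the opponent's
    anticipated naive gradient step -- a look-ahead through the opponent's
    update rule, rather than through its current policy."""
    def lookahead_v1(t1):
        t2_next = theta2 + lr * grad(lambda t2: value(t1, t2, R2), theta2)
        return value(t1, t2_next, R1)
    def lookahead_v2(t2):
        t1_next = theta1 + lr * grad(lambda t1: value(t1, t2, R1), theta1)
        return value(t1_next, t2, R2)
    return (theta1 + lr * grad(lookahead_v1, theta1),
            theta2 + lr * grad(lookahead_v2, theta2))

t1 = t2 = 0.0
for _ in range(100):
    t1, t2 = lola_style_step(t1, t2)
coop1, coop2 = sig(t1), sig(t2)
```

The difference from naive simultaneous gradient ascent is entirely in the `lookahead_*` closures: the opponent's next parameters are a function of my parameters, so shaping pressure enters through that dependence.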
3 Assistance games and cooperative value learning
When the asymmetry is cooperative rather than adversarial — one agent trying to help another whose goals are unknown — the natural framing is assistance games, a.k.a. cooperative inverse reinforcement learning. See the value learning page for a full treatment of inferring reward functions from behaviour.
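The inference core of an assistance game can be sketched in miniature (a toy of my own devising, not the CIRL formalism in full): the assistant holds a belief over candidate human reward functions and updates it by Bayes' rule from observed human choices, under a softmax-rational model of the human. All numbers are illustrative.

```python
import numpy as np

# Two candidate reward hypotheses over two human actions.
# reward_table[r, a] = reward of action a under hypothesis r.
actions = ["make tea", "make coffee"]
reward_table = np.array([
    [1.0, 0.0],   # hypothesis 0: the human prefers tea
    [0.0, 1.0],   # hypothesis 1: the human prefers coffee
])
beta = 2.0  # rationality temperature of the human model

def likelihood(action_idx):
    """P(action | reward hypothesis) under a Boltzmann-rational human."""
    logits = beta * reward_table
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs[:, action_idx]

def observe(belief, action_idx):
    """Bayes update of the assistant's belief over reward hypotheses."""
    posterior = belief * likelihood(action_idx)
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])
belief = observe(belief, 0)  # the human makes tea
```

The assistance-game twist, not shown here, is that the human knows it is being observed and can act pedagogically, which makes the interaction a genuine game rather than passive inference.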
Related work on building machines that learn and think with people: Collins et al. (2024) frames human-AI interaction as a collaborative theory-of-mind problem, and Ying et al. (2025) develops language-augmented Bayesian theory of mind for understanding epistemic language.
4 Belief-based learning (without theory of learning)
A simpler version of the problem: modelling what the other agent believes, without modelling how it learns.
ReBeL: A general game-playing AI bot that excels at poker and more is a good example:
Recursive Belief-based Learning (ReBeL) is a general RL+Search algorithm that can work in all two-player zero-sum games, including imperfect-information games. ReBeL builds on the RL+Search algorithms like AlphaZero that have proved successful in perfect-information games. Unlike those previous AIs, however, ReBeL makes decisions by factoring in the probability distribution of different beliefs each player might have about the current state of the game, which we call a public belief state (PBS).
By accounting for the beliefs of each player, ReBeL treats imperfect-information games akin to perfect-information games. The gap between this and full theory-of-mind is precisely the gap between modelling what your opponent thinks and modelling how your opponent updates.
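The belief-update half of this idea is just Bayes' rule over private states given public actions, assuming the acting player's policy is common knowledge. A toy version (hands, priors, and policy probabilities are all made up for illustration; this is not ReBeL's actual PBS machinery, which also tracks value functions):

```python
import numpy as np

# Opponent holds one of three private "hands"; uniform prior.
hands = ["weak", "medium", "strong"]
prior = np.array([1/3, 1/3, 1/3])

# P(action | hand): rows = hands, columns = actions (fold, call, raise).
policy = np.array([
    [0.70, 0.25, 0.05],   # weak hands mostly fold
    [0.20, 0.60, 0.20],
    [0.05, 0.25, 0.70],   # strong hands mostly raise
])

def update_belief(belief, action_idx):
    """Bayes rule: posterior over private hands after a public action,
    assuming the opponent's policy is common knowledge."""
    posterior = belief * policy[:, action_idx]
    return posterior / posterior.sum()

belief = update_belief(prior, 2)   # observe a raise
```

This is "what does my opponent believe/hold" inference only; nothing in it models how the opponent's policy itself changes with experience, which is the gap flagged above.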
5 Cooperative AI as a research agenda
The Cooperative AI Foundation is developing the broader research programme. See also Full-Stack Alignment, which frames the problem as co-aligning AI and institutions with thick models of value (Klingefjord, Lowe, and Edelman 2024).
