Learning with theory of mind
What collective learning looks like from the individual agent’s perspective
2025-05-03 — 2025-05-05
Wherein agents are taught to model and influence other agents’ learning, and opponent‑shaping techniques in reinforcement learning are examined as means to induce targeted updates in an opponent’s policy.
This note is about learning agents in a multi-agent system that account for, and possibly exploit, the fact that other agents are learning too. This is one way of formalising theory of mind — modelling not just what another agent believes, but how it updates those beliefs.
This framing connects to opponent shaping (the RL formalisation), AI agency and AI alignment, institutional alignment, collective action problems, incentive mechanisms, iterated game theory, and even what causes a “self” to be a meaningful unit of analysis.
I do not think this is likely to be a sufficient explanation of agentic cognition. It seems more useful for formalising local dynamics in a system with a regular configuration, such as a market or a personal relationship. Does it help us formalise the dynamics of open systems with fuzzy boundaries?
1 Asymmetric: learning to make your opponent learn
I was first switched on to this idea in the asymmetric form by Dezfouli, Nock, and Dayan (2020), which describes a way to learn to make your opponent learn — actively choosing actions to induce specific updates in the other agent’s policy.
The key asymmetry: one agent has a model of the other’s learning rule and optimises over it, while the other agent is “naive” (or at least unaware of being shaped). This is distinct from opponent shaping (below), where both agents may be aware of the shaping dynamic.
The symmetric form, in which both agents sit in the same learning loop and model each other, is also interesting but has received less attention within this asymmetric-influence framing.
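To make the asymmetric setup concrete, here is a minimal sketch (my own toy construction, not the setup from the paper): a naive opponent runs gradient ascent on its own payoff, while the shaper simulates that update rule and picks the action mixture that maximises its *post-update* payoff. The payoff matrices and learning rate are illustrative.

```python
import numpy as np

# 2x2 game payoffs (rows = shaper's action, cols = opponent's action).
# These particular numbers are illustrative, not from Dezfouli et al.
A = np.array([[3.0, 0.0], [5.0, 1.0]])  # shaper's payoffs
B = np.array([[3.0, 5.0], [0.0, 1.0]])  # opponent's payoffs

def opponent_payoff(p, q):
    """Opponent's expected payoff when the shaper plays action 0 w.p. p
    and the opponent plays action 0 w.p. q."""
    ps = np.array([p, 1 - p])
    qs = np.array([q, 1 - q])
    return ps @ B @ qs

def naive_update(p, q, lr=0.5, eps=1e-5):
    """One gradient-ascent step by the naive opponent on its own payoff
    (finite-difference gradient, projected back onto [0, 1])."""
    g = (opponent_payoff(p, q + eps) - opponent_payoff(p, q - eps)) / (2 * eps)
    return float(np.clip(q + lr * g, 0.0, 1.0))

def shaper_best_action(q, candidates=np.linspace(0, 1, 101)):
    """The shaper optimises over the opponent's known update rule: pick the
    mixture p that maximises the shaper's payoff *after* the opponent's
    anticipated learning step."""
    def post_update_value(p):
        q_next = naive_update(p, q)
        ps = np.array([p, 1 - p])
        qs = np.array([q_next, 1 - q_next])
        return ps @ A @ qs
    return float(max(candidates, key=post_update_value))

q = 0.5  # opponent starts with a uniform policy
for _ in range(20):
    p = shaper_best_action(q)
    q = naive_update(p, q)
```

The key move is in `shaper_best_action`: the shaper evaluates candidates against the opponent's *next* policy, not its current one, which is exactly the asymmetry described above.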
2 Opponent shaping
Opponent shaping is a formalism at the intersection of reinforcement learning and iterated game theory, in which agents influence each other's learning by using models of the other agents' update rules. The formalism has evolved from computationally expensive higher-order gradient methods (LOLA) to tractable first-order approaches (Advantage Alignment).
This has its own notebook with a full treatment.
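The flavour of the LOLA-style look-ahead can be caricatured in a few lines. This is my own first-order, finite-difference sketch on a one-shot matrix game (the original results use iterated games with memory-one policies), so the numbers and simplifications are not LOLA proper: each agent differentiates its value through the opponent's anticipated naive gradient step.

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

# One-shot prisoner's-dilemma-style payoffs (illustrative numbers).
# theta parametrises each agent's probability of cooperating via a sigmoid.
R1 = np.array([[-1.0, -3.0], [0.0, -2.0]])  # agent 1's payoffs
R2 = R1.T                                    # symmetric game

def value(t1, t2, R):
    p1 = np.array([sig(t1), 1 - sig(t1)])
    p2 = np.array([sig(t2), 1 - sig(t2)])
    return p1 @ R @ p2

def grad(f, x, eps=1e-5):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def lola_style_step(theta1, theta2, lr=1.0):
    """Each agent ascends the value of the game *after* the opponent's
    anticipated naive gradient step -- a look-ahead through the opponent's
    update rule, rather than through its current policy."""
    def lookahead_v1(t1):
        t2_next = theta2 + lr * grad(lambda t2: value(t1, t2, R2), theta2)
        return value(t1, t2_next, R1)
    def lookahead_v2(t2):
        t1_next = theta1 + lr * grad(lambda t1: value(t1, t2, R1), theta1)
        return value(t1_next, t2, R2)
    return (theta1 + lr * grad(lookahead_v1, theta1),
            theta2 + lr * grad(lookahead_v2, theta2))

t1 = t2 = 0.0
for _ in range(100):
    t1, t2 = lola_style_step(t1, t2)
coop1, coop2 = sig(t1), sig(t2)
```

The difference from naive simultaneous gradient ascent is entirely in the `lookahead_*` closures: the opponent's next parameters are a function of my parameters, so shaping pressure enters through that dependence.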
3 Assistance games and cooperative value learning
When the asymmetry is cooperative rather than adversarial — one agent trying to help another whose goals are unknown — the natural framing is assistance games, a.k.a. cooperative inverse reinforcement learning. See the value learning page for a full treatment of inferring reward functions from behaviour.
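The inference core of an assistance game can be sketched in miniature (a toy of my own devising, not the CIRL formalism in full): the assistant holds a belief over candidate human reward functions and updates it by Bayes' rule from observed human choices, under a softmax-rational model of the human. All numbers are illustrative.

```python
import numpy as np

# Two candidate reward hypotheses over two human actions.
# reward_table[r, a] = reward of action a under hypothesis r.
actions = ["make tea", "make coffee"]
reward_table = np.array([
    [1.0, 0.0],   # hypothesis 0: the human prefers tea
    [0.0, 1.0],   # hypothesis 1: the human prefers coffee
])
beta = 2.0  # rationality temperature of the human model

def likelihood(action_idx):
    """P(action | reward hypothesis) under a Boltzmann-rational human."""
    logits = beta * reward_table
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs[:, action_idx]

def observe(belief, action_idx):
    """Bayes update of the assistant's belief over reward hypotheses."""
    posterior = belief * likelihood(action_idx)
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])
belief = observe(belief, 0)  # the human makes tea
```

The assistance-game twist, not shown here, is that the human knows it is being observed and can act pedagogically, which makes the interaction a genuine game rather than passive inference.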
Related work on building machines that learn and think with people: Collins et al. (2024) frames human-AI interaction as a collaborative theory-of-mind problem, and Ying et al. (2025) develops language-augmented Bayesian theory of mind for understanding epistemic language.
4 Belief-based learning (without theory of learning)
A simpler version of the problem: modelling what the other agent believes, without modelling how it learns.
ReBeL: A general game-playing AI bot that excels at poker and more is a good example:
Recursive Belief-based Learning (ReBeL) is a general RL+Search algorithm that can work in all two-player zero-sum games, including imperfect-information games. ReBeL builds on the RL+Search algorithms like AlphaZero that have proved successful in perfect-information games. Unlike those previous AIs, however, ReBeL makes decisions by factoring in the probability distribution of different beliefs each player might have about the current state of the game, which we call a public belief state (PBS).
By accounting for the beliefs of each player, ReBeL treats imperfect-information games akin to perfect-information games. The gap between this and full theory-of-mind is precisely the gap between modelling what your opponent thinks and modelling how your opponent updates.
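The belief-update half of this idea is just Bayes' rule over private states given public actions, assuming the acting player's policy is common knowledge. A toy version (hands, priors, and policy probabilities are all made up for illustration; this is not ReBeL's actual PBS machinery, which also tracks value functions):

```python
import numpy as np

# Opponent holds one of three private "hands"; uniform prior.
hands = ["weak", "medium", "strong"]
prior = np.array([1/3, 1/3, 1/3])

# P(action | hand): rows = hands, columns = actions (fold, call, raise).
policy = np.array([
    [0.70, 0.25, 0.05],   # weak hands mostly fold
    [0.20, 0.60, 0.20],
    [0.05, 0.25, 0.70],   # strong hands mostly raise
])

def update_belief(belief, action_idx):
    """Bayes rule: posterior over private hands after a public action,
    assuming the opponent's policy is common knowledge."""
    posterior = belief * policy[:, action_idx]
    return posterior / posterior.sum()

belief = update_belief(prior, 2)   # observe a raise
```

This is "what does my opponent believe/hold" inference only; nothing in it models how the opponent's policy itself changes with experience, which is the gap flagged above.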
5 Cooperative AI as a research agenda
The Cooperative AI Foundation is developing the broader research programme. See also Full-Stack Alignment, which frames the problem as co-aligning AI and institutions with thick models of value (Klingefjord, Lowe, and Edelman 2024).
