Multi-Agent Reinforcement Learning

Distributed sensing, swarm sensing, adaptive social learning, multi-agent adaptation, iterated game theory with learning, etc.

2014-10-13 — 2026-05-27

Wherein the Formation of Agent Coalitions Is Treated as a Learning Problem, Classical Exponential Complexity Is Noted, and Cooperative Game Theory Concepts Including the Nucleolus Are Adopted.

agents
AI safety
bounded compute
collective knowledge
computers are awful together
distributed
economics
edge computing
extended self
game theory
incentive mechanisms
machine learning
networks
Figure 1

Placeholder for notes on multi-agent reinforcement learning (MARL).

MARL is a big topic, which I cannot hope to introduce here, but there are some sub-topics I might get around to.

1 Classic Collectives

COINs etc. TBD.

2 Coalitional MARL

A coalition game asks who teams up with whom and how the joint surplus is split. The classical answer leans on a characteristic function \(v(S)\), the worth of every coalition \(S\) evaluated in isolation. We might recall this from Shapley fairness. Beyond roughly 20 agents, tabulating \(v\) is already hopeless: finding optimal coalitions involves \(O(2^n)\) subproblems, each of which might be hard in itself (e.g. each \(v(S)\) is its own routing problem in cooperative logistics). Scaling to a thousand agents would mean evaluating \(2^{1000} \approx 10^{301}\) such problems.
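To make the blow-up concrete, here is a minimal sketch of exact Shapley values computed by brute-force enumeration of coalitions. The worth function `toy_v` is a made-up stand-in (coalition worth equals the square of its size), not anything from the literature cited here; the point is only that every player's value needs a pass over all subsets of the other players.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, v):
    """Exact Shapley values by enumerating every coalition (O(2^n) calls to v)."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Probability that exactly `size` specific players precede i
            # in a uniformly random ordering.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (v(s | {i}) - v(s))
        phi[i] = total
    return phi


def toy_v(coalition):
    """Hypothetical worth function: a coalition earns the square of its size."""
    return float(len(coalition) ** 2)


phi = shapley_values(["a", "b", "c"], toy_v)
n_coalitions_20 = 2 ** 20 - 1        # non-empty coalitions among 20 agents
digits_1000 = len(str(2 ** 1000))    # number of decimal digits in 2^1000
```

With three symmetric players the grand-coalition worth of 9 splits evenly. The same loop at \(n = 20\) already visits about a million coalitions per player, before counting the cost of each \(v(S)\) call.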

2.1 Decentralised partner selection and gain-sharing

MARL takes another path: let agents learn who to ally with and how to share the spoils, from experience, without ever computing a full \(v\). Learning an approximation to the characteristic function, rather than tabulating it, is what lets the agent count grow. Multi-Agent Gain Sharing (MAGS) trains transformer-based agents to negotiate fair gain shares directly in mixed-motive logistics environments, notionally scaling to >1000 concurrent agents (Sun2024Scalable?).

(Chalkiadakis2003Coordination?) treats unknown opponent capabilities as a POMDP: agents maintain Bayesian beliefs over opponent types and update them during coalition formation, which lets them anticipate (and sometimes exploit) irrational or static partners.
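The belief-update step can be sketched in a few lines. This is plain Bayes' rule over a discrete set of partner types, not the cited paper's algorithm; the type names, actions and likelihood table are all hypothetical.

```python
def update_belief(prior, likelihood, observed_action):
    """Bayes' rule: posterior(type) is proportional to prior(type) * P(action | type)."""
    unnorm = {t: prior[t] * likelihood[t][observed_action] for t in prior}
    z = sum(unnorm.values())
    if z == 0:
        raise ValueError("observed action has zero probability under every type")
    return {t: p / z for t, p in unnorm.items()}


# Hypothetical partner types and how likely each is to take each action.
likelihood = {
    "reliable": {"deliver": 0.9, "shirk": 0.1},
    "flaky": {"deliver": 0.4, "shirk": 0.6},
}

belief = {"reliable": 0.5, "flaky": 0.5}  # uniform prior over types
for action in ["deliver", "deliver", "shirk"]:
    belief = update_belief(belief, likelihood, action)
```

An agent deciding whether to join a coalition would then weight each candidate partner's expected contribution by the current `belief`, which is where the POMDP structure comes in.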

Both approaches share a failure mode called transductivity: agents learn to coordinate with the opponents they trained against, not with arbitrary new ones (which would be inductivity).

2.2 Model-based MARL for stable partitions

Traditionally, MARL is model-free. Model-based MARL — where agents learn (or are handed) a predictive model of environment dynamics — seems rarely used in coalition formation, but it buys us something the model-free version can’t: theoretical stability guarantees.

I found a worked example in radar-network multitarget tracking (Geng2023Coalition?). Several geographically separated radar stations, each with limited beam-time, jointly track many more moving targets than any one station can cover. Two or three radars pointing at the same target triangulate to a lower-variance estimate than any one alone; that is the cooperation gain. A coalition is the subset of radars agreeing to illuminate one target; the global partition assigns radars to targets. “Transferable utility” means the joint tracking-quality improvement is a scalar that can be split among coalition members. No central controller dictates assignments; each radar runs a model-based RL policy that learns the value of joining one coalition versus another, and the authors prove convergence to a Nash-stable partition, at which no single radar wants to defect.

The model-based part seems important. Nash-stability is a statement about counterfactual defections: for every alternative coalition a radar could switch to, its simulated payoff must be no better than staying put. Model-free MARL only has reliable value estimates along trajectories it actually sampled, so the off-policy counterfactuals are guesses, and the stability claim can’t be certified. A learned dynamics model lets each radar roll out counterfactual predictions along the lines of “what if I retargeted at \(k'\)?” without physically doing so, and check the deviation locally.

The pattern (model-constrained dynamics + transferable-utility coalitions + model-based RL → stability proof) probably generalises beyond radar.
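The Nash-stability condition itself is easy to state in code once a payoff model is available. The sketch below is not the cited paper's method: it swaps the learned dynamics model for a hand-written, hypothetical payoff (diminishing returns in coalition size, split equally among members) and simply checks every unilateral retargeting.

```python
from collections import Counter


def coalition_value(weight, size):
    """Hypothetical tracking gain with diminishing returns in the number of radars."""
    return weight * (1.0 - 0.5 ** size)


def payoff(radar, assignment, weights):
    """Equal split of the coalition's value among its members (an assumption)."""
    target = assignment[radar]
    size = Counter(assignment.values())[target]
    return coalition_value(weights[target], size) / size


def profitable_deviations(assignment, weights, tol=1e-12):
    """List the (radar, new_target) moves that strictly raise the mover's payoff."""
    moves = []
    for radar, current in assignment.items():
        base = payoff(radar, assignment, weights)
        for target in weights:
            if target == current:
                continue
            trial = {**assignment, radar: target}  # counterfactual retargeting
            if payoff(radar, trial, weights) > base + tol:
                moves.append((radar, target))
    return moves


def is_nash_stable(assignment, weights):
    """Nash-stable means no radar gains by switching coalition on its own."""
    return not profitable_deviations(assignment, weights)


weights = {"t1": 1.0, "t2": 1.0}                 # hypothetical target importances
clumped = {"r1": "t1", "r2": "t1", "r3": "t1"}   # everyone on one target
spread = {"r1": "t1", "r2": "t1", "r3": "t2"}    # one radar peels off
```

In this toy setting `clumped` is unstable, because any radar does better alone on the neglected target, while `spread` passes the check. The counterfactual `trial` assignments are exactly the quantities a model-free learner would have to guess.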

2.3 Credit assignment via the nucleolus

A naïve way of doing MARL is to force a grand coalition, in which all agents share one global reward, and rely on credit-assignment heuristics like COMA (Jakob Foerster, Farquhar, et al. 2018) to back out individual contributions. In asymmetric environments (e.g. few cooperators vs. many adversaries in SMAC) the grand coalition might be suboptimal. (Wang2025Nucleolus?) import the concept of the nucleolus from cooperative game theory into Q-learning and approximate it.

For an allocation \(x\), a coalition \(S\)’s excess is \(e(S, x) = v(S) - \sum_{i \in S} x_i\), the gap between what \(S\) could earn alone and what its members get under \(x\). Positive excess means there are grounds to defect. The nucleolus is the allocation that makes the most-aggrieved coalition as un-aggrieved as possible, then the next, and so on; that is, it lexicographically minimises the excesses sorted from largest to smallest. A nucleolus-Q operator with convergence guarantees lets agents autonomously fracture the grand coalition into smaller, task-specific sub-teams, each stable in the sense that no member wants to defect.
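A small numerical illustration of the excess and of the "most-aggrieved first" ordering may help. The three-player game and both allocations below are invented for the example, and the code only compares two candidate allocations; it does not solve for the nucleolus, nor does it reproduce the cited Q-learning operator.

```python
from itertools import combinations


def excesses(players, v, x):
    """e(S, x) = v(S) - sum_{i in S} x_i for every proper non-empty coalition S."""
    out = {}
    for size in range(1, len(players)):
        for subset in combinations(players, size):
            out[subset] = v[frozenset(subset)] - sum(x[i] for i in subset)
    return out


def complaint_vector(players, v, x):
    """Excesses sorted from the most aggrieved coalition to the least."""
    return sorted(excesses(players, v, x).values(), reverse=True)


players = ("a", "b", "c")
# Hypothetical game: a and b are strong together, c adds little.
v = {
    frozenset("a"): 0, frozenset("b"): 0, frozenset("c"): 0,
    frozenset("ab"): 8, frozenset("ac"): 2, frozenset("bc"): 2,
    frozenset("abc"): 10,
}
equal_split = {"a": 10 / 3, "b": 10 / 3, "c": 10 / 3}
skewed = {"a": 4.5, "b": 4.5, "c": 1.0}

theta_equal = complaint_vector(players, v, equal_split)
theta_skewed = complaint_vector(players, v, skewed)
# Python compares lists lexicographically, which is the ordering the nucleolus uses.
skewed_is_better = theta_skewed < theta_equal
```

Under the equal split, coalition `("a", "b")` has positive excess (it could earn 8 alone but receives about 6.67), so it has grounds to defect. Under the skewed allocation every excess is negative, and its sorted complaint vector is lexicographically smaller.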

2.4 Opponent shaping

See opponent shaping.

2.5 Noisy observation of partner intent

In a mixed-motive game where every agent is purely selfish, MARL notoriously finds a mutual-defection equilibrium and stays there. One fix is to hard-code prosocial weights: give each agent a reward of the form \(r_i + \alpha \sum_{j \neq i} r_j\), with \(\alpha\) tuned so cooperation pays. That works, apparently, but is not satisfying. It bakes the cooperation in rather than letting it be discovered; it tends to be fragile to defectors, doesn’t generalise off-distribution, and gives no principled story for who should cooperate with whom.
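To see what "tuned so cooperation pays" means, consider a one-shot Prisoner's Dilemma with the textbook payoffs \(T=5, R=3, P=1, S=0\) (these numbers are my choice for illustration, not taken from any cited paper). The sketch computes the shaped reward for the two-agent case and the value of \(\alpha\) above which cooperating becomes a dominant action.

```python
# Prisoner's dilemma payoffs with the standard ordering T > R > P > S.
T, R, P, S = 5.0, 3.0, 1.0, 0.0
# (my_action, their_action) -> (my raw reward, their raw reward)
REWARD = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}


def shaped(my_action, their_action, alpha):
    """Two-agent case of r_i + alpha * sum_{j != i} r_j."""
    mine, theirs = REWARD[(my_action, their_action)]
    return mine + alpha * theirs


def cooperation_dominates(alpha):
    """True if C is at least as good as D against either partner action."""
    return all(shaped("C", a, alpha) >= shaped("D", a, alpha) for a in ("C", "D"))


# Smallest alpha making C a best response to each partner action.
threshold_vs_c = (T - R) / (R - S)  # from R + alpha*R >= T + alpha*S
threshold_vs_d = (P - S) / (T - P)  # from S + alpha*T >= P + alpha*P
```

For these payoffs the binding constraint is the response to a cooperator: \(\alpha \ge 2/3\). Any smaller weight leaves defection as the best reply to cooperation, which is one way to see why a single hand-tuned \(\alpha\) is brittle across games with different payoffs.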

The Randomized Uncertain Social Preferences (RUSP) framework (Baker 2020) addresses this by training agents across a distribution of prosocial weights, where each agent only gets a noisy observation of its own weights and no information about others’. At the start of each episode, we sample a reward-transformation matrix \(W\), where \(W_{ij}\) is the weight agent \(i\) puts on agent \(j\)’s reward, and set agent \(i\)’s shaped reward to \(\sum_j W_{ij} r_j\). Each agent observes only a noisy version of its own row (so it isn’t sure how much it should care about each partner), and nothing about other agents’ rows (so it can’t tell which partners care about it). Training is across the whole distribution of \(W\)s, not a single fixed one.
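The reward transformation and the information structure can be sketched as follows. The sampling distribution for \(W\) (uniform weights, rows normalised to sum to one) and the Gaussian observation noise are assumptions made for this illustration; the paper's actual choices are not reproduced here.

```python
import random


def sample_w(n, rng):
    """Hypothetical sampler: non-negative weights, each row normalised to sum to 1."""
    rows = []
    for _ in range(n):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        rows.append([w / total for w in raw])
    return rows


def shaped_rewards(w, r):
    """Agent i's shaped reward is sum_j W[i][j] * r[j]."""
    return [sum(w_ij * r_j for w_ij, r_j in zip(row, r)) for row in w]


def noisy_own_row(w, i, rng, sigma=0.1):
    """What agent i observes: its own row plus Gaussian noise, and nothing else."""
    return [w_ij + rng.gauss(0.0, sigma) for w_ij in w[i]]


rng = random.Random(0)
n_agents = 3
w = sample_w(n_agents, rng)                 # resampled at the start of each episode
raw_rewards = [1.0, 0.0, -1.0]              # raw environment rewards r_j
shaped = shaped_rewards(w, raw_rewards)
observations = [noisy_own_row(w, i, rng) for i in range(n_agents)]
```

Each policy would receive `observations[i]` alongside its usual inputs and be trained on `shaped[i]`. Because no agent sees any other row of `w`, the only evidence about how much a partner cares is that partner's behaviour.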

Because partner intent isn’t observable, the only way an agent can do well in expectation is to read intent from behaviour — cooperate cautiously, escalate cooperation when reciprocated, punish when defected against. The reported emergent phenomena are: direct reciprocity (tit-for-tat-ish responses to recent behaviour), indirect reputation tracking (treating an agent based on how it treated third parties), and stable in-episode team formation (subsets of agents settling into cooperative pairs while the rest go it alone). None of these are programmed in; they’re the policies that generalise across the \(W\)-distribution. Unlike fixed-prosocial-weights training, the policies reportedly transfer to held-out partner mixes without retraining.

2.6 Open-source game theory

MARL is one possible operationalisation of open-source game theory, in which agents exchange policy source code (or some interpretable proxy) before acting and cooperate by mutual verification.

3 References

Albrecht, Christianos, and Schäfer. 2024. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches.
Amato. 2024. “An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning.”
Bachrach, Everett, Hughes, et al. 2020. “Negotiating Team Formation Using Deep Reinforcement Learning.” Artificial Intelligence.
Baker. 2020. “Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences.” In.
Barasz, Christiano, Fallenstein, et al. 2014. “Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic.”
Bieniawski, and Wolpert. 2004. “Adaptive, Distributed Control of Constrained Multi-Agent Systems.” In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 3.
Cao, Lazaridou, Lanctot, et al. 2018. “Emergent Communication Through Negotiation.”
Chalkiadakis. 2007. “A Bayesian Approach to Multiagent Reinforcement Learning and Coalition Formation Under Uncertainty.”
Cooper, Oesterheld, and Conitzer. 2024. “Characterising Simulation-Based Program Equilibria.”
Critch. 2016. “Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents.”
———. 2017. “Toward Negotiable Reinforcement Learning: Shifting Priorities in Pareto Optimal Sequential Decision-Making.”
Dong, Li, Yang, et al. 2024. “Egoism, Utilitarianism and Egalitarianism in Multi-Agent Reinforcement Learning.” Neural Networks.
Du, and Ding. 2021. “A Survey on Multi-Agent Deep Reinforcement Learning: From the Perspective of Challenges and Applications.” Artificial Intelligence Review.
Duque, Aghajohari, Cooijmans, et al. 2025. “Advantage Alignment Algorithms.” In.
Fickinger, Zhuang, Hadfield-Menell, et al. 2020. “Multi-Principal Assistance Games.”
Foerster, J. 2018. “Deep Multi-Agent Reinforcement Learning.”
Foerster, Jakob, Chen, Al-Shedivat, et al. 2018. “Learning with Opponent-Learning Awareness.” In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. AAMAS ’18.
Foerster, Jakob, Farquhar, Afouras, et al. 2018. “Counterfactual Multi-Agent Policy Gradients.” Proceedings of the AAAI Conference on Artificial Intelligence.
Franzmeyer, Malinowski, and Henriques. 2021. “Learning Altruistic Behaviours in Reinforcement Learning Without External Rewards.” In.
Gronauer, and Diepold. 2022. “Multi-Agent Deep Reinforcement Learning: A Survey.” Artificial Intelligence Review.
Hadfield-Menell, Dragan, Abbeel, et al. 2016. “Cooperative Inverse Reinforcement Learning.” In Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16.
Ha, and Tang. 2022. “Collective Intelligence for Deep Learning: A Survey of Recent Developments.” Collective Intelligence.
Havrylov, and Titov. 2017. “Emergence of Language with Multi-Agent Games: Learning to Communicate with Sequences of Symbols.”
Hernandez-Leal, Kartal, and Taylor. 2019. “A Survey and Critique of Multiagent Deep Reinforcement Learning.” Autonomous Agents and Multi-Agent Systems.
Jaques, Lazaridou, Hughes, et al. 2019. “Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning.” In Proceedings of the 36th International Conference on Machine Learning.
Jiang, Su, and Lu. 2024. “Fully Decentralized Cooperative Multi-Agent Reinforcement Learning: A Survey.”
Laidlaw, Bronstein, Guo, et al. 2025. “AssistanceZero: Scalably Solving Assistance Games.” In Workshop on Bidirectional Human↔︎AI Alignment.
Lee, Leibo, An, et al. 2022. “Importance of Prefrontal Meta Control in Human-Like Reinforcement Learning.” Frontiers in Computational Neuroscience.
Li, Cao, Qiao, et al. 2025. “Nucleolus Credit Assignment for Effective Coalitions in Multi-Agent Reinforcement Learning.” In.
Lin, Zhu, Li, et al. 2025. “Policy-Conditioned Policies for Multi-Agent Task Solving.”
Lowe, Foerster, Boureau, et al. 2019. “On the Pitfalls of Measuring Emergent Communication.”
Lowe, Wu, Tamar, et al. 2020. “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments.”
Meulemans, Kobayashi, Oswald, et al. 2024. “Multi-Agent Cooperation Through Learning-Aware Policy Gradients.” In.
Meulemans, Nasser, Wołczyk, et al. 2025. “Embedded Universal Predictive Intelligence: A Coherent Framework for Multi-Agent Learning.”
Ohsawa. 2021. “Unbiased Self-Play.” arXiv:2106.03007 [Cs, Econ, Stat].
Oroojlooy, and Hajinezhad. 2023. “A Review of Cooperative Multi-Agent Deep Reinforcement Learning.” Applied Intelligence.
Pant, and Yu. 2026. “Coopetition-Gym V1: A Formally Grounded Platform for Mixed-Motive Multi-Agent Reinforcement Learning Under Strategic Coopetition.”
Peysakhovich, and Lerer. 2017. “Prosocial Learning Agents Solve Generalized Stag Hunts Better Than Selfish Ones.”
Rădulescu. 2021. “Decision Making in Multi-Objective Multi-Agent Systems: A Utility-Based Perspective.”
Sharma, Fernandez, Zaroukian, et al. 2021. “Survey of Recent Multi-Agent Reinforcement Learning Algorithms Utilizing Centralized Training.” In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III.
Sistla, and Kleiman-Weiner. 2025. “Evaluating LLMs in Open-Source Games.”
Suarez. 2024. “Neural MMO: Massively Multiagent Simulation and Learning.”
Suárez, Isola, Choe, et al. 2023. “Neural MMO 2.0: A Massively Multi-Task Addition to Massively Multi-Agent Learning.”
Tennant, Hailes, and Musolesi. 2023. “Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning.” In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence.
Tumer, and Wolpert. 2012. Collectives and the Design of Complex Systems.
Weis, Wołczyk, Nasser, et al. 2026. “Multi-Agent Cooperation Through in-Context Co-Player Inference.”
Wolpert, David H. 2006a. “Advances in Distributed Optimization Using Probability Collectives.” Advances in Complex Systems.
———. 2006b. “Information Theory — The Bridge Connecting Bounded Rational Game Theory and Statistical Physics.” In Complex Engineered Systems. Understanding Complex Systems.
Wolpert, David H, Bieniawski, and Rajnarayan. 2011. “Probability Collectives in Optimization.”
Wolpert, David H, and Lawson. 2002. “Designing Agent Collectives for Systems with Markovian Dynamics.” In.
Wolpert, David H., and Tumer. 1999. “An Introduction to Collective Intelligence.” arXiv:cs/9908014.
Wolpert, David H, Wheeler, and Tumer. 1999. “General Principles of Learning-Based Multi-Agent Systems.” In.
———. 2000. “Collective Intelligence for Control of Distributed Dynamical Systems.” EPL (Europhysics Letters).
Wulfmeier, Ondruska, and Posner. 2016. “Maximum Entropy Deep Inverse Reinforcement Learning.”
Yang, Luo, Li, et al. 2018. “Mean Field Multi-Agent Reinforcement Learning.” In Proceedings of the 35th International Conference on Machine Learning.