Multi-Agent Reinforcement Learning
Distributed sensing, swarm sensing, adaptive social learning, multi-agent adaptation, iterated game theory with learning, etc.
2014-10-13 — 2026-05-27
Wherein the Formation of Agent Coalitions Is Treated as a Learning Problem, Classical Exponential Complexity Is Noted, and Cooperative Game Theory Concepts Including the Nucleolus Are Adopted.
Placeholder for notes on multi-agent reinforcement learning (MARL).
MARL is a big topic, which I cannot hope to introduce here, but there are some sub-topics I might get around to.
1 Classic Collectives
COINs etc. TBD.
2 Coalitional MARL
A coalition game asks who teams up with whom and how the joint surplus is split. The classical answer leans on a characteristic function \(v(S)\), the worth of each coalition \(S\) evaluated in isolation. We might recall this from Shapley fairness. Beyond ~20 agents that’s already hopeless: there are \(2^n - 1\) non-empty coalitions, so finding optimal coalitions involves \(O(2^n)\) subproblems, each of which might be hard in itself (e.g. each \(v(S)\) is its own routing problem in cooperative logistics). Scaling to a thousand agents would mean evaluating \(2^{1000} \approx 10^{301}\) such problems.
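To make the blow-up concrete, here is a minimal brute-force sketch of exact Shapley values. The toy characteristic function `toy_v` and the agent names are made up for illustration; the point is only that every coalition gets visited.

```python
from itertools import combinations
from math import factorial

def all_coalitions(agents):
    """Yield every non-empty subset of agents as a frozenset."""
    for size in range(1, len(agents) + 1):
        for members in combinations(agents, size):
            yield frozenset(members)

def shapley_values(agents, v):
    """Exact Shapley values by enumerating every coalition that excludes agent i."""
    n = len(agents)
    values = {}
    for i in agents:
        others = [a for a in agents if a != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Probability that exactly these `size` agents precede i in a random ordering
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for members in combinations(others, size):
                s = frozenset(members)
                total += weight * (v(s | {i}) - v(s))  # marginal contribution of i
        values[i] = total
    return values

# Hypothetical toy game: a coalition is worth the square of its size
def toy_v(coalition):
    return float(len(coalition) ** 2)

agents = ["a", "b", "c"]
n_coalitions = sum(1 for _ in all_coalitions(agents))  # 2^3 - 1 = 7
phi = shapley_values(agents, toy_v)                    # symmetric game: each agent gets 3
digits_for_1000 = len(str(2 ** 1000))                  # 302 digits, i.e. about 1.07e301
```

With three agents this is instant; each extra agent doubles the number of calls to `v`, which is the cost the learning-based approaches below try to avoid.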
2.1 Decentralised partner selection and gain-sharing
MARL takes another path: let agents learn who to ally with and how to share the spoils, from experience, without ever computing a full \(v\). This pushes the agent count up by learning an approximation to the characteristic function. Multi-Agent Gain Sharing (MAGS) trains transformer-based agents to negotiate fair gain shares directly in mixed-motive logistics environments, notionally scaling to >1000 concurrent agents (Sun2024Scalable?).
(Chalkiadakis2003Coordination?) treats unknown opponent capabilities as hidden state in a POMDP: agents maintain Bayesian beliefs over opponent types and update them during coalition formation, which lets them anticipate (and sometimes exploit) irrational or static partners.
Both approaches suffer from a failure mode called transductivity: agents learn to coordinate with the opponents they trained against, not with arbitrary new ones (which would be inductivity).
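The belief-update step can be sketched as a plain discrete Bayes filter over partner types. This is not the paper's full method; the type names, the observation model, and the numbers are assumptions made up for illustration.

```python
def update_belief(prior, likelihoods, observation):
    """One Bayes-rule update of a belief over discrete partner types."""
    unnorm = {t: prior[t] * likelihoods[t][observation] for t in prior}
    z = sum(unnorm.values())
    if z == 0:
        raise ValueError("observation has zero probability under every type")
    return {t: p / z for t, p in unnorm.items()}

# Hypothetical types and observation model: P(task outcome | partner type)
prior = {"strong": 0.5, "weak": 0.5}
likelihoods = {
    "strong": {"success": 0.9, "failure": 0.1},
    "weak": {"success": 0.4, "failure": 0.6},
}

belief = prior
for obs in ["success", "success", "failure"]:
    belief = update_belief(belief, likelihoods, obs)
# belief now weighs both types in proportion to how well they explain the history
```

An agent would then evaluate candidate coalitions in expectation under `belief` rather than under a known capability vector.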
2.2 Model-based MARL for stable partitions
Traditionally, MARL is model-free. Model-based MARL — where agents learn (or are handed) a predictive model of environment dynamics — seems rarely used in coalition formation, but it buys us something the model-free version can’t: theoretical stability guarantees.
I found a worked example in radar-network multitarget tracking (Geng2023Coalition?). Several geographically separated radar stations, each with limited beam-time, jointly track many more moving targets than any one station can cover. Two or three radars pointing at the same target triangulate to a lower-variance estimate than any one alone; that’s the cooperation gain. A coalition is the subset of radars agreeing to illuminate one target; the global partition assigns radars to targets. “Transferable utility” means the joint tracking-quality improvement is a scalar that can be split among coalition members. No central controller dictates assignments; each radar runs a model-based RL policy that learns the value of joining one coalition versus another, and the authors prove convergence to a Nash-stable partition at which no single radar wants to defect.
The model-based part seems important. Nash-stability is a statement about counterfactual defections: for every alternative coalition each radar could switch to, its simulated payoff must be no better than staying put. Model-free MARL only has reliable value estimates along trajectories it actually sampled, so the off-policy counterfactuals are guesses, and the stability claim can’t be certified. A learned dynamics model lets each radar roll out counterfactual predictions along the lines of “what if I retargeted at \(k'\)”, without physically doing so, and check the deviation locally.
The pattern (model-constrained dynamics + transferable-utility coalitions + model-based RL → stability proof) probably generalises beyond radar.
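The counterfactual check behind Nash-stability is simple once some payoff model is available. The sketch below is not the authors' algorithm: `toy_payoff` is a made-up stand-in for a learned model, in which each target has a fixed value split equally among the radars illuminating it.

```python
def is_nash_stable(assignment, targets, payoff):
    """Check every unilateral switch; return (stable, first profitable deviation or None)."""
    for radar, current in assignment.items():
        stay = payoff(radar, current, assignment)
        for alt in targets:
            if alt == current:
                continue
            trial = dict(assignment)  # counterfactual partition, never executed
            trial[radar] = alt
            if payoff(radar, alt, trial) > stay + 1e-12:
                return False, (radar, alt)
    return True, None

# Hypothetical model: tracking value per target, shared equally (transferable utility)
TARGET_VALUE = {"t1": 6.0, "t2": 4.0}

def toy_payoff(radar, target, assignment):
    size = sum(1 for t in assignment.values() if t == target)
    return TARGET_VALUE[target] / size

targets = list(TARGET_VALUE)
crowded = {"r1": "t1", "r2": "t1", "r3": "t1"}   # everyone on t1
balanced = {"r1": "t1", "r2": "t1", "r3": "t2"}

crowded_result = is_nash_stable(crowded, targets, toy_payoff)
balanced_result = is_nash_stable(balanced, targets, toy_payoff)
```

In `crowded` each radar earns 2 but could earn 4 alone on `t2`, so the check reports a deviation; in `balanced` no single switch pays. The quality of the certificate is only as good as the payoff model used for the counterfactuals.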
2.3 Credit assignment via the nucleolus
A naïve way of doing MARL is to force a grand coalition — all agents share one global reward — and rely on credit-assignment heuristics like COMA (Jakob Foerster, Farquhar, et al. 2018) to back out individual contributions. In asymmetric environments (e.g. few cooperators vs. many adversaries in SMAC) the grand coalition might be suboptimal. (Wang2025Nucleolus?) import the concept of the nucleolus from cooperative game theory into Q-learning and approximate it. For an allocation \(x\), a coalition \(S\)’s excess is \(e(S, x) = v(S) - \sum_{i \in S} x_i\) — the gap between what \(S\) could earn alone and what its members get under \(x\). Positive excess means there are grounds to defect. The nucleolus is the allocation that makes the most-aggrieved coalition as un-aggrieved as possible, then the next, and so on. A nucleolus-Q operator with convergence guarantees lets agents autonomously fracture the grand coalition into smaller, task-specific sub-teams, each stable in the sense that no member wants to defect.
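The excess definition and the "most-aggrieved first" ordering can be sketched directly. This only compares two candidate allocations lexicographically; it is not the nucleolus-Q operator and does not solve for the nucleolus. The three-agent game below is made up for illustration.

```python
from itertools import combinations

def sorted_excesses(agents, v, x):
    """Excesses e(S, x) = v(S) - sum_{i in S} x_i over proper non-empty coalitions, largest first."""
    out = []
    for size in range(1, len(agents)):
        for members in combinations(agents, size):
            out.append(v[frozenset(members)] - sum(x[i] for i in members))
    return sorted(out, reverse=True)

agents = ("a", "b", "c")
# Hypothetical characteristic function; the grand coalition is worth 6
v = {
    frozenset({"a"}): 0.0, frozenset({"b"}): 0.0, frozenset({"c"}): 0.0,
    frozenset({"a", "b"}): 4.0, frozenset({"a", "c"}): 3.0, frozenset({"b", "c"}): 2.0,
}

equal = {"a": 2.0, "b": 2.0, "c": 2.0}
tilted = {"a": 3.0, "b": 2.0, "c": 1.0}

e_equal = sorted_excesses(agents, v, equal)    # [0, -1, -2, -2, -2, -2]
e_tilted = sorted_excesses(agents, v, tilted)  # [-1, -1, -1, -1, -2, -3]

# Python compares lists lexicographically, which matches the nucleolus ordering
tilted_is_less_aggrieved = e_tilted < e_equal
```

Under the equal split, coalition {a, b} has excess 0 and is indifferent about leaving; the tilted split pushes every excess to at most -1, so it is preferred in the nucleolus ordering.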
2.4 Opponent shaping
See opponent shaping.
2.5 Noisy observation of partner intent
In a mixed-motive game where every agent is purely selfish, MARL notoriously finds a mutually-defective equilibrium and stays there. One fix is to hard-code prosocial weights — give each agent a reward of the form \(r_i + \alpha \sum_{j \neq i} r_j\), with \(\alpha\) tuned so cooperation pays. That works, apparently, but is not satisfying. It bakes the cooperation in rather than letting it be discovered; it tends to be fragile to defectors, doesn’t generalise off-distribution, and gives no principled story for who should cooperate with whom.
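To see what "tuned so cooperation pays" means, take a two-player prisoner's dilemma with the usual payoffs \(T > R > P > S\) (this notation is an assumption, not from the sources above). Under the shaped reward, cooperating beats defecting against a cooperator when \(\alpha \ge (T-R)/(R-S)\) and against a defector when \(\alpha \ge (P-S)/(T-P)\). A small sketch that checks this:

```python
def shaped(r_self, r_other, alpha):
    """Prosocial reward r_i + alpha * r_j for the two-player case."""
    return r_self + alpha * r_other

def cooperation_is_dominant(T, R, P, S, alpha):
    """True if cooperating is a weakly best reply to both opponent actions."""
    vs_cooperator = shaped(R, R, alpha) >= shaped(T, S, alpha)
    vs_defector = shaped(S, T, alpha) >= shaped(P, P, alpha)
    return vs_cooperator and vs_defector

def alpha_threshold(T, R, P, S):
    """Smallest alpha at which cooperation becomes dominant."""
    return max((T - R) / (R - S), (P - S) / (T - P))

# Illustrative payoffs only
T, R, P, S = 5.0, 3.0, 1.0, 0.0
threshold = alpha_threshold(T, R, P, S)  # 2/3 for these payoffs
```

The threshold depends on the payoff matrix, which is part of why a single hand-tuned \(\alpha\) does not travel well between environments.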
The Randomized Uncertain Social Preferences (RUSP) framework (Baker 2020) addresses this by training agents across a distribution of prosocial weights, where each agent gets only a noisy observation of its own weights and no information about others’. At the start of each episode, we sample a reward-transformation matrix \(W\), where \(W_{ij}\) is the weight agent \(i\) puts on agent \(j\)’s reward, and set agent \(i\)’s shaped reward to \(\sum_j W_{ij} r_j\). Each agent observes only a noisy version of its own row (so it isn’t sure how much it should care about each partner), and nothing about other agents’ rows (so it can’t tell which partners care about it). Training is across the whole distribution of \(W\)s, not a single fixed one.
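The per-episode mechanics can be sketched as follows. The sampling scheme here (rows of random non-negative weights normalised to sum to 1, Gaussian observation noise) is an assumption made up for illustration, not the distribution used in the paper; only the shaped reward \(\sum_j W_{ij} r_j\) is taken from the description above.

```python
import random

def sample_weights(n, rng):
    """Hypothetical sampler: each row holds random non-negative weights summing to 1."""
    rows = []
    for _ in range(n):
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        rows.append([w / total for w in raw])
    return rows

def shaped_rewards(W, r):
    """Agent i's shaped reward is sum_j W[i][j] * r[j]."""
    n = len(r)
    return [sum(W[i][j] * r[j] for j in range(n)) for i in range(n)]

def noisy_own_row(W, i, sigma, rng):
    """What agent i observes: its own row plus Gaussian noise, nothing about other rows."""
    return [w + rng.gauss(0.0, sigma) for w in W[i]]

rng = random.Random(0)            # seeded for repeatability
W = sample_weights(3, rng)        # resampled at the start of every episode
raw = [1.0, 0.0, -1.0]            # raw per-agent rewards at one step
shaped = shaped_rewards(W, raw)
obs0 = noisy_own_row(W, 0, 0.1, rng)
```

Setting `W` to the identity recovers purely selfish agents, and a constant matrix recovers a fully shared team reward, so a sampler like this interpolates between the two regimes.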
Because partner intent isn’t observable, the only way an agent can do well in expectation is to read intent from behaviour — cooperate cautiously, escalate cooperation when reciprocated, punish when defected against. The reported emergent phenomena are: direct reciprocity (tit-for-tat-ish responses to recent behaviour), indirect reputation tracking (treating an agent based on how it treated third parties), and stable in-episode team formation (subsets of agents settling into cooperative pairs while the rest go it alone). None of these are programmed in; they’re the policies that generalise across the \(W\)-distribution. Unlike fixed-prosocial-weights training, the policies reportedly transfer to held-out partner mixes without retraining.
2.6 Open-source game theory
MARL is one possible operationalisation of open-source game theory, in which agents exchange policy source code (or some interpretable proxy) before acting and cooperate by mutual verification.
