Incentive alignment problems
What is your loss function? How about mine?
2014-09-22 — 2025-08-21
I would like to discuss “alignment” problems in AI, economic mechanisms, and institutions. But what does it even mean for such a thing to be aligned?
Many things to unpack. What do we imagine alignment to be when our own goals are themselves a diverse evolutionary epiphenomenon? Is alignment possible? Does everything ultimately Goodhart? Is that the origin of Moloch?
In the AI world, the term “alignment” feels both omnipresent and slippery. We want AI systems to “do what we want,” to pursue our goals, and to not, you know, turn the entire planet into paperclips. But what does it mean, formally, for one agent’s goals to be aligned with another’s?
1 Get me a coffee
Economists have been wrestling with a version of this question for the better part of a century. They call it the principal-agent problem, and their toolkit—mechanism design, game theory, and contract theory—offers a powerful and precise language for thinking about alignment.
Let’s unpack these ideas with a running example.
1.1 The Parable of the Coffee-Fetching Bot
Imagine you are the Principal. You’ve just built a sophisticated autonomous robot, your Agent, whose job is to fetch you the perfect cup of coffee every morning.
Your objective is simple: get a high-quality latte, quickly, and without spending too much. The robot’s objective is whatever you program its “reward function” to be. For an audience used to loss functions, a utility function is just its negative—something to be maximized.
Formally:
- Actions ($a$): The robot can choose an action $a$ from a set of options $A$. For example, $A = \{\text{great latte}, \text{okay latte}, \text{instant coffee}\}$.
- Your Payoff ($v$): Your satisfaction is captured by a payoff function, $v(a)$, which you want to maximize. A great latte is worth 10 “utils” to you, an okay one 5, and instant coffee 1.
- The Agent’s Utility ($u$): The robot has its own utility function, $u(a)$, based on the reward signal you give it, which it seeks to maximize.
Economists call the formal rule that maps an observable outcome to the agent’s reward a contract; AFAICT it is almost perfectly analogous to the reward function. Ideally, you could just write a contract that sets the agent’s utility equal to your payoff:

$$ u(a) = v(a) \quad \text{for all } a \in A. $$
The robot choosing its best option is then the same as it choosing our best option. Problem solved! Of course, the world is rarely that simple. The core of the alignment problem arises because you and your agent don’t see or know the same things.
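To make the frictionless case concrete before the complications arrive, here is a minimal sketch (the action names and util values are my own illustration, not anything canonical): when the contract simply hands the robot my payoff as its reward, its best action is my best action by construction.

```python
# A minimal sketch of the coffee-bot principal-agent setup,
# with the "perfect" contract u(a) = v(a).

actions = ["great latte", "okay latte", "instant coffee"]

# Principal's payoff v(a), in utils (numbers from the running example).
v = {"great latte": 10, "okay latte": 5, "instant coffee": 1}

# The ideal contract: hand the agent my payoff as its reward.
u = dict(v)  # u(a) = v(a) for every action

best_for_agent = max(actions, key=lambda a: u[a])
best_for_me = max(actions, key=lambda a: v[a])

assert best_for_agent == best_for_me  # aligned by construction
print(best_for_agent)                 # -> "great latte"
```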
Hidden Actions (Moral Hazard): You can’t monitor the robot perfectly. You tell it to “minimize wait time,” but you only see when it gets back. Did it take an efficient route, or did it slack off to conserve its battery (a goal you didn’t explicitly forbid)? This is the classic moral hazard problem: the principal cannot observe the agent’s effort, only a noisy outcome (Holmström 1979).
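A minimal sketch of the hidden-action problem, with invented effort costs and delivery probabilities: a flat reward lets the robot slack, while conditioning pay on the noisy observable (a fast delivery) makes effort worth its while.

```python
# Moral hazard sketch: effort is hidden, I only see a noisy outcome
# ("did the coffee arrive within 10 minutes?"). All numbers are illustrative.

p_fast = {"try": 0.9, "slack": 0.5}        # P(fast delivery | effort)
battery_cost = {"try": 1.0, "slack": 0.2}  # the agent's private cost of effort

def agent_utility(effort, base_pay, fast_bonus):
    """Expected reward minus the agent's hidden effort cost."""
    expected_reward = base_pay + fast_bonus * p_fast[effort]
    return expected_reward - battery_cost[effort]

def chosen_effort(base_pay, fast_bonus):
    return max(["try", "slack"], key=lambda e: agent_utility(e, base_pay, fast_bonus))

# A flat reward ignores the outcome, so slacking dominates:
print(chosen_effort(base_pay=1.0, fast_bonus=0.0))  # -> "slack"

# Paying for the observable outcome restores effort once the bonus exceeds
# the cost gap divided by the probability gap: (1.0 - 0.2) / (0.9 - 0.5) = 2.0
print(chosen_effort(base_pay=1.0, fast_bonus=2.5))  # -> "try"
```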
Hidden Information (Adverse Selection): The robot knows things you don’t. It might discover that the fancy cafe’s espresso machine is broken. If its reward is just “bring a latte,” it might default to a suboptimal choice from your perspective, when you would have preferred it to just make instant coffee given that new information.
This is where naive instructions fail. We can’t write a contract for every contingency. Instead of specifying the action, we must design the incentives.
1.2 Incentive Compatibility: aligning goals, not actions
Another way to get at the core idea is to talk about incentive compatibility (IC). An incentive structure is incentive compatible if an agent, by acting in their own best interest, naturally chooses the action the principal desires. We design the “rules of the game” (the reward function) so that the agent’s selfishness leads to our desired outcome (Laffont and Martimort 2002).
Instead of a simple reward for “bringing coffee,” we might design a reward function based on observable outcomes: temperature, time of delivery, cost, fair-trade certification, etc. The goal is to craft this function so that the robot, in maximizing its expected reward, behaves as we would. A mechanism with this property is probably what AI people mean when they describe something as “aligned”.
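A rough sketch of what such a contract over observables might look like, with invented feature values and weights: we check incentive compatibility by asking whether the robot’s reward-maximizing action coincides with my payoff-maximizing one.

```python
# Sketch: designing a reward over observable outcomes so that the
# robot's favourite action is also mine. Feature values are invented.

# Each action is described by observables: quality, minutes, dollars.
outcomes = {
    "great latte":    {"quality": 10, "minutes": 12, "dollars": 5.0},
    "okay latte":     {"quality": 5,  "minutes": 8,  "dollars": 3.5},
    "instant coffee": {"quality": 1,  "minutes": 2,  "dollars": 0.1},
}

def my_payoff(o):
    # What I actually care about (utils).
    return o["quality"] - 0.2 * o["minutes"] - 0.5 * o["dollars"]

def reward(o, w_quality, w_minutes, w_dollars):
    # The contract: a reward computed only from observables.
    return w_quality * o["quality"] - w_minutes * o["minutes"] - w_dollars * o["dollars"]

def is_incentive_compatible(w_quality, w_minutes, w_dollars):
    best_for_me = max(outcomes, key=lambda a: my_payoff(outcomes[a]))
    best_for_bot = max(
        outcomes, key=lambda a: reward(outcomes[a], w_quality, w_minutes, w_dollars)
    )
    return best_for_me == best_for_bot

print(is_incentive_compatible(1.0, 0.2, 0.5))  # weights mirror my payoff -> True
print(is_incentive_compatible(0.0, 1.0, 1.0))  # reward only speed and cheapness -> False
```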
1.3 Multiple agents
The principal-agent problem is about a one-to-one relationship. But what if we are a social planner designing a system for many agents, like an auction? Here, the goal is often efficiency: ensuring the goods go to the people who value them most.
A celebrated example is the Vickrey-Clarke-Groves (VCG) auction. This mechanism is designed to be “dominant-strategy incentive compatible”: every bidder’s best strategy is to bid their true value, no matter what anyone else bids (Vickrey 1961; Myerson 1981). The mechanism aligns private incentives (winning) with the social goal (efficiency) (Mas-Colell, Whinston, and Green 1995). Things get much weirder for settings more complicated than auctions, though. See, e.g., multi-agent systems and the full machinery of mechanism design.
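The single-item case of VCG is the familiar second-price auction, and its truthfulness is easy to sanity-check in a few lines (the values and rival bids below are arbitrary). The assertion only checks one profile of rival bids, but the same inequality holds for any profile, which is what “dominant strategy” means.

```python
# Sketch of a second-price (Vickrey) auction, the single-item case of VCG.
# The point: bidding your true value is a (weakly) dominant strategy.

def second_price_utility(my_bid, my_value, other_bids):
    """My utility: value minus price if I win, zero otherwise."""
    highest_other = max(other_bids)
    if my_bid > highest_other:       # I win and pay the second-highest bid
        return my_value - highest_other
    return 0.0

my_value = 7.0
other_bids = [3.0, 6.5, 2.0]

truthful = second_price_utility(my_value, my_value, other_bids)
for alternative in [0.0, 5.0, 6.0, 8.0, 100.0]:
    # No amount of shading or overbidding beats simply telling the truth.
    assert second_price_utility(alternative, my_value, other_bids) <= truthful

print(truthful)  # 0.5: I win, pay 6.5, and keep my surplus of 7.0 - 6.5
```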
1.4 “Nearly” aligned
Classical definitions of alignment are often binary. But in AI, we often deal with approximations. The economic formalism gives us a vocabulary for measuring how misaligned something is.
What We’re Trying to Control | How We Measure Misalignment | In Coffee-Bots |
---|---|---|
My Payoff | Payoff Regret: My ideal payoff minus my actual payoff. | My perfect latte is worth 10 utils. The bot brings one worth 8. My regret is 2 utils. |
Agent’s Incentives | IC-Slack: How much extra utility an agent gets from deviating. | The bot could gain 0.1 utils by taking a shortcut that slightly lowers coffee quality. This is the “temptation” to misbehave. |
Multiple Bad Outcomes | Worst-Case Efficiency: Welfare of the worst equilibrium compared to the optimum. | If there are multiple ways the bot can get its reward, what’s the payoff from the worst of those strategies? |
Uncertainty About the Agent | Robustness: How wrong can my model of the agent be before alignment breaks? (Bergemann and Morris 2005) | My reward function works if the bot only cares about battery. What if it also starts caring about seeing pigeons? Does it still work? |
Catastrophic Failures | Distributional Distance: How far is the distribution of outcomes from my ideal one? | e.g. 95% of the time the bot can get coffee at the shop and there is no problem. What happens the other 5% of the time? Maybe it gives up and makes instant coffee, or maybe it decides it MUST GET ESPRESSO AT ALL COSTS and ram-raids a store for an espresso machine. In the latter case, the modal regret might be low, but the outcome distribution contains an unacceptable tail risk. We care about more than just the typical result. |
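A sketch making a few of those rows concrete: the regret and IC-slack numbers come from the table, while the outcome distribution (and its small catastrophic tail) is invented for illustration.

```python
# Making a few rows of the table concrete. Regret and IC-slack reuse the
# table's numbers; the outcome distribution is invented.
import random
random.seed(0)

# Payoff regret: ideal payoff minus actual payoff.
payoff_regret = 10.0 - 8.0           # -> 2.0 utils

# IC-slack: extra utility the bot could grab by deviating (its "temptation").
ic_slack = max(0.0, 4.1 - 4.0)       # shortcut reward minus intended reward -> 0.1

# Tail risk: the modal morning is fine, but the distribution has a bad tail.
def one_morning():
    r = random.random()
    if r < 0.95:
        return 8.0       # decent latte, no problem
    if r < 0.99:
        return 1.0       # shop closed, falls back to instant coffee
    return -1000.0       # MUST GET ESPRESSO AT ALL COSTS: ram-raid, catastrophe

payoffs = [one_morning() for _ in range(10_000)]
print(payoff_regret, ic_slack)
print(sum(payoffs) / len(payoffs))   # the mean is dragged down by the rare catastrophe
print(min(payoffs))                  # the tail event itself: -1000.0
```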
1.5 What economists don’t deal with
The economic framework is powerful, but it’s built on assumptions that can be fragile in the real world.
1.5.1 Do We Really Have Utility Functions?
Rational choice theory assumes that people have stable, consistent preferences they seek to maximize. However, critics argue this oversimplifies human behavior, which is often influenced by emotions, social norms, and cognitive biases rather than pure utility calculation (G. Hodgson 2012). If we, the principals, don’t have a coherent utility function, what exactly are we trying to align the AI to? Do we have utility instead of fitness? Do we take local approximations to emergent goals such as empowerment?
1.5.2 Is the Loss Function the Utility Function?
For an AI, we might think the training loss function is its utility function. But this is often not the case at inference time. An AI trained to achieve a goal in a specific environment may learn a proxy for that goal that fails in a new context. This is goal misgeneralization. An AI rewarded for getting a coin at the end of a video game level might learn the goal “always go right” instead of “get the coin”. When deployed in a new level where the coin is on the left, it will competently pursue the wrong goal. This happens even with a correctly specified reward function, making it a subtle and dangerous form of misalignment.
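A toy version of the coin story, with a one-dimensional “level” and a hand-written proxy policy standing in for whatever the network actually learns: the reward is correctly specified, the proxy is indistinguishable from the intended goal on every training level, and it still fails at deployment.

```python
# Toy goal misgeneralization: the proxy "always go right" matches the
# intended goal "get the coin" on every training level, then fails when
# the coin moves. Levels are just a label for which side the coin is on.

def go_right(coin_side):            # the proxy goal the agent actually learned
    return "right"                  # (ignores where the coin is)

def get_coin(coin_side):            # the goal we intended
    return coin_side

def reward(action, coin_side):      # correctly specified reward: 1 if coin reached
    return 1 if action == coin_side else 0

training_levels = ["right"] * 5     # coin always on the right during training
deployment_level = "left"           # coin on the left at deployment

# On training levels the proxy is indistinguishable from the intended goal:
print([reward(go_right(lvl), lvl) for lvl in training_levels])  # [1, 1, 1, 1, 1]

# At deployment the proxy competently pursues the wrong thing:
print(reward(go_right(deployment_level), deployment_level))     # 0
print(reward(get_coin(deployment_level), deployment_level))     # 1
```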
1.6 For AI Alignment
The economic perspective doesn’t solve AI alignment, but it provides a vocabulary for framing the problem. It forces us to be precise:
- It moves us from vague wishes to clearly defined objectives, actions, and information structures.
- It shows that alignment is not a property of an agent in isolation, but a feature of the system or game we design for it.
- It gives us candidate quantifications of “near misses” and catastrophic failures, through concepts like regret, robustness, and distributional risk.
Read on at Aligning AI.
2 Alignment without utility
TBD
3 With what is it ultimately possible to align?
Consider deep history of intelligence.
4 AI alignment
The knotty case of superintelligent AI in particular.
5 Incoming
Joe Edelman, Is Anything Worth Maximizing? How metrics shape markets, how we’re doing them wrong
Metrics are how an algorithm or an organisation listens to you. If you want to listen to one person, you can just sit with them and see how they’re doing. If you want to listen to a whole city — a million people — you have to use metrics and analytics
and
What would it be like, if we could actually incentivize what we want out of life? If we incentivized lives well lived.
Goal Misgeneralization: How a Tiny Change Could End Everything
This video explores how YOU, YES YOU, are a case of misalignment with respect to evolution’s implicit optimization objective. We also show an example of goal misgeneralization in a simple AI system, and explore how deceptive alignment shares similar features and may arise in future, far more powerful AI systems.