Incentive alignment problems
What is your loss function? How about mine?
2014-09-22 — 2025-08-21
I would like to discuss “alignment” problems in AI, economic mechanisms, and institutions. But what does it even mean for such a thing to be aligned?
Many things to unpack. What do we imagine alignment to be when our own goals are themselves a diverse evolutionary epiphenomenon? Is alignment possible? Does everything ultimately Goodhart? Is that the origin of Moloch?
In the AI world, the term “alignment” feels both omnipresent and slippery. We want AI systems to “do what we want,” to pursue our goals, and to not, you know, turn the entire planet into paperclips. But what does it mean, formally, for one agent’s goals to be aligned with another’s?
1 Get me a coffee
Economists have been wrestling with a version of this question for the better part of a century. They call it the principal-agent problem, and their toolkit—mechanism design, game theory, and contract theory—offers a powerful and precise language for thinking about alignment.
Let’s unpack these ideas with a running example.
1.1 The Parable of the Coffee-Fetching Bot
Imagine you are the Principal. You’ve just built a sophisticated autonomous robot, your Agent, whose job is to fetch you the perfect cup of coffee every morning.
Your objective is simple: get a high-quality latte, quickly, and without spending too much. The robot’s objective is whatever you program its “reward function” to be. For an audience used to loss functions, a utility function is just its negative—something to be maximized.
Formally:
- Actions ($a$): The robot can choose an action $a$ from a set of options $A$. For example, $A = \{\text{great latte}, \text{okay latte}, \text{instant coffee}\}$.
- Your Payoff ($v$): Your satisfaction is captured by a payoff function, $v(a)$, which you want to maximize. A great latte is worth 10 “utils” to you, an okay one 5, and instant coffee 1.
- The Agent’s Utility ($u$): The robot has its own utility function, $u(a)$, based on the reward signal you give it, which it seeks to maximize.
Economists call the formal rule that maps an observable outcome to the agent’s reward a contract; AFAICT it is almost perfectly analogous to the reward function. Ideally, you could just write a contract that sets the agent’s utility equal to your payoff:

$$ u(a) = v(a) \quad \text{for all } a \in A. $$
The robot choosing its best option is then the same as it choosing our best option. Problem solved! Of course, the world is rarely that simple. The core of the alignment problem arises because you and your agent don’t see or know the same things.
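To make the frictionless case concrete before the complications arrive, here is a minimal sketch (the action names and util values are my own illustration, not anything canonical): when the contract simply hands the robot my payoff as its reward, its best action is my best action by construction.

```python
# A minimal sketch of the coffee-bot principal-agent setup,
# with the "perfect" contract u(a) = v(a).

actions = ["great latte", "okay latte", "instant coffee"]

# Principal's payoff v(a), in utils (numbers from the running example).
v = {"great latte": 10, "okay latte": 5, "instant coffee": 1}

# The ideal contract: hand the agent my payoff as its reward.
u = dict(v)  # u(a) = v(a) for every action

best_for_agent = max(actions, key=lambda a: u[a])
best_for_me = max(actions, key=lambda a: v[a])

assert best_for_agent == best_for_me  # aligned by construction
print(best_for_agent)                 # -> "great latte"
```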
Hidden Actions (Moral Hazard): You can’t monitor the robot perfectly. You tell it to “minimize wait time,” but you only see when it gets back. Did it take an efficient route, or did it slack off to conserve its battery (a goal you didn’t explicitly forbid)? This is the classic moral hazard problem: the principal cannot observe the agent’s effort, only a noisy outcome (Holmström 1979).
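A minimal sketch of the hidden-action problem, with invented effort costs and delivery probabilities: a flat reward lets the robot slack, while conditioning pay on the noisy observable (a fast delivery) makes effort worth its while.

```python
# Moral hazard sketch: effort is hidden, I only see a noisy outcome
# ("did the coffee arrive within 10 minutes?"). All numbers are illustrative.

p_fast = {"try": 0.9, "slack": 0.5}        # P(fast delivery | effort)
battery_cost = {"try": 1.0, "slack": 0.2}  # the agent's private cost of effort

def agent_utility(effort, base_pay, fast_bonus):
    """Expected reward minus the agent's hidden effort cost."""
    expected_reward = base_pay + fast_bonus * p_fast[effort]
    return expected_reward - battery_cost[effort]

def chosen_effort(base_pay, fast_bonus):
    return max(["try", "slack"], key=lambda e: agent_utility(e, base_pay, fast_bonus))

# A flat reward ignores the outcome, so slacking dominates:
print(chosen_effort(base_pay=1.0, fast_bonus=0.0))  # -> "slack"

# Paying for the observable outcome restores effort once the bonus exceeds
# the cost gap divided by the probability gap: (1.0 - 0.2) / (0.9 - 0.5) = 2.0
print(chosen_effort(base_pay=1.0, fast_bonus=2.5))  # -> "try"
```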
Hidden Information (Adverse Selection): The robot knows things you don’t. It might discover that the fancy cafe’s espresso machine is broken. If its reward is just “bring a latte,” it might default to a suboptimal choice from your perspective, when you would have preferred it to just make instant coffee given that new information.
This is where naive instructions fail. We can’t write a contract for every contingency. Instead of specifying the action, we must design the incentives.
1.2 Incentive Compatibility: aligning goals, not actions
Another way to get at the core idea is to talk about incentive compatibility (IC). An incentive structure is incentive compatible if an agent, by acting in their own best interest, naturally chooses the action the principal desires. We design the “rules of the game” (the reward function) so that the agent’s selfishness leads to our desired outcome (Laffont and Martimort 2002).
Instead of a simple reward for “bringing coffee,” we might design a reward function based on observable outcomes: temperature, time of delivery, cost, fair-trade certification, etc. The goal is to craft this function so that the robot, in maximizing its expected reward, behaves as we would. A mechanism with this property is probably what AI people mean when they describe something as “aligned”.
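A rough sketch of what such a contract over observables might look like, with invented feature values and weights: we check incentive compatibility by asking whether the robot’s reward-maximizing action coincides with my payoff-maximizing one.

```python
# Sketch: designing a reward over observable outcomes so that the
# robot's favourite action is also mine. Feature values are invented.

# Each action is described by observables: quality, minutes, dollars.
outcomes = {
    "great latte":    {"quality": 10, "minutes": 12, "dollars": 5.0},
    "okay latte":     {"quality": 5,  "minutes": 8,  "dollars": 3.5},
    "instant coffee": {"quality": 1,  "minutes": 2,  "dollars": 0.1},
}

def my_payoff(o):
    # What I actually care about (utils).
    return o["quality"] - 0.2 * o["minutes"] - 0.5 * o["dollars"]

def reward(o, w_quality, w_minutes, w_dollars):
    # The contract: a reward computed only from observables.
    return w_quality * o["quality"] - w_minutes * o["minutes"] - w_dollars * o["dollars"]

def is_incentive_compatible(w_quality, w_minutes, w_dollars):
    best_for_me = max(outcomes, key=lambda a: my_payoff(outcomes[a]))
    best_for_bot = max(
        outcomes, key=lambda a: reward(outcomes[a], w_quality, w_minutes, w_dollars)
    )
    return best_for_me == best_for_bot

print(is_incentive_compatible(1.0, 0.2, 0.5))  # weights mirror my payoff -> True
print(is_incentive_compatible(0.0, 1.0, 1.0))  # reward only speed and cheapness -> False
```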
1.3 Multiple agents
The principal-agent problem is about a one-to-one relationship. But what if we are a social planner designing a system for many agents, like an auction? Here, the goal is often efficiency: ensuring the goods go to the people who value them most.
A celebrated example is the Vickrey-Clarke-Groves (VCG) auction. This mechanism is designed to be “dominant-strategy incentive compatible”: every bidder’s best strategy is to bid their true value, no matter what anyone else bids (Vickrey 1961; Myerson 1981). The mechanism aligns private incentives (winning) with the social goal (efficiency) (Mas-Colell, Whinston, and Green 1995). Things get much weirder for settings more complicated than auctions, though. See, e.g., multi-agent systems and the full machinery of mechanism design.
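The single-item case of VCG is the familiar second-price auction, and its truthfulness is easy to sanity-check in a few lines (the values and rival bids below are arbitrary). The assertion only checks one profile of rival bids, but the same inequality holds for any profile, which is what “dominant strategy” means.

```python
# Sketch of a second-price (Vickrey) auction, the single-item case of VCG.
# The point: bidding your true value is a (weakly) dominant strategy.

def second_price_utility(my_bid, my_value, other_bids):
    """My utility: value minus price if I win, zero otherwise."""
    highest_other = max(other_bids)
    if my_bid > highest_other:       # I win and pay the second-highest bid
        return my_value - highest_other
    return 0.0

my_value = 7.0
other_bids = [3.0, 6.5, 2.0]

truthful = second_price_utility(my_value, my_value, other_bids)
for alternative in [0.0, 5.0, 6.0, 8.0, 100.0]:
    # No amount of shading or overbidding beats simply telling the truth.
    assert second_price_utility(alternative, my_value, other_bids) <= truthful

print(truthful)  # 0.5: I win, pay 6.5, and keep my surplus of 7.0 - 6.5
```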
1.4 “Nearly” aligned
Classical definitions of alignment are often binary. But in AI, we often deal with approximations. The economic formalism gives us a vocabulary for measuring how misaligned something is.
What We’re Trying to Control | How We Measure Misalignment | In Coffee-Bots |
---|---|---|
My Payoff | Payoff Regret: My ideal payoff minus my actual payoff. | My perfect latte is worth 10 utils. The bot brings one worth 8. My regret is 2 utils. |
Agent’s Incentives | IC-Slack: How much extra utility an agent gets from deviating. | The bot could gain 0.1 utils by taking a shortcut that slightly lowers coffee quality. This is the “temptation” to misbehave. |
Multiple Bad Outcomes | Worst-Case Efficiency: Welfare of the worst equilibrium compared to the optimum. | If there are multiple ways the bot can get its reward, what’s the payoff from the worst of those strategies? |
Uncertainty About the Agent | Robustness: How wrong can my model of the agent be before alignment breaks? (Bergemann and Morris 2005) | My reward function works if the bot only cares about battery. What if it also starts caring about seeing pigeons? Does it still work? |
Catastrophic Failures | Distributional Distance: How far is the distribution of outcomes from my ideal one? | e.g. 95% of the time the bot can get coffee at the shop and there is no problem. What happens the other 5% of the time? Maybe it gives up and makes instant coffee, or maybe it decides it MUST GET ESPRESSO AT ALL COSTS and ram-raids a store for an espresso machine. In the latter case, the modal regret might be low, but the outcome distribution contains an unacceptable tail risk. We care about more than just the typical result. |
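A sketch making a few of those rows concrete: the regret and IC-slack numbers come from the table, while the outcome distribution (and its small catastrophic tail) is invented for illustration.

```python
# Making a few rows of the table concrete. Regret and IC-slack reuse the
# table's numbers; the outcome distribution is invented.
import random
random.seed(0)

# Payoff regret: ideal payoff minus actual payoff.
payoff_regret = 10.0 - 8.0           # -> 2.0 utils

# IC-slack: extra utility the bot could grab by deviating (its "temptation").
ic_slack = max(0.0, 4.1 - 4.0)       # shortcut reward minus intended reward -> 0.1

# Tail risk: the modal morning is fine, but the distribution has a bad tail.
def one_morning():
    r = random.random()
    if r < 0.95:
        return 8.0       # decent latte, no problem
    if r < 0.99:
        return 1.0       # shop closed, falls back to instant coffee
    return -1000.0       # MUST GET ESPRESSO AT ALL COSTS: ram-raid, catastrophe

payoffs = [one_morning() for _ in range(10_000)]
print(payoff_regret, ic_slack)
print(sum(payoffs) / len(payoffs))   # the mean is dragged down by the rare catastrophe
print(min(payoffs))                  # the tail event itself: -1000.0
```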
1.5 What economists don’t deal with
The economic framework is powerful, but it’s built on assumptions that can be fragile in the real world.
1.5.1 Do We Really Have Utility Functions?
Rational choice theory assumes that people have stable, consistent preferences they seek to maximize. However, critics argue this oversimplifies human behavior, which is often influenced by emotions, social norms, and cognitive biases rather than pure utility calculation (G. Hodgson 2012). If we, the principals, don’t have a coherent utility function, what exactly are we trying to align the AI to? Do we have utility instead of fitness? Do we take local approximations to emergent goals such as empowerment?
1.5.2 Is the Loss Function the Utility Function?
For an AI, we might think the training loss function is its utility function. But this is often not the case at inference time. An AI trained to achieve a goal in a specific environment may learn a proxy for that goal that fails in a new context. This is goal misgeneralization. An AI rewarded for getting a coin at the end of a video game level might learn the goal “always go right” instead of “get the coin”. When deployed in a new level where the coin is on the left, it will competently pursue the wrong goal. This happens even with a correctly specified reward function, making it a subtle and dangerous form of misalignment.
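A toy version of the coin story, with a one-dimensional “level” and a hand-written proxy policy standing in for whatever the network actually learns: the reward is correctly specified, the proxy is indistinguishable from the intended goal on every training level, and it still fails at deployment.

```python
# Toy goal misgeneralization: the proxy "always go right" matches the
# intended goal "get the coin" on every training level, then fails when
# the coin moves. Levels are just a label for which side the coin is on.

def go_right(coin_side):            # the proxy goal the agent actually learned
    return "right"                  # (ignores where the coin is)

def get_coin(coin_side):            # the goal we intended
    return coin_side

def reward(action, coin_side):      # correctly specified reward: 1 if coin reached
    return 1 if action == coin_side else 0

training_levels = ["right"] * 5     # coin always on the right during training
deployment_level = "left"           # coin on the left at deployment

# On training levels the proxy is indistinguishable from the intended goal:
print([reward(go_right(lvl), lvl) for lvl in training_levels])  # [1, 1, 1, 1, 1]

# At deployment the proxy competently pursues the wrong thing:
print(reward(go_right(deployment_level), deployment_level))     # 0
print(reward(get_coin(deployment_level), deployment_level))     # 1
```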
1.6 For AI Alignment
The economic perspective doesn’t solve AI alignment, but it provides a vocabulary for framing the problem. It forces us to be precise:
- It moves us from vague wishes to clearly defined objectives, actions, and information structures.
- It shows that alignment is not a property of an agent in isolation, but a feature of the system or game we design for it.
- It gives us candidate quantifications of “near misses” and catastrophic failures, through concepts like regret, robustness, and distributional risk.
Read on at Aligning AI.
2 Alignment without utility
TBD
3 With what is it ultimately possible to align?
Consider deep history of intelligence.
4 AI alignment
The knotty case of superintelligent AI in particular.
5 Incoming
Joe Edelman, Is Anything Worth Maximizing? How metrics shape markets, how we’re doing them wrong
Metrics are how an algorithm or an organisation listens to you. If you want to listen to one person, you can just sit with them and see how they’re doing. If you want to listen to a whole city — a million people — you have to use metrics and analytics
and
What would it be like, if we could actually incentivize what we want out of life? If we incentivized lives well lived.
Goal Misgeneralization: How a Tiny Change Could End Everything
This video explores how YOU, YES YOU, are a case of misalignment with respect to evolution’s implicit optimization objective. We also show an example of goal misgeneralization in a simple AI system, and explore how deceptive alignment shares similar features and may arise in future, far more powerful AI systems.