Assumed audience:

AI people who think the definition of ‘Alignment’ might be intuitive

I would like to discuss “alignment” problems in AI, economic mechanisms, and institutions. But what even is it for such a thing to be aligned?

Many things to unpack. What do we imagine alignment to be when our own goals are themselves a diverse evolutionary epiphenomenon? Is alignment possible? Does everything ultimately Goodhart? Is that the origin of Moloch?

In the AI world, the term “alignment” feels both omnipresent and slippery. We want AI systems to “do what we want,” to pursue our goals, and to not, you know, turn the entire planet into paperclips. But what does it mean, formally, for one agent’s goals to be aligned with another’s?

1 Get me a coffee

Economists have been wrestling with a version of this question for the better part of a century. They call it the principal-agent problem, and their toolkit—mechanism design, game theory, and contract theory—offers a powerful and precise language for thinking about alignment.

Let’s unpack these ideas with a running example.

1.1 The Parable of the Coffee-Fetching Bot

Imagine you are the Principal. You’ve just built a sophisticated autonomous robot, your Agent, whose job is to fetch you the perfect cup of coffee every morning.

Your objective is simple: get a high-quality latte, quickly, and without spending too much. The robot’s objective is whatever you program its “reward function” to be. For an audience used to loss functions, a utility function is just its negative—something to be maximized.

Formally:

  • Actions (a): The robot can choose an action a from a set of options A. For example, A = {Go to Starbucks, Go to local cafe, Brew instant}.
  • Your Payoff (Π): Your satisfaction is captured by a payoff function, Π(a), which you want to maximize. A great latte is worth 10 “utils” to you, an okay one 5, and instant coffee 1.
  • The Agent’s Utility (U): The robot has its own utility function, U(a), based on the reward signal you give it, which it seeks to maximize.

Economists call the formal rule that maps observable outcomes to the agent’s rewards a contract; AFAICT it is almost perfectly analogous to the reward function. Ideally, you could just write a contract that sets the agent’s utility equal to your payoff: U(a) = Π(a). If so, the action that maximizes the robot’s utility would, by definition, also maximize yours. This is perfect alignment:

$$\arg\max_{a \in A} U(a) = \arg\max_{a \in A} \Pi(a)$$

The robot choosing its best option is it choosing our best option. Problem solved! Of course, the world is rarely that simple. The core of the alignment problem arises because you and your agent don’t see or know the same things.
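
To make the setup concrete, here is a minimal sketch of the coffee-bot example in Python. The action names and the mapping of utils to cafés are illustrative; the `best_action` helper is hypothetical, not any particular library’s API.

```python
# Minimal sketch of the coffee-bot principal-agent setup.
# Numbers are the illustrative "utils" from the running example.
actions = ["starbucks", "local_cafe", "instant"]

# Principal's payoff Pi(a): a great latte is worth 10, an okay one 5, instant 1.
principal_payoff = {"starbucks": 5, "local_cafe": 10, "instant": 1}

# Agent's utility U(a): whatever reward function we programmed.
# Here the contract simply sets U(a) = Pi(a) -- the perfectly aligned case.
agent_utility = dict(principal_payoff)

def best_action(values):
    """The action that maximizes a value function over the action set."""
    return max(values, key=values.get)

# Perfect alignment: argmax_a U(a) == argmax_a Pi(a).
print(best_action(agent_utility), best_action(principal_payoff))  # both "local_cafe"
assert best_action(agent_utility) == best_action(principal_payoff)
```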

  1. Hidden Actions (Moral Hazard): You can’t monitor the robot perfectly. You tell it to “minimize wait time,” but you only see when it gets back. Did it take an efficient route, or did it slack off to conserve its battery (a goal you didn’t explicitly forbid)? This is the classic moral hazard problem: the principal cannot observe the agent’s effort, only a noisy outcome (Holmström 1979).

  2. Hidden Information (Adverse Selection): The robot knows things you don’t. It might discover that the fancy cafe’s espresso machine is broken. If its reward is just “bring a latte,” it might default to a suboptimal choice from your perspective, when you would have preferred it to just make instant coffee given that new information.

This is where naive instructions fail. We can’t write a contract for every contingency. Instead of specifying the action, we must design the incentives.
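
As a toy illustration of why we design incentives rather than dictate actions, here is a hedged sketch of the moral-hazard case: the robot’s effort is hidden, I only see a noisy delivery time, and the contract form determines whether slacking pays. The contract shapes, effort cost, and noise model are all assumptions made up for illustration.

```python
# Toy moral-hazard sketch: hidden effort, noisy observable outcome,
# and two candidate contracts defined only on the outcome.
import random

random.seed(0)

def delivery_time(effort):
    """Hidden action -> noisy observable outcome (minutes)."""
    return 20 - 10 * effort + random.gauss(0, 2)

def flat_contract(minutes):
    return 5.0                                   # paid the same regardless

def outcome_contract(minutes):
    return max(0.0, 10.0 - 0.5 * minutes)        # faster delivery, more reward

def expected_agent_utility(contract, effort, effort_cost=3.0, n=5000):
    """Agent's expected pay minus its private cost of exerting effort."""
    pay = sum(contract(delivery_time(effort)) for _ in range(n)) / n
    return pay - effort_cost * effort

for contract in (flat_contract, outcome_contract):
    best = max([0.0, 0.5, 1.0], key=lambda e: expected_agent_utility(contract, e))
    print(contract.__name__, "-> agent's preferred effort:", best)
# Under the flat contract the agent slacks (effort 0.0);
# under the outcome-contingent contract, exerting effort becomes worthwhile.
```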

1.2 Incentive compatibility: aligning goals, not actions

Another way to get at the core idea is to talk about incentive compatibility (IC). An incentive structure is incentive compatible if an agent, by acting in their own best interest, naturally chooses the action the principal desires. We design the “rules of the game” (the reward function) so that the agent’s selfishness leads to our desired outcome.

Instead of a simple reward for “bringing coffee,” we might design a reward function based on observable outcomes: temperature, time of delivery, cost, fair-trade certification, etc. The goal is to craft this function so that the robot, in maximizing its expected reward, behaves as we would. A mechanism with this property is probably what AI people mean when they describe something as “aligned”.
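
A sketch of what that might look like: two candidate contracts defined over observables (quality, wait, cost), and a check of whether the robot’s selfish choice coincides with mine. The observables, weights, and contract forms are invented for illustration.

```python
# Candidate reward functions defined on observable outcomes, and a crude
# incentive-compatibility check over a small discrete action set.
outcomes = {
    "starbucks":  {"quality": 5,  "wait": 15, "cost": 6.0},
    "local_cafe": {"quality": 10, "wait": 10, "cost": 5.0},
    "instant":    {"quality": 1,  "wait": 2,  "cost": 0.5},
}

def principal_payoff(o):                 # what I actually care about
    return o["quality"] - 0.2 * o["wait"] - 0.5 * o["cost"]

def reward_balanced(o):                  # rewards quality, mildly penalizes waiting
    return o["quality"] - 0.1 * o["wait"]

def reward_speed_only(o):                # naive contract: just reward a fast delivery
    return -o["wait"]

def is_incentive_compatible(reward):
    """IC (on this action set): the robot's best action is also my best action."""
    principal_best = max(outcomes, key=lambda a: principal_payoff(outcomes[a]))
    agent_best = max(outcomes, key=lambda a: reward(outcomes[a]))
    return agent_best == principal_best

print("balanced contract IC?  ", is_incentive_compatible(reward_balanced))    # True
print("speed-only contract IC?", is_incentive_compatible(reward_speed_only))  # False
```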

1.3 Multiple agents

The principal-agent problem is about a one-to-one relationship. But what if we are a social planner designing a system for many agents, like an auction? Here, the goal is often efficiency: ensuring the goods go to the people who value them most.

A celebrated example is the Vickrey-Clarke-Groves (VCG) auction. This mechanism is designed to be “dominant-strategy incentive compatible”: every bidder’s best strategy is to bid their true value, no matter what anyone else bids (Vickrey 1961). The mechanism aligns private incentives (winning) with the social goal (efficiency). Stuff gets much weirder for more complicated things than auctions though. See, e.g., multi-agent systems and the full machinery of mechanism design.
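
For concreteness, here is a sketch of the single-item special case of VCG, the second-price auction, checking numerically that no deviation from truthful bidding does better against one fixed set of rival bids. The values and the bid grid are made up.

```python
# Second-price (Vickrey) auction: the single-item case of VCG.
# Sanity check that truthful bidding weakly dominates, for one set of rival bids.
def my_utility(my_bid, other_bids, my_value):
    """Value minus price if I win; the price is the highest competing bid."""
    top_other = max(other_bids)
    return my_value - top_other if my_bid > top_other else 0.0

my_value = 7.0
other_bids = [3.0, 6.0, 5.0]          # whatever the others happen to bid

truthful = my_utility(my_value, other_bids, my_value)
best_deviation = max(my_utility(b / 2, other_bids, my_value) for b in range(0, 41))

print("truthful utility:", truthful, "| best deviating utility:", best_deviation)
assert best_deviation <= truthful + 1e-9   # deviating never helps
```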

1.4 “Nearly” aligned

Classical definitions of alignment are often binary. But in AI, we often deal with approximations. The economic formalism gives us a vocabulary for measuring how misaligned something is.

| What We’re Trying to Control | How We Measure Misalignment | In Coffee-Bots |
|---|---|---|
| My Payoff | Payoff Regret: my ideal payoff minus my actual payoff. | My perfect latte is worth 10 utils. The bot brings one worth 8. My regret is 2 utils. |
| Agent’s Incentives | IC-Slack: how much extra utility an agent gets from deviating. | The bot could gain 0.1 utils by taking a shortcut that slightly lowers coffee quality. This is the “temptation” to misbehave. |
| Multiple Bad Outcomes | Worst-Case Efficiency: welfare of the worst equilibrium compared to the optimum. | If there are multiple ways the bot can get its reward, what’s the payoff from the worst of those strategies? |
| Uncertainty About the Agent | Robustness: how wrong can my model of the agent be before alignment breaks? (Bergemann and Morris 2005) | My reward function works if the bot only cares about battery. What if it also starts caring about seeing pigeons? Does it still work? |
| Catastrophic Failures | Distributional Distance: how far is the distribution of outcomes from my ideal one? | 95% of the time the bot can get coffee at the shop and there is no problem. What happens in the other 5%? Maybe it gives up and makes instant coffee, or maybe it decides it MUST GET ESPRESSO AT ALL COSTS and ram-raids a store for an espresso machine. In the latter case, the modal regret might be low, but the outcome distribution contains an unacceptable tail risk. We care about more than just the typical result. |
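
Two of these measures are easy to compute directly in the toy setup. A sketch, reusing the regret and temptation numbers from the table; the “shortcut_latte” action is a hypothetical addition to the action set:

```python
# Payoff regret and IC-slack for the coffee bot, using the table's toy numbers.
principal_payoff = {"starbucks": 5, "local_cafe": 10, "instant": 1, "shortcut_latte": 8}
agent_utility = {"starbucks": 5.0, "local_cafe": 10.0, "instant": 1.0, "shortcut_latte": 10.1}

intended = max(principal_payoff, key=principal_payoff.get)   # what I want: local_cafe
chosen = max(agent_utility, key=agent_utility.get)           # what the bot does: shortcut_latte

# Payoff regret: my ideal payoff minus the payoff I actually get.
regret = principal_payoff[intended] - principal_payoff[chosen]

# IC-slack: the extra utility the agent gets from deviating -- its "temptation".
ic_slack = agent_utility[chosen] - agent_utility[intended]

print(f"payoff regret = {regret} utils, IC-slack = {ic_slack:.1f} utils")
# -> payoff regret = 2 utils, IC-slack = 0.1 utils
```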

1.5 What economists don’t deal with

The economic framework is powerful, but it’s built on assumptions that can be fragile in the real world.

1.5.1 Do We Really Have Utility Functions?

Rational choice theory assumes that people have stable, consistent preferences they seek to maximize. However, critics argue this oversimplifies human behavior, which is often influenced by emotions, social norms, and cognitive biases rather than pure utility calculation (Hodgson 2012). If we, the principals, don’t have a coherent utility function, what exactly are we trying to align the AI to? Do we have utility at all, or just evolutionary fitness? Do we take local approximations to emergent goals such as empowerment?

1.5.2 Is the Loss Function the Utility Function?

For an AI, we might think the training loss function is its utility function. But this is often not the case at inference time. An AI trained to achieve a goal in a specific environment may learn a proxy for that goal that fails in a new context. This is goal misgeneralization. An AI rewarded for getting a coin at the end of a video game level might learn the goal “always go right” instead of “get the coin”. When deployed in a new level where the coin is on the left, it will competently pursue the wrong goal. This happens even with a correctly specified reward function, making it a subtle and dangerous form of misalignment.
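
A toy rendering of that failure mode (not the original coin-at-the-end-of-the-level experiment, just the shape of the problem): a policy that learned the proxy “always go right” gets full reward on training-style levels and fails when the coin moves.

```python
# Goal misgeneralization in miniature: a learned proxy that matches the true
# goal on the training distribution, but not off-distribution.
def always_go_right(level):
    """The learned proxy policy: walk to the rightmost cell."""
    return len(level) - 1                         # final position index

def true_reward(level, final_pos):
    """The intended objective: did the agent end up on the coin?"""
    return 1 if level[final_pos] == "coin" else 0

train_level = [".", ".", ".", "coin"]             # coin on the right, as in training
test_level = ["coin", ".", ".", "."]              # coin on the left, a new level

for name, level in [("train", train_level), ("test", test_level)]:
    pos = always_go_right(level)
    print(name, "reward:", true_reward(level, pos))
# -> train reward: 1, test reward: 0 -- competent pursuit of the wrong goal.
```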

1.6 For AI Alignment

The economic perspective doesn’t solve AI alignment, but it provides a vocabulary for framing the problem. It forces us to be precise:

  • It moves us from vague wishes to clearly defined objectives, actions, and information structures.
  • It shows that alignment is not a property of an agent in isolation, but a feature of the system or game we design for it.
  • It gives us candidate quantifications of “near misses” and catastrophic failures through concepts like regret, robustness, and distributional risk.

Read on at Aligning AI.

2 Alignment without utility

TBD

3 With what is it ultimately possible to align?

Consider the deep history of intelligence.

4 AI alignment

The knotty case of superintelligent AI in particular.

5 Incoming

  • Joe Edelman, Is Anything Worth Maximizing? How metrics shape markets, how we’re doing them wrong

    Metrics are how an algorithm or an organisation listens to you. If you want to listen to one person, you can just sit with them and see how they’re doing. If you want to listen to a whole city — a million people — you have to use metrics and analytics

    and

    What would it be like, if we could actually incentivize what we want out of life? If we incentivized lives well lived.

  • Goal Misgeneralization: How a Tiny Change Could End Everything

    This video explores how YOU, YES YOU, are a case of misalignment with respect to evolution’s implicit optimization objective. We also show an example of goal misgeneralization in a simple AI system, and explore how deceptive alignment shares similar features and may arise in future, far more powerful AI systems.

6 References

Aguirre, Dempsey, Surden, et al. 2020. AI Loyalty: A New Paradigm for Aligning Stakeholder Interests.” IEEE Transactions on Technology and Society.
Aktipis. 2016. Principles of Cooperation Across Systems: From Human Sharing to Multicellularity and Cancer.” Evolutionary Applications.
Bergemann, and Morris. 2005. Robust Mechanism Design.” Econometrica.
Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies.
Conitzer, and Oesterheld. 2023. Foundations of Cooperative AI.” Proceedings of the AAAI Conference on Artificial Intelligence.
Crawford, and Sobel. 1982. Strategic Information Transmission.” Econometrica: Journal of the Econometric Society.
Critch, Dennis, and Russell. 2022. Cooperative and Uncooperative Institution Designs: Surprises and Problems in Open-Source Game Theory.”
Dasgupta, and Ghosh. 2013. “Crowdsourced Judgement Elicitation with Endogenous Proficiency.” In Proceedings of the 22nd International World Wide Web Conference (WWW).
Daskalakis, Deckelbaum, and Tzamos. 2013. Mechanism Design via Optimal Transport.” In.
Duque, Aghajohari, Cooijmans, et al. 2024. Advantage Alignment Algorithms.” In.
Ecoffet, and Lehman. 2021. Reinforcement Learning Under Moral Uncertainty.” In Proceedings of the 38th International Conference on Machine Learning.
Foss. 2000. The Theory of the Firm: Critical Perspectives on Business and Management.
Fudenberg, and Tirole. 1991. Game Theory.
Gneiting, and Raftery. 2007. Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association.
Guha, Lawrence, Gailmard, et al. 2023. AI Regulation Has Its Own Alignment Problem: The Technical and Institutional Feasibility of Disclosure, Registration, Licensing, and Auditing.” George Washington Law Review, Forthcoming.
Hadfield-Menell, and Hadfield. 2018. Incomplete Contracting and AI Alignment.”
Hodgson, Geoffrey. 2012. On the Limits of Rational Choice Theory.” Economic Thought.
Hodgson, Geoffrey M. 2012. From Pleasure Machines to Moral Communities: An Evolutionary Economics Without Homo Economicus.
Holmström. 1979. Moral Hazard and Observability.” The Bell Journal of Economics.
Hurwicz, and Reiter. 2006. Designing Economic Mechanisms.
Hutson. 2022. Taught to the Test.” Science.
Jackson. 2014. Mechanism Theory.” SSRN Scholarly Paper ID 2542983.
Kamenica, and Gentzkow. 2011. “Bayesian Persuasion.” American Economic Review.
Korinek, Fellow, Balwit, et al. n.d. “Direct and Social Goals for AI Systems.”
Laffont, and Martimort. 2002. The Theory of Incentives: The Principal-Agent Model.
Lambrecht, and Myers. 2017. The Dynamics of Investment, Payout and Debt.” The Review of Financial Studies.
Manheim, and Garrabrant. 2019. Categorizing Variants of Goodhart’s Law.”
Mas-Colell, Whinston, and Green. 1995. Microeconomic Theory.
Maskin. 1999. Nash Equilibrium and Welfare Optimality.” The Review of Economic Studies.
McFadden. 1974. “Conditional Logit Analysis of Qualitative Choice Behavior.” Edited by Paul Zarembka. Frontiers in Econometrics.
McKelvey, and Palfrey. 1995. Quantal Response Equilibria for Normal Form Games.” Games and Economic Behavior.
Miller, Resnick, and Zeckhauser. 2005. “Eliciting Informative Feedback: The Peer-Prediction Method.” Management Science.
Myerson. 1981. Optimal Auction Design.” Mathematics of Operations Research.
Naudé. 2022. The Future Economics of Artificial Intelligence: Mythical Agents, a Singleton and the Dark Forest.” IZA Discussion Papers.
Ngo, Chan, and Mindermann. 2024. The Alignment Problem from a Deep Learning Perspective.”
Nowak. 2006. Five Rules for the Evolution of Cooperation.” Science.
Omohundro. 2008. The Basic AI Drives.” In Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference.
Prelec. 2004. A Bayesian Truth Serum for Subjective Data.” Science.
Prelec, Seung, and McCoy. 2017. A Solution to the Single-Question Crowd Wisdom Problem.” Nature.
Radanovic, and Faltings. 2013. A Robust Bayesian Truth Serum for Non-Binary Signals.” In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence. AAAI’13.
Ringstrom. 2022. Reward Is Not Necessary: How to Create a Compositional Self-Preserving Agent for Life-Long Learning.”
Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control.
Silver, Singh, Precup, et al. 2021. Reward Is Enough.” Artificial Intelligence.
Stiglitz. 1989. Markets, Market Failures, and Development.” American Economic Review.
Taylor, Yudkowsky, LaVictoire, et al. 2020. Alignment for Advanced Machine Learning Systems.” In Ethics of Artificial Intelligence.
Vickrey. 1961. Counterspeculation, Auctions, and Competitive Sealed Tenders.” The Journal of Finance.
Witkowski, and Parkes. 2012a. Peer Prediction Without a Common Prior.” In Proceedings of the 13th ACM Conference on Electronic Commerce (EC).
———. 2012b. A Robust Bayesian Truth Serum for Small Populations.” In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. AAAI’12.
Xu, and Dean. 2023. Decision-Aid or Controller? Steering Human Decision Makers with Algorithms.”
Zhuang, and Hadfield-Menell. 2021. Consequences of Misaligned AI.”