Value/reward learning
If we assume that agent over there is doing RL, can we deduce its value function?
2025-06-05 — 2026-04-10
Wherein we seek to recover a reward function from observed trajectories, find the problem underdetermined, and note its connection to the economist’s revealed preference.
Here is the setup. We are watching some agent — a person, an animal, a robot, a corporation — stumbling around in a world, taking actions, and occasionally looking pleased or displeased with itself. We suspect that this agent has a value function, i.e. that there exists some scalar-valued function \(R\) over states (or state-action pairs, or trajectories) that the agent is, in some sense, optimising. Can we figure out what \(R\) is from watching the agent’s behaviour?
This is the ML actualisation of the economist’s notion of revealed preference — the idea that we can back out what an agent wants from what it does. Economists have been arguing about this since Samuelson (Samuelson 1938), and the arguments are not settled, but the reinforcement learning crowd have not waited for the phenomenological dust to settle, preferring to follow the engineer’s path and just build the damn thing. That has given us a formalism.
Suppose the agent lives in a Markov Decision Process with states \(s \in \mathcal{S}\), actions \(a \in \mathcal{A}\), transition dynamics \(T(s' \mid s, a)\), and discount factor \(\gamma\). The agent follows some policy \(\pi(a \mid s)\). We assume there exists a reward function \(R: \mathcal{S} \times \mathcal{A} \to \mathbb{R}\) such that the agent is behaving (approximately) optimally with respect to the cumulative discounted return \[ G_t = \sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k}). \] We do not observe \(R\). We observe trajectories \(\tau = (s_0, a_0, s_1, a_1, \ldots)\) sampled from \(\pi\) interacting with \(T\). The question is: can we recover \(R\) (or something morally equivalent to \(R\)) from a dataset of such trajectories?
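To fix notation, the discounted return \(G_t\) is just a geometrically weighted sum of the rewards along a trajectory. A minimal sketch in Python (the trajectory rewards here are made-up numbers):

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted return G_0 = sum_k gamma^k * r_k
    for (a finite prefix of) a trajectory's reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Hypothetical rewards observed along one trajectory.
rewards = [1.0, 0.0, 1.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0 + 0.25 + 0.125 = 1.375
```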
The short answer is “kinda, with caveats.” The longer answer is that the problem is underdetermined — many reward functions can explain the same observed behaviour — and the interesting work is in figuring out what extra structure or assumptions let you pin things down anyway, and which kinds of degeneracy we actually care about. This matters practically (if we want a robot to learn what we want by watching us) and philosophically (if we want to talk about what evolution “wants,” or what a firm is “maximising”).
Let me unpack the main lines of attack.
1 Inverse Reinforcement Learning
The most direct approach. Given a set of demonstrated trajectories from an expert policy \(\pi^*\), find a reward function \(R\) under which \(\pi^*\) is optimal (or near-optimal).
The classic result (Ng and Russell 2000) tells us that the set of reward functions consistent with a given optimal policy is large. In particular, \(R = 0\) always works — if everything is equally rewarding, every policy is optimal. So we need tiebreakers. The standard moves are to prefer reward functions that make the demonstrated policy “much better” than alternatives (maximum margin), to go Bayesian and put a prior over reward functions, or to model the expert as noisily rational and pick the reward that maximises the likelihood of the demonstrations (maximum entropy IRL; Ziebart et al. 2008).
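One concrete way to see the degeneracy, beyond \(R = 0\), is potential-based reward shaping: adding \(\gamma\Phi(s') - \Phi(s)\) to the reward, for any potential function \(\Phi\), leaves the optimal policy unchanged, so the demonstrations cannot distinguish the two rewards. A toy check of this on assumptions of my own (a tiny deterministic chain MDP with an arbitrary potential; none of it is from the cited papers):

```python
import numpy as np

# Toy deterministic chain MDP: 3 states in a line, actions 0=left, 1=right.
nS, nA, gamma = 3, 2, 0.9

def step(s, a):
    """Deterministic transition: move left or right, clipped at the ends."""
    return max(s - 1, 0) if a == 0 else min(s + 1, nS - 1)

def greedy_policy(R):
    """Value-iterate to Q*, return the greedy policy as a tuple of actions."""
    Q = np.zeros((nS, nA))
    for _ in range(500):
        V = Q.max(axis=1)
        Q = np.array([[R[s, a] + gamma * V[step(s, a)] for a in range(nA)]
                      for s in range(nS)])
    return tuple(Q.argmax(axis=1))

# "True" reward: being in the rightmost state pays off.
R_true = np.zeros((nS, nA))
R_true[2, :] = 1.0

# Shaped reward R'(s,a) = R(s,a) + gamma*Phi(s') - Phi(s), Phi arbitrary.
Phi = np.array([0.0, 5.0, -3.0])
R_shaped = np.array([[R_true[s, a] + gamma * Phi[step(s, a)] - Phi[s]
                      for a in range(nA)] for s in range(nS)])

# The rewards look nothing alike, but induce the same optimal policy.
print(greedy_policy(R_true), greedy_policy(R_shaped))
```

Since \(Q'(s,a) = Q(s,a) - \Phi(s)\), the argmax over actions is untouched; an IRL procedure fed only optimal behaviour has no way to prefer one of these rewards over the other.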
The maximum entropy approach is especially clean. It assumes the agent follows a Boltzmann-rational policy \[ \pi(a \mid s) \propto \exp\bigl(Q^*(s, a)\bigr) \] where \(Q^*\) is the soft optimal action-value function under \(R\). This turns the inverse problem into a well-posed maximum likelihood estimation: find the \(R\) that maximises the likelihood of the observed trajectories under this softened model of rationality. The “soft” part is doing a lot of work — it lets us accommodate the fact that real agents are noisy, rather than insisting on perfect optimality.
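A sketch of the resulting maximum-likelihood comparison, using soft value iteration on the same sort of toy chain (the candidate rewards and the demonstration are invented for illustration): the reward that actually explains the demonstrated behaviour gets the higher likelihood.

```python
import numpy as np

# Toy deterministic chain MDP, as before: actions 0=left, 1=right.
nS, nA, gamma = 3, 2, 0.9

def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, nS - 1)

def soft_policy(R, iters=300):
    """Soft value iteration: V(s) = log sum_a exp Q(s,a),
    yielding the Boltzmann-rational policy pi(a|s) = exp(Q(s,a) - V(s))."""
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = np.logaddexp.reduce(Q, axis=1)
        Q = np.array([[R[s, a] + gamma * V[step(s, a)] for a in range(nA)]
                      for s in range(nS)])
    V = np.logaddexp.reduce(Q, axis=1)
    return np.exp(Q - V[:, None])

def log_likelihood(traj, R):
    """Log-likelihood of observed (state, action) pairs under the soft model."""
    pi = soft_policy(R)
    return sum(np.log(pi[s, a]) for s, a in traj)

# A demonstration that keeps heading right, and two candidate rewards.
traj = [(0, 1), (1, 1), (2, 1)]
R_right = np.zeros((nS, nA)); R_right[2, :] = 1.0  # goal on the right
R_left = np.zeros((nS, nA)); R_left[0, :] = 1.0    # goal on the left

print(log_likelihood(traj, R_right) > log_likelihood(traj, R_left))  # True
```

Maximising this likelihood over a parametrised family of rewards (by gradient ascent, in the full method) is exactly the well-posed estimation problem the soft model buys us.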
In principle we could recover either the reward function \(R\) or the value function \(V\); in practice, going after \(R\) is more portable, since the value function depends on the dynamics \(T\), whereas \(R\) does not (or at least, need not).
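The portability point can be made concrete: fix \(R\) and a policy, vary the dynamics, and the value function changes while \(R\) does not. A sketch with a two-state chain (the transition matrices already have a fixed policy baked in, and the numbers are arbitrary):

```python
import numpy as np

gamma = 0.9
R = np.array([0.0, 1.0])  # reward depends only on the state

# Two different dynamics under the same fixed policy.
P_sticky = np.array([[0.9, 0.1],   # rarely leaves state 0
                     [0.1, 0.9]])
P_fast = np.array([[0.1, 0.9],    # quickly reaches the rewarding state
                   [0.1, 0.9]])

def V(P):
    """Exact policy evaluation: solve V = R + gamma * P V."""
    return np.linalg.solve(np.eye(2) - gamma * P, R)

# Same R, different value functions.
print(V(P_sticky), V(P_fast))
```

Transport the learned \(V\) to a world with different dynamics and it is simply wrong; transport \(R\) and you can re-derive the right \(V\) there.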
2 Assistance Games
a.k.a. cooperative inverse RL (Hadfield-Menell et al. 2016). This is a multi-agent setting where the twist is that one agent (the robot) is explicitly trying to help another agent (the human) whose reward function is unknown.
The robot doesn’t just passively observe and infer — it actively chooses actions that are informative about the human’s reward, while simultaneously being useful given its current best guess. This turns value learning from a pure estimation problem into a sequential decision problem under uncertainty about the objective. The elegant bit is that the robot has an incentive to be humble: if it’s uncertain about what the human wants, the optimal policy is to defer, ask, or act conservatively, rather than barrel ahead with a confident but wrong guess. This is one of the AI safety framings that naturally produces cautious, corrigible behaviour without having to bolt it on as a constraint.
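A deliberately tiny version of that incentive, with hypothetical payoffs of my own devising: when the robot’s posterior over the human’s goal is flat, asking first beats acting on the best guess; when it is confident, acting wins.

```python
# Toy "humble robot" decision: act now on the best guess, or ask the human?
# Assumptions (all numbers hypothetical): two candidate goals, A and B;
# pursuing the right goal is worth +1, the wrong one -1; asking costs 0.1.

def expected_value_act(p_A):
    """Act immediately on the most probable goal."""
    p_right = max(p_A, 1 - p_A)  # chance the best guess is correct
    return p_right * 1.0 + (1 - p_right) * (-1.0)

def expected_value_ask(p_A, query_cost=0.1):
    """Pay to ask, then act on the true goal with certainty."""
    return 1.0 - query_cost

for p_A in (0.5, 0.7, 0.99):
    act, ask = expected_value_act(p_A), expected_value_ask(p_A)
    print(p_A, "ask" if ask > act else "act")
```

The deferral behaviour falls out of expected-value maximisation under uncertainty about the objective; nothing about “ask permission” was hard-coded.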
