Intrinsic motivation
Do agents learn to want freedom? Do they learn to want to learn? Do they learn to goof off? etc
2022-11-27 — 2026-04-10
Wherein several formalisms for self-generated learning signals are surveyed, and each is found upon inspection to reduce to a rearrangement of mutual information between past experience and future states.
Intrinsic motivations in machine learning are alternatives to the default process of handing the agent a reward function which encodes what we want. Instead of relying on a sparse external reward signal, an intrinsically motivated agent manufactures its own incentives to act. These are heuristics for “what’s worth doing” when the task is unclear, the reward is delayed, or there may not be any reward at all.
Why care? Several reasons. Maybe we want to devise an open-ended learning algorithm that doesn’t stall the moment it runs out of curriculum. Maybe we want to understand what “interesting” means formally — why do babies poke at things, why do scientists run experiments, and can we get a robot to do something similar? Maybe we are worried about AI safety, and want to know what an agent will do when left to its own devices — because the answer turns out to involve power-seeking in ways that should concern us. Or maybe we’ve noticed that the explore/exploit trade-off in adaptive experiment design is suspiciously similar to curiosity, and we want to know if the same maths is hiding underneath.
It turns out there are several formalizations of intrinsic motivation in the literature, and most of them boil down to choosing which information-theoretic quantity to optimise:
- Empowerment: In the technical sense, maximise action–future mutual information. In the metaphorical sense, keep your world malleable and avoid dead ends.
- Curiosity / novelty: seek out states that reduce uncertainty or maximise prediction error (Schmidhuber 2010; Du et al. 2023). This is the “learn what you don’t know yet” drive.
- Quality–diversity / novelty search: abandon extrinsic benchmarks altogether and reward the discovery of new behaviours, regardless of “performance” (Lehman and Stanley 2011). (Hmm, how is this “novelty search” different from the previous “curiosity / novelty”?)
- Play: generate behaviours with no immediate external payoff, but which enrich the agent’s behavioural repertoire and skill base. In humans, play scaffolds learning. In agents, it can be a way of stumbling into competence.
- Interactivity: maximise the algorithmic information of future behaviour conditioned on past experience (Lewandowski et al. 2025).
- Just stay alive: This is part of what evolution seems to do somehow.
All of these function as internal reward surrogates. They are not tied to a final task, but they shape the learning trajectory so that when tasks arrive, the agent is already robust, exploratory, and resourceful.
This makes intrinsic motivation a step between the two paradigms: optimizers with a fixed loss, and replicators with no fixed loss but an imperative to persist. Intrinsic drives seem “messy” in the same way life is messy: they don’t guarantee that the agent is always doing the right thing, and are a noisy proxy for “good things”.
What follows is my attempt to sketch the major formalisations. The landscape is large but — spoiler alert — most of these proposals boil down to a handful of information-theoretic quantities, applied in slightly different places. I don’t claim to be exhaustive here; I am just trying to give enough of the maths to orient yourself and enough of the intuition to know where to dig.
1 Curiosity as compression progress (Schmidhuber)
The OG formalisation is Schmidhuber’s theory of creativity and fun (Schmidhuber 2010), which has been evolving since 1990. The core idea: our agent maintains an adaptive world model (a predictor or compressor \(p\)) and gets intrinsic reward proportional to the improvement in that model’s performance.
Let \(C(p, h(\leq t))\) be the cost of predictor \(p\) evaluated on history \(h\) up to time \(t\) — say, the number of bits needed to encode the history under \(p\), so lower is better. The “intrinsic” reward at time \(t+1\) is
\[r_{\text{int}}(t+1) = f\bigl[C(p(t), h(\leq t+1)),\; C(p(t+1), h(\leq t+1))\bigr]\]
where \(f(a,b) = a - b\) is the simplest choice: how much better did the model get? The RL controller then maximises expected future intrinsic reward, i.e. expected future learning progress.
This neatly sidesteps the “white noise” trap that kills naive curiosity-as-surprise. An agent rewarded by raw prediction error will get stuck staring at a TV tuned to static — high surprise, zero learnability. But compression progress on white noise is zero, because the model can’t improve. So the agent is motivated to seek out data that is currently surprising and learnable — the sweet spot between boredom and confusion. AFAICT this is the phenomenon that the whole intrinsic motivation literature is trying to capture.
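Here is a minimal sketch of the mechanism — my own toy, not Schmidhuber’s implementation. The “model” is a Laplace-smoothed symbol-frequency estimator, \(C\) is the code length of the history in bits, and the intrinsic reward at each step is the drop in code length after one model update. A learnable biased stream yields much more total compression progress than incompressible noise:

```python
import math
import random

def history_bits(counts, history):
    """Code length (bits) of the whole history under a fixed Laplace model."""
    total = sum(counts)
    return sum(-math.log2((counts[s] + 1) / (total + 2)) for s in history)

def compression_progress(stream):
    """Per-step reward: C(p(t), h(<=t+1)) - C(p(t+1), h(<=t+1))."""
    counts = [0, 0]   # the "model": symbol frequencies seen so far
    history, rewards = [], []
    for s in stream:
        history.append(s)
        before = history_bits(counts, history)  # old model on new history
        counts[s] += 1                          # train the model one step
        after = history_bits(counts, history)   # improved model
        rewards.append(before - after)
    return rewards

random.seed(0)
biased = [1 if random.random() < 0.9 else 0 for _ in range(500)]  # learnable
noise = [random.randint(0, 1) for _ in range(500)]                # unlearnable
# Total learning progress: large on the structured stream, near zero on noise.
print(sum(compression_progress(biased)), sum(compression_progress(noise)))
```

The unlearnable stream starves the reward because the model cannot improve on it — which is exactly the anti-white-noise property.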
Schmidhuber also argues this mechanism explains aspects of humour, art, and scientific curiosity: the punch line of a joke is a moment of rapid compression progress, and a beautiful proof is one that suddenly makes a large body of facts more compressible. Whether you buy that as a full theory of aesthetics is up to you, but as a generative principle for exploration it is hard to beat.
There are several practical variants. In the earliest (1990) version, intrinsic reward is proportional to the prediction error of an RNN world model. A 1991 refinement rewards not the error itself but its first derivative — the change in prediction reliability, measured by a separate “confidence network.” A 1995 version uses the KL divergence between the predictor’s prior and posterior as the curiosity signal:
\[r_{\text{int}}(t) \propto D_{\text{KL}}\!\bigl[p(\cdot \mid h(\leq t)) \;\|\; p(\cdot \mid h(< t))\bigr]\]
which is just information gain — another measure of learning progress, and the connection to Huffman coding and saved bits is immediate.
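As a toy instance of that information-gain signal (a Beta–Bernoulli model of my choosing, not from Schmidhuber’s papers): the reward is the KL divergence between the posterior predictive after and before an observation.

```python
import math

def kl_bernoulli(p, q):
    """D_KL[Bern(p) || Bern(q)] in bits."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def info_gain(a, b, x):
    """KL from predictive-after to predictive-before, Beta(a, b) prior on a coin."""
    p_before = a / (a + b)
    a2, b2 = a + x, b + 1 - x
    p_after = a2 / (a2 + b2)
    return kl_bernoulli(p_after, p_before)

# Surprising data under a confident prior carries a lot of information;
# expected data under the same prior carries almost none.
print(info_gain(1, 9, 1))   # prior expects mostly 0s, we observe a 1
print(info_gain(9, 1, 1))   # prior expects 1s, we observe a 1
```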
Schmidhuber’s theory is a very Schmidhuber theory, which is to say: he did in fact come up with it first, but the early versions are a haphazard mess that took subsequent researchers a while to either distil or rediscover.
2 Predictive information (Bialek, Nemenman, Tishby)
This one comes from physics rather than AI. Bialek, Nemenman, and Tishby (2001b) define the predictive information \(I_{\text{pred}}(T)\) — the mutual information between the past (a window of duration \(T\)) and the entire future of a time series:
\[I_{\text{pred}}(T) = I(x_{\text{past}};\, x_{\text{future}}) = S(T) + S(T') - S(T + T')\]
where \(S(T)\) is the entropy of observations over a window of length \(T\), and we take \(T' \to \infty\).
The trick is that entropy is extensive (\(S(T) \approx S_0 T\) for large \(T\)), so the predictive information is entirely determined by the “subextensive” corrections \(S_1(T) = S(T) - S_0 T\). Predictability is a deviation from extensivity. Most of the information we collect over time is irrelevant to prediction — a law of diminishing returns for observation.
The growth rate of \(I_{\text{pred}}(T)\) classifies the complexity of the underlying process. For a process with a finite number of learnable parameters, \(I_{\text{pred}}(T) \sim \frac{d}{2}\log T\) where \(d\) is the effective model dimension. For nonparametric processes (continuous functions with smoothness constraints), you get power-law growth \(I_{\text{pred}}(T) \sim T^\alpha\) with \(0 < \alpha < 1\). These are different “universality classes” of learnability, in a stat-mech, or comp mech sense of the word.
Now! What does this buy us for intrinsic motivation? Well, if an agent wants to seek out interesting environments, it might choose those where \(I_{\text{pred}}\) is neither zero (boring, fully predictable) nor maximal (random noise), but growing at an intermediate rate — environments rich enough to keep learning from, but structured enough that learning actually works. This is an information-theoretic version of Schmidhuber’s “sweet spot,” arrived at from a completely different direction.
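To make that sweet spot concrete with the simplest possible process: a symmetric two-state Markov chain with stay-probability \(q\). (This computes only the one-step mutual information, a crude stand-in for the full \(I_{\text{pred}}(T)\), which involves whole windows.)

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def predictive_info_markov(q):
    """One-step predictive information I(X_t; X_{t+1}), in bits, for a
    symmetric two-state chain that stays in its current state with prob q.
    The stationary distribution is uniform, so H(X_{t+1}) = 1 bit and
    H(X_{t+1} | X_t) = h2(q), giving I = 1 - h2(q)."""
    return 1.0 - h2(q)

for q in (0.5, 0.75, 0.95, 1.0):
    print(q, predictive_info_markov(q))
# q = 0.5 is iid noise (nothing to predict); q = 1.0 is frozen (nothing
# left to learn after one observation); the interesting regime is between.
```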
3 Information-theoretic curiosity in RL (Still and Precup)
Still and Precup (2012) bring predictive information directly into the RL objective.
Standard MDP setup: agent observes states \(x_t \in \mathbf{X}\), takes actions \(a_t \in \mathbf{A}\), accumulates discounted reward. A policy \(\pi(a|x)\) has an associated action-value function \(Q^\pi(x,a)\) and an expected return \(V^\pi\). So far, totally normal.
The idiosyncratic move is to add a complexity penalty on the policy itself. They view the policy as a lossy compression of the state into actions (rate-distortion theory), and penalise the mutual information \(I^\pi(A, X)\) between actions and states:
\[\min_\pi\; I^\pi(A, X) \quad \text{subject to}\quad V^\pi = \text{const.}\]
Among all policies achieving the same expected return, prefer the simplest one — the one that uses the least information about the state to choose its actions. The solution falls out as a Boltzmann-style policy:
\[\pi_{\text{opt}}(a \mid x) = \frac{p^\pi(a)}{Z(x)} \exp\!\Bigl[\tfrac{1}{\lambda} Q^\pi(x,a)\Bigr]\]
where \(\lambda\) is a temperature parameter trading off return against policy complexity, and \(p^\pi(a)\) is the marginal action distribution. This looks like standard Boltzmann exploration, but there is an extra “complexity penalty” term \(\log p^\pi(a)\) that favours actions the agent already tends to take.
Now, to get curiosity, they add a second objective: maximise the predictive power of the agent’s behaviour, measured as the mutual information between the current state-action pair and the next state, \(I(\{X_t, A_t\}; X_{t+1})\). The optimal policy becomes:
\[\pi_{\text{opt}}(a \mid x) \propto p^\pi(a) \exp\!\Bigl[\tfrac{1}{\lambda}\bigl(D_{\text{KL}}[p(X_{t+1}|x,a) \| p^\pi(X_{t+1})] + \alpha\, Q^\pi(x,a)\bigr)\Bigr]\]
The first term in the exponent drives exploration: prefer actions whose consequences are maximally informative about the next state, i.e. actions that push the transition distribution far from the average. The second term drives exploitation as usual. The parameter \(\alpha\) controls how hungry the agent is for extrinsic reward versus curiosity — Still and Precup suggest you could set it by the robot’s battery level, which is a nice touch.
So that’s cool. Exploration vs. exploitation emerges from the optimisation rather than being bolted on as an \(\epsilon\)-greedy hack or whatever. Even as \(\lambda \to 0\) (deterministic policy), the agent still balances both drives.
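The self-consistent equations can be solved by fixed-point iteration on a toy MDP. Everything below is made up for illustration — the transition model and \(Q\)-values are hand-picked, and the state-visitation distribution \(\rho\) is assumed uniform rather than derived from the policy:

```python
import math

# Toy two-state, two-action world. Transitions and Q-values are invented;
# rho (state visitation) is assumed uniform for simplicity.
P = {  # P[s][a] = distribution over next states
    0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
    1: {0: [0.5, 0.5], 1: [0.1, 0.9]},
}
Q = {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
rho = [0.5, 0.5]
lam, alpha = 1.0, 1.0   # temperature; curiosity vs. extrinsic-reward weight

def kl(p, q):
    """KL divergence (nats) between two discrete distributions."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def still_precup_policy(n_iter=100):
    pi = {s: [0.5, 0.5] for s in P}   # start from the uniform policy
    for _ in range(n_iter):
        # marginals induced by the current policy
        p_a = [sum(rho[s] * pi[s][a] for s in P) for a in (0, 1)]
        p_next = [sum(rho[s] * pi[s][a] * P[s][a][s2]
                      for s in P for a in (0, 1)) for s2 in (0, 1)]
        # self-consistent update: pi(a|x) ∝ p(a) exp[(KL + alpha*Q)/lam]
        for s in P:
            w = [p_a[a] * math.exp((kl(P[s][a], p_next) + alpha * Q[s][a]) / lam)
                 for a in (0, 1)]
            z = sum(w)
            pi[s] = [wi / z for wi in w]
    return pi

pi = still_precup_policy()
print(pi[0], pi[1])
```

With these numbers both the KL term and the \(Q\) term favour the “informative” action in each state, so the converged policy concentrates on it — without ever becoming fully deterministic, because the marginal \(p^\pi(a)\) factor keeps it soft.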
4 Intrinsically motivated RL via salient events (Singh, Barto, Chentanez)
This one (Chentanez, Barto, and Singh 2004) comes from more of a developmental-psychology direction. Instead of information-theoretic quantities, Singh, Barto, and Chentanez start from the observation that animals find certain events intrinsically salient — unexpected changes in light, sound, or tactile sensation trigger phasic dopamine responses that diminish with familiarity. Toddlers do this; they poke at things until the poking gets boring, then they find something new to poke at.
Their agent operates in the options framework (semi-Markov decision processes). When it first encounters a salient event \(e\) — say, a light turning on — it creates an option \(o_e\): a temporally extended skill for reproducing that event. The intrinsic reward for salient event \(e\) at time \(t+1\) is:
\[r^i_{t+1} = \tau\bigl[1 - P^{o_e}(s_{t+1} | s_t)\bigr]\]
where \(P^{o_e}\) is the learned option model’s prediction of reaching the salient state, and \(\tau\) is a scaling constant. So reward is proportional to the prediction error of the learned skill model: novel events yield high intrinsic reward, which fades as the option model improves and the event becomes predictable. The agent gets bored.
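A minimal sketch of the boredom dynamics. The real system learns option models via intra-option learning; here I fake the model with an exponentially converging success estimate (the learning rate is an assumed value):

```python
tau = 1.0     # scaling constant
lr = 0.2      # option-model learning rate (assumed, for illustration)
P_hat = 0.0   # option model's predicted probability of reproducing the event
rewards = []
for trial in range(20):
    r_int = tau * (1.0 - P_hat)   # r^i = tau * [1 - P^{o_e}]
    rewards.append(r_int)
    # the skill reliably succeeds, the model converges, the agent gets bored
    P_hat += lr * (1.0 - P_hat)
print(rewards[:3], rewards[-1])
```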
Architecturally, the agent uses intrinsic reward to update a behaviour action-value function \(Q_B\) via Q-learning, in parallel with updating the individual option policies \(Q^o\) via intra-option learning. Intrinsic reward drives skill acquisition (which options to practise); extrinsic reward drives task performance (which options to deploy). The two reward streams are additive but architecturally separated.
They illustrate this with some intuitive “playroom” experiments. The agent discovers a developmental sequence: it first masters simple skills (light on, light off), gets bored with them as intrinsic reward diminishes, then moves on to harder compound skills (turning on music requires light and finding and pressing a block). The hierarchy self-organises from the interaction of curiosity and prediction improvement, without any curriculum. And when an extrinsic task finally arrives (make the toy monkey cry — a 14-step procedure!), the intrinsically motivated agent solves it dramatically faster than a purely extrinsic learner, because it has already assembled the prerequisite skill library. It has been dicking around productively, as toddlers do.
Whilst I do not find the formalism here very satisfying, I find the resulting agents intensely personally relatable.
5 Novelty search (Lehman and Stanley)
Novelty search (Lehman and Stanley 2011) takes a radical position: throw out the objective function entirely. (Down with utility!!) Instead of rewarding fitness, reward behavioural novelty. Define a behaviour characterisation \(b(\theta)\) for each individual \(\theta\) in an evolutionary population, and maintain an archive \(\mathcal{A}\) of previously encountered behaviours. The novelty score is the average distance to the \(k\)-nearest neighbours in the archive:
\[\rho(b) = \frac{1}{k}\sum_{i=1}^{k} \|b - b_i\|\]
where \(b_1, \dots, b_k\) are the \(k\) nearest archived behaviours. Selection pressure pushes the population to keep exploring new regions of behaviour space (not genotype space, nor fitness space).
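The archive mechanism fits in a few lines. Here the behaviour characterisation is a single scalar for simplicity (real applications use something like a robot’s final position):

```python
def novelty(b, archive, k=3):
    """Average distance to the k nearest behaviours in the archive."""
    if not archive:
        return float("inf")
    dists = sorted(abs(b - bi) for bi in archive)
    return sum(dists[:k]) / min(k, len(dists))

archive = [0.0, 0.1, 0.2, 5.0]
# A behaviour inside the crowded region scores low; a frontier one scores high.
print(novelty(0.15, archive), novelty(10.0, archive))
```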
This sounds absurd — how can you solve problems without trying to? — and I think it is a bit absurd, at least in the general case. But in deceptive fitness landscapes, pursuing the objective directly leads you into local optima; novelty search escapes deception because it doesn’t care about the objective at all; it just keeps expanding the frontier of what’s been tried. And empirically, in maze navigation and other deceptive domains, novelty search often finds the goal faster than objective-driven search, precisely because the goal is reachable only via stepping stones that don’t look like progress.
Q: how is this “novelty search” different from the curiosity drive? I think the answer is that it operates at the population level rather than within a single agent’s lifetime. The “reward” is behavioural diversity, which is an evolutionary analogue of the compression-progress idea — except that the “model” being improved is the archive’s coverage of behaviour space.
6 Empowerment
I have a whole page on empowerment which is a huge topic in itself. tl;dr: Empowerment is the channel capacity between an agent’s actions and its future sensory states,
\[\mathfrak{E}(s) = \max_{p(a)} I(A;\, S')\]
where the max is over action distributions and the mutual information measures how much the agent can influence its own future. An empowerment-maximising agent keeps its options open: it avoids dead ends, gravitates toward states with many reachable futures, and tends to acquire “power” in the instrumental-convergence sense (Turner et al. 2021; Omohundro 2018). This is maybe the flavour of intrinsic motivation that AI safety people worry about most.
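Since empowerment is a channel capacity, it can be computed by Blahut–Arimoto iteration. A sketch with a two-action, two-state transition channel (toy numbers of mine):

```python
import math

def empowerment(channel, n_iter=200):
    """Channel capacity max_{p(a)} I(A; S') via Blahut-Arimoto, in bits.
    channel[a] is the distribution over next states given action a."""
    n_a = len(channel)
    n_s = len(channel[0])
    p_a = [1.0 / n_a] * n_a
    for _ in range(n_iter):
        # marginal over next states under the current action distribution
        p_s = [sum(p_a[a] * channel[a][s] for a in range(n_a)) for s in range(n_s)]
        # BA update: p(a) ∝ p(a) * exp( KL[p(s'|a) || p(s')] )
        w = [p_a[a] * math.exp(sum(c * math.log(c / p_s[s])
                                   for s, c in enumerate(channel[a]) if c > 0))
             for a in range(n_a)]
        z = sum(w)
        p_a = [wi / z for wi in w]
    p_s = [sum(p_a[a] * channel[a][s] for a in range(n_a)) for s in range(n_s)]
    return sum(p_a[a] * sum(c * math.log2(c / p_s[s])
                            for s, c in enumerate(channel[a]) if c > 0)
               for a in range(n_a))

# In a "dead end" both actions lead to the same place: zero empowerment.
dead_end = [[1.0, 0.0], [1.0, 0.0]]
# With perfectly distinguishable consequences the agent controls one full bit.
open_room = [[1.0, 0.0], [0.0, 1.0]]
print(empowerment(dead_end), empowerment(open_room))
```

The dead-end channel scores zero because no choice of action distribution makes actions informative about the future — the “avoid dead ends” behaviour falls straight out of the definition.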
7 Interactivity
Lewandowski et al. (2025) propose interactivity as an intrinsic motivation objective that swaps Shannon information for algorithmic (Kolmogorov) information:
Interactivity is similar to previously considered intrinsic motivation objectives (Chentanez, Barto, and Singh 2004; Schmidhuber 2010), and specifically predictive information (Bialek, Nemenman, and Tishby 2001b; Still and Precup 2012). However, interactivity uses algorithmic information rather than Shannon information, which can operate directly on individual sequences rather than requiring probability distributions. This sequence-based formulation provides a natural framework for continual adaptation, in which an agent’s behaviour is treated as an individual sequence.
The key quantity is the difference between the algorithmic complexity of future behaviour with and without conditioning on past experience. This sidesteps a real limitation of the Shannon-information approaches: they need well-defined probability distributions, which may not exist for a single agent living a single non-stationary lifetime. Algorithmic information operates directly on individual sequences, which makes it a more natural fit for continual learning. Whether this is practically computable is another question, but as a theoretical foundation I think it is heading in a good direction.
8 Active inference and free energy minimisation (Friston)
Karl Friston’s free energy principle looks, from a certain angle, like an intrinsic motivation theory with the boldest possible scope: all adaptive behaviour is (variational) free energy minimisation. I have reservations about some of the stronger claims here, but the connection to the above is real enough that it would be remiss to skip it.
An agent maintains a generative model \(p(\tilde{s}, \vartheta \mid m)\) of how sensory data \(\tilde{s}\) arise from hidden causes \(\vartheta\), and a recognition density \(q(\vartheta \mid \mu)\) parameterised by internal brain states \(\mu\). The variational free energy is
\[F = -\langle \ln p(\tilde{s}, \vartheta \mid m)\rangle_q + \langle \ln q(\vartheta \mid \mu)\rangle_q\]
which is the negative ELBO from variational Bayesian inference. Minimising \(F\) with respect to \(\mu\) (perception) tightens the approximate posterior. The unique selling point of active inference is that the agent can also minimise \(F\) with respect to actions \(a\) — by changing the world to match its predictions, rather than only changing its beliefs to match the world.
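The bound is easy to verify in a discrete toy model (my numbers): \(F\) always upper-bounds the surprise \(-\ln p(\tilde{s})\), with equality exactly when \(q\) is the true posterior.

```python
import math

# Discrete toy: hidden cause theta in {0,1}, observation s in {0,1}.
p_theta = [0.5, 0.5]                   # prior p(theta)
p_s_given = [[0.8, 0.2], [0.3, 0.7]]   # likelihood p(s|theta)

def free_energy(q, s):
    """F = E_q[ln q(theta)] - E_q[ln p(s, theta)]  (negative ELBO, nats)."""
    return sum(qt * (math.log(qt) - math.log(p_theta[t] * p_s_given[t][s]))
               for t, qt in enumerate(q) if qt > 0)

s = 1
surprise = -math.log(sum(p_theta[t] * p_s_given[t][s] for t in (0, 1)))
joint = [p_theta[t] * p_s_given[t][s] for t in (0, 1)]
z = sum(joint)
posterior = [j / z for j in joint]     # exact Bayes posterior p(theta|s)

# F upper-bounds surprise, with equality at the exact posterior:
print(free_energy([0.5, 0.5], s), free_energy(posterior, s), surprise)
```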
So where Schmidhuber’s agent seeks compression progress (surprise that becomes learnable), Friston’s agent seeks to minimise surprise outright. This, as many people have noticed, sounds like a recipe for the “dark room problem” — just go sit in a dark room where nothing surprising ever happens, and you’ll minimise free energy perfectly. Friston’s response, as far as I understand it, is something like this: the generative model encodes priors over expected sensory states (homeostatic setpoints, essentially), so a hungry agent predicts it will eat, and acts to make that prediction true. The “intrinsic motivation” is already baked into the prior.
I find this answer unsatisfying — it feels like it relocates the hard problem into the prior rather than solving it — but reasonable people disagree.
Schmidhuber pointed out the tension in Schmidhuber (2010): Friston’s agents want to suppress prediction error, while curiosity-driven agents want to seek it out (then reduce it through learning). In active inference, “perception tries to suppress prediction error by adjusting expectations […] while action tries to suppress prediction error by changing the signals being predicted.” This is stabilising, not exploring. A curious agent, by contrast, is motivated to leave the dark room because novel environments offer compression progress.
Hafner et al. (2022) attempt a reconciliation by showing that both action and perception can be cast as divergence minimisation — with different target distributions. Under their formulation, active inference and intrinsic motivation objectives like empowerment and information gain emerge as special cases of the same variational framework, differing only in which KL divergence we minimise and which distribution we hold fixed.
I need to read that paper more deeply. It sounds like the right way to think about it: the free energy principle is less a competing intrinsic motivation theory and more a very general variational language in which many of the other theories can be expressed. Whether it adds predictive power beyond that expressiveness is still debated.
9 What connects these all?
Most of these formalisations are rearrangements of the same few information-theoretic building blocks. Curiosity rewards maximise \(I(\text{past};\text{future})\) or its time derivative. Empowerment maximises \(I(\text{actions};\text{future states})\). Policy complexity penalties minimise \(I(\text{states};\text{actions})\). Novelty search maximises coverage in behaviour space, which is an implicit entropy maximisation. Free energy minimisation is the negative ELBO, i.e. it’s variational inference with aspirations.
The choice of which mutual information to maximise (or minimise, or differentiate) reflects different assumptions about what makes an agent good at learning. Schmidhuber’s agent wants to get better at predicting; Still and Precup’s agent wants its behaviour to be informative; empowerment-seeking agents want causal influence; novelty-searching populations want diversity. I don’t think these are really competing theories so much as different projections of a common intuition: an agent that can’t yet solve any particular problem should spend its time becoming the kind of agent that could solve many problems. Which is, come to think of it, also my career strategy.
10 Cousin formalism: Bayesian optimisation
If the explore/exploit trade-off above rings a bell from a completely different context, it should. Adaptive design of experiments (a.k.a. “Bayesian optimisation,” though I refuse to call it that) faces the same structural problem: you have an expensive-to-evaluate black-box function, a surrogate model (usually a Gaussian process), and you need to decide where to sample next. The acquisition function — expected improvement, upper confidence bound, entropy search, etc. — is doing exactly the work of an intrinsic motivation signal: it tells you which input is most worth evaluating, given what you already know.
The resemblance to curiosity-driven RL is right there in front of our eyes. In both cases, the agent is managing a posterior over a world model and choosing actions to maximise some information-theoretic quantity (information gain, expected model improvement, predictive variance reduction). Still and Precup’s KL-divergence curiosity term and the entropy-search acquisition function are practically the same object in different notation.
I thought I could summarise the differences compactly, but instead I persuaded myself that this is maybe slightly deeper than I thought it was. Bayesian optimisation typically assumes the phenomenon comes from some known family — often a GP prior, i.e. something like a smoothness assumption over an input space with known dimensionality. The goal is to find a specific optimum of a specific function. Intrinsic motivation, by contrast, is not trying to optimise any particular function; it is trying to produce an agent that is generically competent in an environment whose structure is largely unknown. There is no fixed acquisition target, and the “surrogate model” is the agent’s entire world model, which may have to deal with non-stationarity, partial observability, and its own actions changing the thing it is modelling. But what is “generic competence” anyway? Under some choice of world and competencies, I think these formalisms collapse into one another.
Let us ignore that for now. So: Bayesian optimisation is the well-behaved cousin who went to a good school and has a clear objective. Intrinsic motivation is the feral version of the same impulse, operating without the comforting assumption that you know what you are looking for. But the maths is close enough that insights flow in both directions, and I think there is unexploited territory in making the connection tighter.1
11 What’s missing: bodies
One thing that bugs me about all of the above: none of these formalisations have anything to say about the physical cost of curiosity. A real agent — biological or robotic — runs on a finite energy budget. You cannot maximise compression progress if you are starving. You cannot seek out novel states if your battery is flat. Every bit of mutual information you compute or action you explore dissipates heat and consumes free energy in the boring, thermodynamic sense.
This is something we would need to actually model to explain why real organisms have drives that compete with curiosity (hunger, fatigue, thermoregulation), and the interaction between those drives and the exploratory ones is arguably where most of the interesting behaviour comes from. A toddler that explored with no metabolic constraints would be a very different beast from an actual toddler, who explores in bursts between naps and snacks.
There are a few threads that start to connect the information-theoretic formalisations above to thermodynamic reality. Susanne Still (same person as the curiosity-in-RL paper) showed that predictive information has a direct thermodynamic cost: any system that predicts its future inputs can in principle reduce the thermodynamic dissipation from driving its states, but only up to a bound set by the mutual information between its internal states and the future (Still et al. 2012). So there is a physical exchange rate between prediction and energy expenditure, which is exactly the kind of thing you would need to bridge intrinsic motivation to metabolic constraints. Ortega and Braun (2011) take a related approach via bounded rationality, treating the KL divergence between a policy and a default (i.e. the “complexity penalty” in Still and Precup’s framework) as a computational or thermodynamic cost, yielding a free-energy-style objective where the temperature parameter literally controls how much work the agent can afford to do. But as far as I can tell, nobody has yet built a full intrinsic motivation framework where the agent’s curiosity drive is explicitly modulated by its energy budget in a thermodynamically principled way. The pieces are all there; someone should put them together. TODO.
12 References
Footnotes
1. The adaptive design of experiments page also notes the connection to RL but punts on the details. Someone should sort this out properly. TODO
