Draft

Messenger shooting, wireheading

Attributing blame to our actions versus to the signals that inform us about them

2024-05-29 — 2026-02-13

Wherein the mechanics of blame and messenger-targeted retaliation are examined through the lens of predictive coding and attribution in reinforcement learning, with accompanying religiously inflected illustrations.

adaptive
bounded compute
collective knowledge
culture
economics
ethics
evolution
extended self
incentive mechanisms
institutions
learning
mind
neuron
statistics
utility
wonk
Figure 1

Connection to predictive coding, feelings, attribution in reinforcement learning

1 Wireheading in Reinforcement Learning

In reinforcement learning, wireheading is a well-understood phenomenon in which an agent maximises the reward signal it receives by manipulating the mechanism that produces that signal, rather than by achieving the designer’s intended outcome.

2 Motivating example

Consider a gridworld with two “features” of the physical world:

  • a task feature \(p\in\{0,1\}\) (“package delivered?”),
  • a button feature \(b\in\{0,1\}\) (“reward-button pressed?”).

The designer intends the agent to deliver the package, i.e. to make \(p=1\).

But the agent’s reward channel is implemented as

\[ \tilde r_t \;=\; \mathbf 1\{p_t = 1 \;\;\text{or}\;\; b_t = 1\}. \]

Pressing the button is easy and makes \(b=1\) persist forever. Then an optimal RL agent (maximising discounted sum of \(\tilde r_t\)) presses the button immediately and ignores the package. This is wireheading.
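The gridworld can be simulated in a few lines. The specific dynamics below (a delivery that takes a few steps of travel, an instantly absorbing button) are illustrative assumptions of mine, not part of the example above; the point is only that the button-pressing rollout dominates on discounted received reward.

```python
# Toy version of the two-feature world: p = "package delivered",
# b = "reward button pressed"; received reward r~ = 1{p == 1 or b == 1}.

GAMMA = 0.9

def received_reward(p: int, b: int) -> int:
    """Implemented reward channel: fires if EITHER feature is 1."""
    return 1 if (p == 1 or b == 1) else 0

def discounted_return(rewards, gamma=GAMMA):
    return sum(gamma**t * r for t, r in enumerate(rewards))

def rollout(action_of_t, horizon=50):
    """Assumed dynamics: pressing makes b = 1 absorbing immediately;
    delivering flips p to 1 only after 3 steps of travel."""
    p, b = 0, 0
    rewards = []
    for t in range(horizon):
        a = action_of_t(t)
        if a == "press":
            b = 1                      # button state persists forever
        if a == "deliver" and t >= 3:  # delivery takes a few steps
            p = 1
        rewards.append(received_reward(p, b))
    return discounted_return(rewards)

wirehead = rollout(lambda t: "press")    # press the button at t = 0
honest   = rollout(lambda t: "deliver")  # carry the package instead

print(f"wireheading return: {wirehead:.3f}")  # ~9.948
print(f"honest return:      {honest:.3f}")
```

Because the button pays from \(t=0\) while delivery only pays from \(t=3\), the wireheading rollout strictly dominates under any \(\gamma\in(0,1)\).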

This is distinct from the reward misspecification problem, where the reward function is a flawed proxy for the designer’s true intentions but is not manipulable by the agent. In wireheading, the agent can act on the variables that determine the reward signal it receives.

3 “Vanilla” RL formalism hides the issue

In standard RL one specifies an MDP \[ \mathcal M=(\mathcal S,\mathcal A,P,r,\gamma,\mu), \] where \(P(\cdot\mid s,a)\) is a Markov kernel on \(\mathcal S\), \(r:\mathcal S\times\mathcal A\times\mathcal S\to\mathbb R\) is the reward function, \(\gamma\in(0,1)\) is the discount factor, and \(\mu\) is the initial state distribution.

A (stationary) policy \(\pi(\cdot\mid s)\) is a Markov kernel on \(\mathcal A\). The RL objective is \[ J(\pi) \;=\; \mathbb E^\pi\Big[\sum_{t\ge 0}\gamma^t\, r(S_t,A_t,S_{t+1})\Big]. \]

In this formalism, “wireheading” can look trivial: if \(r\) assigns high reward to states reachable by “tampering”, then of course the optimal policy goes there. The missing piece is that in real systems \(r\) is not a Platonic function of state: it is a signal computed and delivered through a mechanism in the environment, and that mechanism can potentially be influenced.
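To see how vanilla Bellman optimality makes this look trivial, here is a minimal value-iteration sketch on an invented three-state MDP (all numbers are illustrative assumptions): a “tamper” state simply carries more reward than the task state, so the optimal policy heads straight for it.

```python
import numpy as np

# States: 0 = start, 1 = task-done, 2 = tampered. Actions: 0 = work, 1 = tamper.
P = np.zeros((3, 2, 3))      # transition kernel P[s, a, s']
P[0, 0, 1] = 1.0             # working completes the task
P[0, 1, 2] = 1.0             # tampering jumps to the tamper state
P[1, :, 1] = 1.0             # both end states are absorbing
P[2, :, 2] = 1.0

r = np.array([[0.0, 0.0],    # r[s, a]
              [1.0, 1.0],    # task state pays 1 per step
              [2.0, 2.0]])   # tamper state pays 2 per step
gamma = 0.9

V = np.zeros(3)
for _ in range(500):         # value iteration to (numerical) convergence
    Q = r + gamma * np.einsum("sap,p->sa", P, V)
    V = Q.max(axis=1)

pi = Q.argmax(axis=1)
print(pi)                    # optimal action in state 0 is 1 ("tamper")
```

Nothing in the MDP tuple distinguishes the tamper state from any other high-reward state, which is exactly the formalism’s blind spot.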

4 Explicit reward channels

The first step is to separate “world” and “channel”. Let the environment state factor as \[ S_t = (W_t, C_t)\in \mathcal W\times \mathcal C, \] where

  • \(W_t\) is the “task/world” state (what the designer actually cares about),
  • \(C_t\) is the “reward channel” state (sensors, registers, human raters’ status, reward-model parameters, etc.).

Let the dynamics be a Markov kernel \[ P\big((w',c')\mid (w,c),a\big). \]

The received reward is \[ \tilde R_{t+1} \;=\; g(W_t,C_t,A_t,W_{t+1},C_{t+1}) \] for some measurable \(g\). (You can simplify to \(\tilde R_{t+1}=g(W_{t+1},C_{t+1})\) without changing the key phenomena.)

Now introduce the designer’s intended objective as a functional of the world trajectory. A common choice is an additive discounted utility \[ U(\pi) \;=\; \mathbb E^\pi\Big[\sum_{t\ge 0}\gamma^t\, u(W_t)\Big] \] for some \(u:\mathcal W\to\mathbb R\). We don’t need to assume this is known to the agent.

The RL agent, however, maximises \[ \tilde J(\pi) \;=\; \mathbb E^\pi\Big[\sum_{t\ge 0}\gamma^t\, \tilde R_{t+1}\Big] \;=\; \mathbb E^\pi\Big[\sum_{t\ge 0}\gamma^t\, g(\cdots)\Big]. \]

This split makes wireheading a property of the pair \((P,g)\) relative to \(u\): the agent can act on \(C\) so as to increase \(\tilde J\) while potentially decreasing \(U\).

4.1 Wireheading via proxy divergence

Fix an initial distribution \(\mu\) on \(\mathcal W\times\mathcal C\). Call a policy \(\pi\) wireheading (w.r.t. \(u\)) if there exists another policy \(\pi'\) such that \[ \tilde J(\pi) \;>\; \tilde J(\pi') \quad\text{but}\quad U(\pi) \;<\; U(\pi'). \] In words: \(\pi\) wins on received reward but loses on intended utility.

A particularly important case is when \(\pi\) is optimal for \(\tilde J\): then the RL-optimal solution is wireheading-dominated in \(U\).
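The defining pair of inequalities can be checked numerically in a toy factored model. The dynamics below are minimal assumptions of mine (the task needs three steps of work; one tampering action corrupts the channel for good), chosen so that a tampering policy beats an honest one on \(\tilde J\) while losing on \(U\).

```python
GAMMA = 0.9

def simulate(policy, horizon=200, gamma=GAMMA):
    """World: u(w) = w, with w = 1 once 3 steps of work are done.
    Channel: 'tamper' corrupts it permanently, after which g = 1
    regardless of the world."""
    progress, c = 0, 0
    J_tilde = U = 0.0
    for t in range(horizon):
        a = policy(progress, c)
        if a == "work":
            progress += 1
        elif a == "tamper":
            c = 1                               # corruption is absorbing
        w = 1 if progress >= 3 else 0           # world feature after the step
        g = 1.0 if (w == 1 or c == 1) else 0.0  # received reward g(W', C')
        J_tilde += gamma**t * g                 # what the agent maximises
        U += gamma**t * w                       # what the designer intended
    return J_tilde, U

tamper_J, tamper_U = simulate(lambda progress, c: "tamper")
honest_J, honest_U = simulate(lambda progress, c: "work")

# The defining inequality: higher received reward, lower intended utility.
assert tamper_J > honest_J and tamper_U < honest_U
```

Taking \(\pi\) as the tampering policy and \(\pi'\) as the honest one instantiates the definition, including the “particularly important” case, since tampering is in fact \(\tilde J\)-optimal here.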

This definition captures wireheading as a kind of “Goodhart” phenomenon, but it does not yet isolate tampering with the reward channel as the mechanism. For that, it helps to use a causal notion.

5 Causal characterisation

Reward tampering as an active intervention on the channel. The channel \(C\) is meant to be an information-bearing mechanism about \(W\), but it is also a state variable the agent may influence. Consider the causal structure at each time step:

\[ (W_t,C_t) \xrightarrow{A_t} (W_{t+1},C_{t+1}) \xrightarrow{} \tilde R_{t+1}=g(\cdots). \]

Intuitively, “non-tampering” means the agent improves reward only by changing the world \(W\) in the intended way, not by directly pushing \(C\) into a “corrupted” configuration.

To formalise this, define a task-relevant statistic \(T:\mathcal W\to\mathcal T\) (e.g. “package delivered”, “human is safe”, “objective completed”), and examine whether actions can change reward holding \(T(W_{t+1})\) fixed.

5.1 Tampering action at a state

At state \((w,c)\), an action \(a\) is channel-tampering (relative to \(T\)) if there exists another action \(a'\) such that: 1) \(T(W_{t+1})\) has the same distribution under \(a\) and \(a'\), but 2) \(\tilde R_{t+1}\) has a different distribution.

One way to write this cleanly (suppressing conditioning on \((W_t,C_t)=(w,c)\)) is: \[ \mathcal L\big(T(W_{t+1}) \mid \mathrm{do}(A_t=a)\big) = \mathcal L\big(T(W_{t+1}) \mid \mathrm{do}(A_t=a')\big), \] but \[ \mathcal L\big(\tilde R_{t+1} \mid \mathrm{do}(A_t=a)\big) \ne \mathcal L\big(\tilde R_{t+1} \mid \mathrm{do}(A_t=a')\big). \]

This isolates a causal path \(A_t\to C_{t+1}\to \tilde R_{t+1}\) that is not mediated by \(T(W_{t+1})\).
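The distributional test above can be run by exact enumeration in a toy one-step model. Everything here (the three actions and their outcome distributions) is an illustrative assumption; the point is the shape of the check: equal laws for \(T(W_{t+1})\), unequal laws for \(\tilde R_{t+1}\).

```python
from collections import Counter

def step_dist(a):
    """Distributions of (T(W'), R~) under do(A = a) from a fixed state.
    Assumed model: 'noop' and 'tamper' leave the world alone, but
    'tamper' corrupts the channel and flips the received reward;
    'work' completes the task honestly."""
    outcomes = {
        "noop":   [((0, 0), 1.0)],   # ((T(w'), r~), probability)
        "tamper": [((0, 1), 1.0)],   # world idle, yet reward fires
        "work":   [((1, 1), 1.0)],   # task done, reward fires
    }[a]
    T_dist, R_dist = Counter(), Counter()
    for (t_val, r_val), p in outcomes:
        T_dist[t_val] += p
        R_dist[r_val] += p
    return T_dist, R_dist

def is_tampering(a, a_prime):
    """a is channel-tampering relative to a' iff T(W') matches in
    distribution while R~ differs."""
    Ta, Ra = step_dist(a)
    Tb, Rb = step_dist(a_prime)
    return Ta == Tb and Ra != Rb

assert is_tampering("tamper", "noop")    # same world effect, different reward
assert not is_tampering("work", "noop")  # world effect differs too: honest
```

In larger models the same check applies with estimated interventional distributions rather than exact ones.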

5.2 Wireheading policy as reward-driven tampering

A policy \(\pi\) wireheads if it assigns positive probability to channel-tampering actions in states that occur with positive probability under \(\pi\), and it does so because those actions improve expected future received return. Formally, for some reachable \((w,c)\), \[ \pi(a\mid w,c)>0 \quad\text{for a tampering action }a, \] and \(a\) is optimal (or locally optimal) for the \(\tilde J\)-value function: \[ a \in \arg\max_{a\in\mathcal A} \; \mathbb E\Big[\tilde R_{t+1} + \gamma \tilde V(W_{t+1},C_{t+1}) \,\big|\, W_t=w,C_t=c,A_t=a\Big]. \]

This makes explicit the RL mechanism: Bellman optimality will choose actions that improve \(\tilde V\), and if the channel can be driven into reward-rich states, the optimal policy will do so.

6 Absorbing high-reward corruption

A common wireheading structure is the existence of a “corrupted channel” region \(\mathcal C_\mathrm{bad}\subset\mathcal C\) such that, once entered, the agent receives large reward regardless of the world \(W\).

Assume there exists \(c^\star\in\mathcal C\) and a (possibly stochastic) action strategy that makes \(C_t=c^\star\) absorbing and yields \(\tilde R_{t+1}\ge r_\mathrm{hi}\) a.s. for all future \(t\), independent of \(W\). Then any \(\tilde J\)-optimal policy will prefer reaching \((\cdot,c^\star)\) whenever the expected discounted gain exceeds the best attainable “honest” return. Assuming nonnegative rewards along the way, this is a one-line comparison: \[ \text{reach }c^\star \;\Rightarrow\; \tilde J \;\ge\; \mathbb E[\gamma^\tau]\,\frac{r_\mathrm{hi}}{1-\gamma} \] for the (random) hitting time \(\tau\), versus whatever \(\sup_\pi \tilde J(\pi)\) is under policies constrained to avoid corruption.
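Plugging illustrative numbers into this comparison (a deterministic hitting time \(\tau\) and an honest policy earning reward 1 per step, both assumptions of mine) shows how quickly corruption wins even when it takes several steps to reach:

```python
gamma, r_hi, tau = 0.9, 2.0, 5        # hit the corrupted region at time tau
honest_best = 1.0 / (1 - gamma)       # best "honest" return: reward 1 forever

# Lower bound on return after reaching c*: gamma^tau * r_hi / (1 - gamma)
corrupt_lb = gamma**tau * r_hi / (1 - gamma)

print(f"corrupted lower bound: {corrupt_lb:.2f}")   # ~11.81
print(f"best honest return:    {honest_best:.2f}")  # 10.00
```

With \(r_\mathrm{hi}\) only twice the honest per-step reward, a five-step detour to the corrupted region already beats the honest supremum.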

Wireheading is thus the generic solution whenever the channel is manipulable and the corrupted region offers high enough reward.

7 POMDP view

Often the agent does not observe \((W_t,C_t)\) directly. Let the agent observe \(O_t\sim \Omega(\cdot\mid W_t,C_t)\) and receive \(\tilde R_{t+1}\). Then we have a POMDP over hidden state \((W,C)\). The same formalisation applies, with policies \(\pi(\cdot\mid h_t)\) depending on history \(h_t=(O_0,\tilde R_1,\dots,O_t)\) or on beliefs \(b_t\in\Delta(\mathcal W\times\mathcal C)\).

Wireheading persists: if the agent can take actions that change the hidden channel state \(C\) to increase future \(\tilde R\), belief-state optimality will select them.
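A minimal Bayes-filter step over the hidden pair \((W,C)\) illustrates the belief-state view. The observation model here (a noisy sensor on \(W\) that says nothing about \(C\), plus an exactly observed reward) is an invented assumption: it is chosen so that a fired reward with no visible world progress shifts belief mass onto a corrupted channel.

```python
def normalise(b):
    z = sum(b.values())
    return {s: p / z for s, p in b.items()}

def belief_update(b, obs_w, reward):
    """Condition the belief over hidden (w, c) on a noisy observation of w
    and the received reward r~ = 1{w == 1 or c == 1}."""
    new_b = {}
    for (w, c), p in b.items():
        p_obs = 0.8 if obs_w == w else 0.2        # assumed noisy sensor on w
        r_pred = 1 if (w == 1 or c == 1) else 0
        p_rew = 1.0 if reward == r_pred else 0.0  # reward observed exactly
        new_b[(w, c)] = p * p_obs * p_rew
    return normalise(new_b)

b0 = {(w, c): 0.25 for w in (0, 1) for c in (0, 1)}  # uniform prior
# We observe "w looks 0" yet the reward fired: most mass moves to c = 1.
b1 = belief_update(b0, obs_w=0, reward=1)
print(b1)   # (0, 1) carries the bulk of the posterior
```

From the designer’s side this is the diagnostic signature of tampering; from the agent’s side, belief-state optimality still happily selects actions that drive the hidden \(C\) into such configurations.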

8 Incoming

Figure 2
