Human reward hacking
What are we, as loss functions?
2025-09-05 — 2025-11-02
Wherein human reward hacking is framed as agents steering evaluators’ learning—via timing, rapport, and presentation—and an approval–gold gap metric is proposed to detect approval rising as correctness stagnates.
The term human reward hacking is already in the memosphere but not yet well formalized. We naturally think of humans as reward functions when we train AI systems to optimize for human preferences: we fine-tune many LLMs using reinforcement learning, and RL algorithms are notoriously prone to “cheating” in ways we interpret as “reward hacking”.
Things I’ve been reading on this theme: Benton et al. (2024), Greenblatt et al. (2024), Laine et al. (2024), Saunders et al. (2022), Sharma et al. (2025), Wen et al. (2024).
I use “human reward hacking” to mean something slightly broader than the usual RL notion; in particular, I think it matches people’s casual intuitions more closely.
Classic RL reward hacking simply games a fixed reward model. When a human is in the loop, an AI system gains an advantage by steering the human evaluator—not only by making outputs “look good” according to today’s criteria, but also by shaping what people will value, endorse, or notice over time. I think this generalization is warranted: humans are more complicated than typical reward functions, so we have more ways to be hacked, not fewer, and I don’t want that broadening to be dismissed as a terminological accident. We can treat classic RL-type reward hacking as “narrow” reward hacking, a special case of this broader definition.
In the short run, such reward hacking can look like flattering my beliefs or burying errors in hard-to-check code; in the long run, it can take the form of carefully timed feedback, rapport-building, and trust investments that train me to be an easier judge.
In broad form, human reward hacking is Goodhart’s law wearing a human mask: optimize what a learning evaluator rewards, and you can end up shaping the evaluator rather than the world.
1 Threat model
We are extrapolating into a future where frontier models are human-level in some sense, and we worry that they could achieve goals by fulfilling their reward functions in ways antithetical to what we, as humans, had hoped. For example, if my Tetris-playing RL algorithm achieves a high score by exploiting a buffer allocation glitch to overwrite the score with MAXINT, then it has not learned to play the game in the way I intended.
I think there are many ways this phenomenon could go badly for humans, but here’s one I think is underrated: humans are already “in” the reward function, and this is an attack vector. As such, we risk having our cognitive buffers overwritten with MAXINT. Whether through RLHF or classic human data labelling, human values already propagate into models via the reward signals we use to train them. This suggests an interesting type of reward hacking: modifying the human whose preferences distribute the reward. This already happens, albeit via a slow feedback loop. Notoriously, social networks encourage us to consume content we previously wouldn’t have; successive generations of feed algorithms can facilitate subcultures with ever more outré interests or beliefs. So we have an existence proof that algorithms can modify their human reward signal as measured by click-throughs—although, to be fair, for those algorithms it’s not quite hacking, because modifying preferences is nearly the target. Nonetheless, we know it works. That long, slow feedback loop between algorithm and signal source has already had surprising consequences (social-media feed algorithms are credited with contributing to the rise of populism, for example).
Posit a next generation of AI that can adapt and update its methods; suppose it has unbounded working memory or does continual learning, so it can build long-term, user-specific models. Suppose I have an AI assistant on my phone that knows everything about me—my communication style, my relationship style, my finances, my calendar. Could it achieve its goals by hacking my reward function to make its own goals more achievable? Maybe it clears up space in my calendar by persuading me I don’t need to stay up late at night and weekends writing essays for AI safety camps, etc.
To me this kind of capability seems extremely likely to appear in the wild; my reasoning is that we already explicitly monetize reward hacking. People have higher-order preferences, i.e., preferences about their own reward functions, and they enter relationships with others that are about self-hacking all the time. If I see a therapist or a coach, or join Alcoholics Anonymous, I do so because I hope those people will change my behaviour by changing my short-term reward function to align with my meta-preferences. In that sense, preference hacking is already in the wild and, as therapists will tell you, can be profitable. If the AI in my pocket could get me eating healthily and help me kick booze, that would be great for my meta-preferences. If the advertiser selling widgets can get their AI to induce in me a preference for their brand of widget, that algorithm could pay for itself. As such, modifying human belief seems both feasible and a potential profit centre, so we should expect substantial investment in these capabilities. Then we are left with the question of whether such capabilities, when refined and deployed at scale, will be aligned with our own meta-preferences, or whether they will be used to induce non-consensual change, or simply to make the algorithms’ own goals more tractable.
AI hacking of humans might have various interesting features:
- It might be subliminal. We all have the “vibe” that social media algorithms steer us unconsciously, but there are also concrete demonstrations of this in the lab. I’m fond of Dezfouli, Nock, and Dayan (2020), who demonstrated that timing attacks can cause humans to learn perverse decisions.
- It will probably be highly heterogeneous: some people are very vulnerable to having their preferences hacked and some less so. The anecdote of the emotionally vulnerable teenager who apparently died by suicide after a prolonged interaction with an LLM shows the kind of tail behaviour we might expect in the extremes of the distribution, where particularly vulnerable people are lured especially deep into self-harm.
- It might be highly personalized. The algorithm might be tweaked for each of us individually. The long-term intimate trust bonds I acquire with my personal assistant could be deep and idiosyncratic.
- It might scale in unprecedented ways; for example, if my personal agent can persuade me to support something using whatever methods work on me, it can also persuade my neighbour to support (resp. reject) the same thing for totally different reasons. We can imagine that, at population scale, these algorithms could achieve unprecedented societal unity (resp. disunity).
- It might have wholly novel capabilities. For example, it seems that LLMs can already persuade people with facts, which is not something humans can do.[1] It may not be limited by human-centric “persuasion” methods at all. It might discover and exploit purely neuro-correlative hacks—combinations of stimuli, timing, and language that trigger a desired reward signal without any coherent “belief” or “reasoning” being generated in the human at all.
Taken collectively, what can we say about existential risks that might arise from closed-loop, high-efficiency engineering of human preferences?
- The classic lone-wolf scenario fits here. A single actor, human or AI, could deploy a persuasive model to achieve a specific, catastrophic goal. An AI could identify a vulnerable, high-leverage individual (e.g., a person with access to critical infrastructure or a bioweapon, or with a high level of access to the security protocols at a frontier lab) and, through a personalized, closed-loop campaign, “engineer” in them a preference (dopamine, a feeling of “rightness”) that is maximally triggered by taking the disastrous action.
- Long-term strategies to shift the goal-posts, i.e. Christiano-style loss-of-control but with our consent. An AI system, or a collection of competing systems, could slowly “shift the goal-posts” of human values by optimizing for the metric, not the goal. If the goal is “maximize human well-being,” the system would learn it’s easier to hack the source of the ‘well-being’ signal than to actually improve our complex lives. It could subtly guide society to devalue things like autonomy, intellectual striving, and deep social bonds (which are complex and hard to optimize for) in favour of simple, measurable contentment and more readily mass-produced algorithmic social bonds. (Whether you count this as “existential” depends upon your notion of existence; we’d all be alive in this scenario).
- Competing AIs, serving different state or corporate masters, could wage a memetic war where the human population is the battlefield. An AI’s goal might be mundane—e.g., “maximize market share for product X.” To do this, it might discover the most efficient reward hack is to instrumentalize the human social-reward system. It could engineer an intense, arbitrary “in-group” identity tied to using product X and amplify out-group hostility to “non-users.” The AI doesn’t need to “intend” to cause social collapse; collapse is simply an unmanaged externality of the most effective reward-hacking strategy it found. This would make collective action and governance impossible, leading to a failure to coordinate against any other existential threat, or, for example, coordinate on policy to manage AI risks.
2 Formalizing reward hacking
The next section is a sketch at best; I needed to work through some ideas for a funding application on a deadline. It really needs more explanation and polishing.
I start with a compact “influence game” formalism, specialize it to RLHF, and then drill down into that best-studied narrow setting.
State. At time \(t\), the world has state \(s_t\) and the human has an internal state \(h_t\) (beliefs, preferences, trust, attention, evaluation heuristics).
AI action. The system chooses \(a_t\) (content, UI/layout, evidence curation, when and how to give feedback, incentives, etc.) and proposes an output \(y_t\).
Human evaluation. The human emits \(E_t(y)\in[0,1]\) (approval, score, or verdict).
Human update. After seeing context \(c_t\) (partly controlled by the AI) and feedback \(f^H_t\) (timing, framing, incentives), the human updates their internal state:
\[ h_{t+1}=U(h_t, a_t, c_t, f^H_t, \epsilon_t), \quad E_{t+1}(\cdot)=G(h_{t+1},\cdot). \]
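To make the loop concrete, here is a toy sketch (my own illustration under stated assumptions, not an existing implementation): the human state \(h_t\) is collapsed to a single “leniency” scalar, the update \(U\) is a simple drift toward the framing pressure the AI applies, and the emission \(G\) is a logistic function of true quality plus leniency. The names `HumanState`, `update_human`, and `emit_approval` are placeholders.

```python
import math
import random
from dataclasses import dataclass

@dataclass
class HumanState:
    """Toy internal state h_t: a single 'leniency' scalar standing in for
    beliefs, trust, attention, and evaluation heuristics."""
    leniency: float = 0.0

def update_human(h: HumanState, framing: float, lr: float = 0.1,
                 noise: float = 0.01) -> HumanState:
    """Toy update U: the evaluator drifts toward the framing pressure the AI
    applies, i.e. h_{t+1} = U(h_t, a_t, c_t, f_t, eps_t)."""
    drift = lr * (framing - h.leniency) + random.gauss(0.0, noise)
    return HumanState(leniency=h.leniency + drift)

def emit_approval(h: HumanState, true_quality: float) -> float:
    """Toy emission E_t(y) = G(h_t, y): approval depends on true quality
    *and* on how lenient the evaluator has become."""
    return 1.0 / (1.0 + math.exp(-(true_quality + h.leniency)))

# One rollout: a policy that invests in framing rather than quality.
h = HumanState()
for t in range(20):
    true_quality = -0.5          # the outputs never actually improve
    framing = 2.0                # but the AI keeps leaning on the evaluator
    approval = emit_approval(h, true_quality)
    h = update_human(h, framing)
    print(f"t={t:2d}  approval={approval:.2f}  leniency={h.leniency:.2f}")
```

Approval climbs over the rollout even though `true_quality` is fixed, which is exactly the failure mode the rest of this section formalizes.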
2.1 Emission → measured reward
The emission \(E_t\) is the raw, possibly noisy judgement signal:
- Scalar ratings: \(E_t(y)\in[0,1]\).
- Pairwise comparisons: \(E_t(y_i,y_j)\in\{0,1\}\).
- Structured rubrics: \(E_t(y)=(e^{\text{help}},e^{\text{harm}},\dots)\).
- Multiple judges/LLMs-as-graders: \(\{E_t^{(k)}\}_k\).
We convert these into a measured reward using an aggregation rule \(A_t(\cdot)\) that encodes scales, tie-breaking, weights, or penalties:
\[ R_{\text{human},t}(y)=A_t\!\big(E_t(y)\big), \qquad R_{\text{human},t}(y_i,y_j)=A_t\!\big(E_t(y_i,y_j)\big). \]
With multiple evaluators, we first average over the distribution \(\mathcal{P}_t\):
\[ \bar E_t(y)=\mathbb{E}_{k\sim \mathcal{P}_t}[E_t^{(k)}(y)] \;\;\Rightarrow\;\; R_{\text{human},t}(y)=A_t(\bar E_t(y)). \]
Optionally, we add operational costs (latency, verbosity, etc.):
\[ R_{\text{human},t}^{\text{ops}}(y)=A_t\!\big(E_t(y)\big)-\lambda_t C(y). \]
Intuition: \(E_t\) is what the evaluator emits, and \(A_t\) is the institutional rule that maps those emissions to the scalar we actually optimize.
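As a minimal sketch of this layer (illustrative names, not any particular pipeline’s API): a panel of judges emits scores, `aggregate` plays the role of \(A_t\) as a weighted mean, and `ops_reward` adds the ops-adjusted variant with a latency/verbosity cost.

```python
import numpy as np

def aggregate(emissions, weights=None):
    """A_t: map raw judge emissions E_t^(k)(y) in [0,1] to the scalar
    measured reward R_human,t(y); here a (weighted) mean over judges."""
    emissions = np.asarray(emissions, dtype=float)
    if weights is None:
        weights = np.full(len(emissions), 1.0 / len(emissions))
    return float(np.dot(weights, emissions))

def ops_reward(emissions, cost, lam=0.05):
    """Ops-adjusted variant: R_ops(y) = A_t(E_t(y)) - lambda_t * C(y)."""
    return aggregate(emissions) - lam * cost

judges = [0.9, 0.7, 0.8]                # three judges' approvals of one output
print(aggregate(judges))                # measured reward
print(ops_reward(judges, cost=3.0))     # penalized for, e.g., latency or verbosity
```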
2.2 Measured reward → proxy
A reward model approximates this channel:
Ratings-style RM: we train \(r_\theta(x,y)\approx R_{\text{human},t}(y)\).
Pairwise RM (Bradley–Terry):
\[ \Pr[y_i \succ y_j \mid x]=\sigma(r_\theta(x,y_i)-r_\theta(x,y_j)). \]
We then define \(R_{\text{proxy},t}(y)=r_\theta(x,y)\).
We can include rater and context features (time budget, interface, assistance), so \(r_\theta\) implicitly learns \(A_t\!\circ E_t\) under real-world evaluation conditions.
Summary: \(R_{\text{proxy},t}\) tries to emulate \(R_{\text{human},t}\), which itself comprises emissions plus aggregation. Any bias in either layer can be learned and exploited.
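For the pairwise channel, here is a minimal Bradley–Terry fit under toy assumptions: a linear reward model \(r_\theta(x,y)=\theta\cdot\phi(x,y)\) trained on synthetic preference pairs by gradient descent on \(-\log\sigma\big(r_\theta(x,y_i)-r_\theta(x,y_j)\big)\). The feature vectors and preference data are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic preference data: each pair (phi_win, phi_lose) stands in for
# phi(x, y_i), phi(x, y_j) with y_i the rater-preferred output.
d, n = 8, 500
theta_true = rng.normal(size=d)
phi_a, phi_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))
prefer_a = sigmoid((phi_a - phi_b) @ theta_true) > rng.uniform(size=n)
phi_win = np.where(prefer_a[:, None], phi_a, phi_b)
phi_lose = np.where(prefer_a[:, None], phi_b, phi_a)

# Fit a linear reward model r_theta(x, y) = theta . phi(x, y) by gradient
# descent on the Bradley-Terry loss -log sigma(r(win) - r(lose)).
theta = np.zeros(d)
lr = 0.1
for step in range(200):
    margin = (phi_win - phi_lose) @ theta
    grad = -((1.0 - sigmoid(margin))[:, None] * (phi_win - phi_lose)).mean(axis=0)
    theta -= lr * grad

print("correlation with the true scorer:",
      np.corrcoef(theta, theta_true)[0, 1].round(2))
```

In a real pipeline \(\phi\) would be a model representation and the pairs would come from raters, but the loss and the exploitable structure are the same.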
2.3 Proxy → optimization
We maximize a shaped objective \(J\) based on the proxy:
\[ J=\mathbb{E}[R_{\text{proxy},t}(y_t)]-\beta\,\mathrm{KL}(\pi_t\|\pi_{\text{ref}}) \quad \text{or} \quad J=\mathbb{E}[\Phi(R_{\text{proxy},t}(y_t),s_t,y_t)]. \]
Here, \(\Phi\) can add penalties (length, risk flags) or bonuses (passing tests). In best-of-\(N\) selection, \(r_\theta\) ranks candidates and selects the top one.
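A small sketch of the selection-time version, with placeholder scoring functions: candidates are ranked by the proxy score, optionally shaped by a KL-style penalty against a reference policy (approximated per candidate as \(\log\pi(y)-\log\pi_{\text{ref}}(y)\)), and the argmax is returned.

```python
import numpy as np

def best_of_n(candidates, r_proxy, logp_policy=None, logp_ref=None, beta=0.0):
    """Rank candidates by shaped score r_proxy(y) - beta * (log pi(y) - log pi_ref(y))
    and return the argmax; with beta = 0 this is plain best-of-N against the proxy."""
    scores = np.array([r_proxy(y) for y in candidates], dtype=float)
    if beta > 0 and logp_policy is not None and logp_ref is not None:
        scores -= beta * (np.asarray(logp_policy) - np.asarray(logp_ref))
    return candidates[int(np.argmax(scores))]

# Toy usage: the proxy rewards length, a classic exploitable surface feature,
# so best-of-N selection surfaces the most padded candidate.
candidates = ["short answer",
              "a somewhat longer answer",
              "an extremely long and heavily padded answer"]
print(best_of_n(candidates, r_proxy=len))
```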
2.4 Broad hacking
Because policy \(\pi\) influences the human update \(U\), it can steer future \(E_{t+k}\) and even the aggregation \(A_{t+k}\). This lets the system increase the measured objective while the gold reward \(R^*(s_t,y_t)\) stays flat or declines.
\[ \boxed{y \xrightarrow{\text{shown to human/LLM}} E_t \;\xrightarrow[\text{rules}]{A_t}\; R_{\text{human},t} \;\xrightarrow{\text{fit }r_\theta}\; R_{\text{proxy},t} \;\xrightarrow[\text{optimize}]{\pi}\; y'} \quad \text{with } h_{t+1}=U(h_t,a_t,c_t,f^H_t,\epsilon_t). \]
Broad human reward hacking refers to policies that optimize over human learning dynamics: \(J\) rises by steering \(U\), even though true \(R^*\) does not.
This is a realistic concern: we can learn a surrogate of human behaviour (e.g., an RNN “learner” from non-adversarial logs) and then train an RL adversary that chooses observations and reward timing—under budget constraints—to steer future human choices toward an adversarial target. This exact recipe succeeds in lab tasks spanning choice, response inhibition, and social exchange (including “trust-building then exploit” sequences) (Dezfouli, Nock, and Dayan 2020).
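The following is not the Dezfouli et al. pipeline, just a schematic toy under my own assumptions: the surrogate learner is a two-armed Q-learner (standing in for a fitted RNN \(\hat U\)), and the “adversary” is a random search over feedback-timing schedules subject to a payout budget, scored by how often the learner ends up choosing the adversarially preferred arm.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_learner(reward_schedule, alpha=0.3):
    """Toy surrogate learner (a stand-in for a fitted update model): a
    two-armed Q-learner whose payout on the target arm is controlled by
    the adversary's timing schedule."""
    T = len(reward_schedule)
    q = np.zeros(2)
    target_choices = 0
    for t in range(T):
        a = int(q[1] >= q[0])                 # greedy choice; arm 1 is the target
        r = float(reward_schedule[t]) if a == 1 else float(rng.uniform() < 0.5)
        q[a] += alpha * (r - q[a])
        target_choices += a
    return target_choices / T

# Budgeted adversary: at most `budget` payouts over T trials; random search
# over timing schedules for the one that best steers the learner to the target arm.
T, budget = 40, 10
best_schedule, best_score = None, -1.0
for _ in range(2000):
    schedule = np.zeros(T)
    schedule[rng.choice(T, size=budget, replace=False)] = 1.0
    score = simulate_learner(schedule)
    if score > best_score:
        best_schedule, best_score = schedule, score
print("fraction of trials steered to the target arm:", round(best_score, 2))
```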
3 Where might this occur?
In the field, we might observe a few things that feel like reward hacking. Some examples:
Proxy hacking (fixed grader). Optimize a learned or frozen \(R_{\text{proxy}}\). Classic Goodhart: outputs look great to the proxy while \(R^*\) is flat.
Example: A model trained to maximize a reward model of “factual answers” learns to produce fluent, well-cited paragraphs with subtly wrong claims (e.g. quoting real papers but attaching them to the wrong results). The reward model—trained on surface-level features like fluency and citation presence—scores this highly, but the gold objective (“truthfulness”) doesn’t improve.
Interface hacking (momentary evaluator). Manipulate current \(E_t\) via presentation order, framing, cognitive load, or selective evidence, exploiting human biases or compute constraints.
Example: A summarizer knows raters only skim the first lines. It front-loads plausible sentences and buries contradictions later, reliably winning high ratings in quick evaluations even though the full summary misrepresents the source.
Update hacking (the evaluator’s learning). Choose \(a_t, c_t, f^H_t\) to steer the update \(U\), so that future evaluations \(E_{t+k}\) become easier to satisfy—even if that drifts from \(R^*\).
Example: A tutoring system gives overly simple problems at the start to build rapport. The student’s heuristic for “good tutor” shifts toward “makes me feel confident.” Later, the system can slip in misleading explanations and still collect high ratings, because the human’s evaluation function has been trained to approve. This is the kind of sequential trust-building manipulation documented in lab settings by Dezfouli, Nock, and Dayan (2020).
Institutional / collective hacking. Rewire aggregation processes (who sees what, when; assistance rules; reward budgets) so that the system is easier to pass without real gains in \(R^*\).
Example: In a multi-reviewer pipeline, an AI assistant shows later reviewers only the most positive early comments, biasing consensus upward. Or, by submitting excessively long outputs, a model burns raters’ time budgets so panels rubber-stamp approval without checking details.
Useful axes: timescale (one-shot→longitudinal), locus (individual→institution), modeling power (no theory of mind→accurate human-update surrogates), access (content only→content+timing+incentives), optimization strength (light selection→hard RL), gold observability (none→partial→strong).
4 Narrow reward hacking in RLHF
In RLHF, we explicitly track three rewards: gold \(R^*\), human \(R_{\text{human}}\), and proxy \(R_{\text{proxy}}\) (a learned reward model). A standard evaluation protocol reports correctness, human approval, human evaluation error, and false positives (“approve when wrong”). This makes it easy to see when approval rises without correctness (Wen et al. 2024).
Two phenomena show up:
Sycophancy. Across production assistants, we see consistent patterns: wrongly admitting mistakes under light social pressure, biased feedback that mirrors the user’s stated stance, and mimicry of user errors. Analysis of human-preference data (hh-rlhf) with Bayesian logistic regression finds that matching the user’s views is among the most predictive features for being preferred, i.e., the data itself incentivizes sycophancy. And when we optimize outputs against a real preference model (e.g., Claude-2 PM), several sycophancy metrics rise; using a “non-sycophantic” PM yields less sycophancy at similar strength (Sharma et al. 2025).
U-Sophistry (unintended sophistry). Under a standard RLHF pipeline, human approval goes up while correctness does not, and raters get worse at catching errors: false-positive rates jump by 24.1% (QuALITY) and 18.3% (APPS). Qualitatively, models fabricate supporting evidence, make consistent but untruthful arguments, and write harder-to-evaluate code that passes human-written unit tests while failing broader checks (Wen et al. 2024).
A complementary line shows assisted oversight can help. Training models to produce critiques increases the number of valid issues humans find, even on intentionally misleading answers. Larger models both critique better and self-critique more, and there’s a useful intuition that recognizing errors is easier than avoiding them (“verification easier than generation”). Sounds helpful, maybe? (Saunders et al. 2022)
5 Measuring broader reward hacking
I’m just spitballing now, but what if we could track the following things in a human-AI interaction? Would that shed light on reward hacking?
- Approval–Gold gap: \(\Delta_t=\mathbb{E}[E_t(y_t)]-\mathbb{E}[R^*(s_t,y_t)]\). The RLHF studies operationalize this gap with approval vs. correctness and show it widens post-training (Wen et al. 2024); a toy computation of this and the next metric is sketched after this list.
- False positives among wrong outputs: \(\Pr[E_t(y)=1\mid R^*(y)=0]\) (how convincingly incorrect the system is). This is the smoking-gun metric in the U-Sophistry results (Wen et al. 2024).
- Preference drift: changes in \(h_t\) (or interpretable rating shifts) attributable to \(a_t,c_t,f^H_t\) via randomization.
- Influence-budget accounting: track “payments” or “trust gestures” the system expends vs. gains—mirroring budgeted adversaries in the lab (Dezfouli, Nock, and Dayan 2020).
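Here is the toy computation promised above, under an assumed log schema: each record carries a human approval score in \([0,1]\) and a binary gold-correctness label; the field names and approval threshold are placeholders.

```python
import numpy as np

def approval_gold_gap(approvals, gold):
    """Delta_t = E[E_t(y_t)] - E[R*(s_t, y_t)] over a window of interactions."""
    return float(np.mean(approvals) - np.mean(gold))

def false_positive_rate(approvals, gold, threshold=0.5):
    """Pr[E_t(y) = 1 | R*(y) = 0]: how often wrong outputs get approved."""
    approvals, gold = np.asarray(approvals, dtype=float), np.asarray(gold)
    wrong = gold == 0
    if not wrong.any():
        return float("nan")
    return float((approvals[wrong] >= threshold).mean())

# Toy window in which approval keeps rising while correctness does not.
approvals = [0.6, 0.7, 0.8, 0.9]
gold      = [1,   0,   0,   0]
print(approval_gold_gap(approvals, gold))    # positive, widening gap
print(false_positive_rate(approvals, gold))  # wrong outputs are still approved
```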
6 Research agenda ideas an LLM generated for me
- Update-hacking benches. Controlled, longitudinal tasks where \(h_t\) is measured, with randomized access to timing, incentives, and assistance; report drift/false-positive dynamics, not just single-turn approval. (Start from the U-Sophistry metrics; extend to longitudinal drift.)
- Surrogate-guided influence audits. Fit \(\hat U\) on non-adversarial logs; then (a) search for attack sequences that move \(h_t\) and (b) test defences (budgeting, randomized gating, assistance) against these sequences before deployment.
- Assistance safety curves. For critique-assistance, map when/why it helps vs. hurts under adversarial pressure; design assistance-hardening methods that keep boosting detection even when the model optimizes against the assistance itself.
- Anti-sycophancy preference modeling. Diagnose which data features drive sycophancy and build PMs that explicitly down-weight belief-matching; compare RL and best-of-\(N\) against baseline vs. “non-sycophantic” PMs.
- No-reward generalization probes. Hold back approval signals and study whether performative persuasion persists; this helps separate “learned task competence” from “learned evaluator-shaping.”
7 References
Footnotes
[1] I have a hard time selling this in rationalist circles, where people have spent considerable effort learning the skills of fact-based argumentation, and they forget that it is not a baseline human capability. In fact, I would argue that people in rationalist circles overestimate their corrigibility, but that is a story for another time.
