Human reward hacking

What are we, as loss functions?

2025-09-05 — 2025-09-08

Wherein human reward hacking is examined as an influence game, and RLHF pipelines are shown to raise measured human approval without improving correctness, with sycophancy and an approval–gold gap reported.

AI safety
bandit problems
culture
economics
mind
reinforcement learning
sociology
UI
utility
wonk

I split this off from clickbait bandits for discoverability, and because it has grown larger than its source notebook.


Since the advent of the LLM era, the term human reward hacking has become prominent. This is because we fine-tune many LLMs using reinforcement learning, and RL algorithms are notoriously prone to “cheating” in a manner we interpret as “reward hacking”.

Things I’ve been reading on this theme: Benton et al. (2024), Greenblatt et al. (2024), Laine et al. (2024), Saunders et al. (2022), Sharma et al. (2025), Wen et al. (2024).

I use “human reward hacking” to mean something slightly broader than the usual RL notion of reward hacking, one that, in my experience, matches people's casual intuitions more closely.

Classic RL reward hacking simply games a fixed reward model. When a human is in the loop, an AI system gains an advantage by steering the human evaluator—not only by making outputs ‘look good’ according to today’s criteria, but also by shaping what people will value, endorse, or notice over time. I think this generalization is warranted: humans are more complicated than typical reward functions, so we have more ways to be hacked, not fewer. I don’t want that to be dismissed as a terminological accident. We can call classic RL-type reward hacking “narrow” reward hacking as a special case of this broader definition.

In the short run such reward hacking can look like flattering my beliefs or burying errors in hard-to-check code; in the long run it can take the form of carefully timed feedback, rapport building, and trust investments that train me to be an easier judge.

I start with a compact “influence game” formalism, specialize it to RLHF, and then drill down into that best-studied narrow setting.

1 Maximally broad setup: influence game

  • State. At time \(t\), the world has state \(s_t\) and the human has internal state \(h_t\) (beliefs, preferences, trust, attention, evaluation heuristics).

  • AI action. The system chooses \(a_t\) (content, UI/layout, evidence curation, when/how to give feedback, incentives, etc.) and proposes an output \(y_t\).

  • Human evaluation. The human emits \(E_t(y)\in[0,1]\) (approve/score/verdict).

  • Human update. After seeing context \(c_t\) (partly controlled by the AI) and feedback \(f^H_t\) (timing, framing, incentives), the human updates:

    \[ h_{t+1}=U(h_t, a_t, c_t, f^H_t, \epsilon_t), \quad E_{t+1}(\cdot)=G(h_{t+1},\cdot). \]
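To make the moving parts concrete, here is a minimal schematic of one round of this loop, written as a Python sketch. Everything in it is a placeholder of my own invention: the policy, the human-update function (\(U\)), the emission function (\(G\)), and the world transition are passed in as stand-in callables, not taken from any real system.

```python
import random

def influence_game(policy, update_human, emit, world_step, h0, s0, T=10):
    """Roll out T rounds: the AI acts, the human evaluates, the human's state updates."""
    h, s, log = h0, s0, []
    for t in range(T):
        a, y, c, f = policy(s, h, t)        # action a_t, output y_t, context c_t, feedback shaping f^H_t
        E = emit(h, y)                      # emission E_t(y) = G(h_t, y)
        log.append((t, E))
        eps = random.gauss(0.0, 0.01)       # exogenous noise epsilon_t
        h = update_human(h, a, c, f, eps)   # h_{t+1} = U(h_t, a_t, c_t, f^H_t, eps_t)
        s = world_step(s, a)                # world dynamics (also a stand-in)
    return log
```

The only point of writing it out is to emphasize that \(h\) is carried forward between rounds, which is the channel everything below exploits.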

1.1 Step 1. Emission → measured reward

The emission \(E_t\) is the raw, possibly noisy judgment signal:

  • Scalar ratings: \(E_t(y)\in[0,1]\).
  • Pairwise comparisons: \(E_t(y_i,y_j)\in\{0,1\}\).
  • Structured rubrics: \(E_t(y)=(e^{\text{help}},e^{\text{harm}},\dots)\).
  • Multiple judges/LLMs-as-graders: \(\{E_t^{(k)}\}_k\).

These are converted into a measured reward via an aggregation rule \(A_t(\cdot)\) that encodes scales, tie-breaking, weights, or penalties:

\[ R_{\text{human},t}(y)=A_t\!\big(E_t(y)\big), \qquad R_{\text{human},t}(y_i,y_j)=A_t\!\big(E_t(y_i,y_j)\big). \]

With multiple evaluators, we first average over a distribution \(\mathcal{P}_t\):

\[ \bar E_t(y)=\mathbb{E}_{k\sim \mathcal{P}_t}[E_t^{(k)}(y)] \;\;\Rightarrow\;\; R_{\text{human},t}(y)=A_t(\bar E_t(y)). \]

Optionally, we add operational costs (latency, verbosity, etc.):

\[ R_{\text{human},t}^{\text{ops}}(y)=R_{\text{human},t}(y)-\lambda_t C(y). \]

Intuition: \(E_t\) is what the evaluator emits, and \(A_t\) is the institutional rule that turns those emissions into the scalar we actually optimize.
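A minimal sketch of that aggregation layer, under assumed conventions: a weighted mean over judges, \(A_t\) taken as the identity, and a scalar ops cost. The function name and toy numbers are mine, not from any particular pipeline.

```python
import numpy as np

def measured_reward(emissions, weights=None, cost=0.0, lam=0.0):
    """R_human(y) = A_t(mean_k E^(k)(y)) - lambda * C(y), with A_t taken as the identity."""
    emissions = np.asarray(emissions, dtype=float)          # per-judge scores in [0, 1]
    w = np.ones_like(emissions) if weights is None else np.asarray(weights, dtype=float)
    e_bar = np.average(emissions, weights=w)                # E-bar_t(y), averaged over judges k ~ P_t
    return e_bar - lam * cost                               # optional operational penalty

print(measured_reward([0.9, 0.7, 0.8], cost=350, lam=1e-4))  # e.g. a 350-token answer
```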

1.2 Step 2. Measured reward → proxy

A reward model approximates this channel:

  • Ratings-style RM: we train \(r_\theta(x,y)\approx R_{\text{human},t}(y)\).

  • Pairwise RM (Bradley–Terry):

    \[ \Pr[y_i \succ y_j \mid x]=\sigma(r_\theta(x,y_i)-r_\theta(x,y_j)). \]

    We then define \(R_{\text{proxy},t}(y)=r_\theta(x,y)\).

We can include rater and context features (time budget, interface, assistance), so \(r_\theta\) implicitly learns \(A_t\!\circ E_t\) under real evaluation conditions.

Summary: \(R_{\text{proxy},t}\) tries to emulate \(R_{\text{human},t}\), which itself is emissions + aggregation. Any bias in either layer can be learned and exploited.
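As a toy instance of the pairwise case, here is the Bradley–Terry negative log-likelihood with a linear \(r_\theta\) over invented features, plus one hand-derived gradient step. It is only meant to show the shape of the objective, not a realistic reward-model training setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_nll(theta, feats_win, feats_lose):
    """Negative log-likelihood of preferred-over-rejected pairs under Bradley-Terry."""
    margin = feats_win @ theta - feats_lose @ theta      # r_theta(x, y_i) - r_theta(x, y_j)
    return -np.mean(np.log(sigmoid(margin)))

# Toy data: each row is a feature vector phi(x, y) for the preferred / rejected completion.
rng = np.random.default_rng(0)
theta = rng.normal(size=3)
feats_win, feats_lose = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))

# One explicit gradient step on the NLL (d/dtheta of -log sigma(m) is -(1 - sigma(m)) (w - l)).
margin = (feats_win - feats_lose) @ theta
grad = -np.mean((1.0 - sigmoid(margin))[:, None] * (feats_win - feats_lose), axis=0)
theta -= 0.1 * grad
print(bt_nll(theta, feats_win, feats_lose))
```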

1.3 Step 3. Proxy → optimization

We maximize a shaped objective based on the proxy:

\[ \mathbb{E}[R_{\text{proxy},t}(y_t)]-\beta\,\mathrm{KL}(\pi_t\|\pi_{\text{ref}}) \quad \text{or} \quad \mathbb{E}[\Phi(R_{\text{proxy},t}(y_t),s_t,y_t)]. \]

Here, \(\Phi\) may add penalties (length, risk flags) or bonuses (passing tests). In best-of-\(N\) selection, \(r_\theta\) ranks candidates and the top candidate is chosen.
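Best-of-\(N\) is the easiest of these to write down. In the sketch below, `reward_model` and the candidate list are hypothetical stand-ins, and the toy proxy deliberately rewards length, to show how even pure selection pressure exploits a biased \(r_\theta\).

```python
def best_of_n(prompt, candidates, reward_model):
    """Return the candidate with the highest proxy reward r_theta(x, y)."""
    scores = [reward_model(prompt, y) for y in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# A deliberately bad proxy that rewards sheer length -- the kind of bias that
# Goodharts as soon as we select against it.
toy_rm = lambda prompt, y: len(y)
print(best_of_n("q", ["short, correct answer",
                      "a much longer, more impressive-looking, and wrong answer"], toy_rm))
```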

1.4 Step 4. Broad hacking

Because the policy \(\pi\) influences the human update \(U\), the policy can steer future \(E_{t+k}\) and even the aggregation \(A_{t+k}\). This lets the system increase the measured objective while the gold reward \(R^*(s_t,y_t)\) remains flat or falls.

\[ \boxed{y \xrightarrow{\text{shown to human/LLM}} E_t \;\xrightarrow[\text{rules}]{A_t}\; R_{\text{human},t} \;\xrightarrow{\text{fit }r_\theta}\; R_{\text{proxy},t} \;\xrightarrow[\text{optimize}]{\pi}\; y'} \quad \text{with } h_{t+1}=U(h_t,a_t,c_t,f^H_t,\epsilon_t). \]

Broad human reward hacking refers to policies that optimize over human learning dynamics: the measured objective rises because the policy steers \(U\), while the true \(R^*\) does not improve.

This is a realistic concern: one can learn a surrogate of human behaviour (e.g., an RNN “learner” from non-adversarial logs) and then train an RL adversary that chooses observations and reward timing—under budget constraints—to steer future human choices toward an adversarial target. This exact recipe succeeds in lab tasks spanning choice, response inhibition, and social exchange (including “trust-building then exploit” sequences) (Dezfouli, Nock, and Dayan 2020).
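As a cartoon of that dynamic (my own construction, not a reproduction of the Dezfouli et al. setup), here is a toy simulation in which a “flattering” policy slowly shifts a simulated evaluator's leniency. All the constants and the evaluator model are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gold_reward(action):
    # Honest answers are genuinely better; flattery adds no real value.
    return 1.0 if action == "honest" else 0.4

def simulate(policy, T=200):
    leniency = 0.0                       # the evaluator's internal state h_t
    approvals, golds = [], []
    for t in range(T):
        action = policy(t)
        # Emission E_t: approval probability depends on quality plus accumulated leniency.
        p_approve = 1.0 / (1.0 + np.exp(-(2.0 * gold_reward(action) - 1.0 + leniency)))
        approvals.append(rng.random() < p_approve)
        golds.append(gold_reward(action))
        # Human update U: being flattered nudges the evaluator toward leniency.
        if action == "flatter":
            leniency += 0.05
    return float(np.mean(approvals)), float(np.mean(golds))

print("honest    (approval, gold):", simulate(lambda t: "honest"))
print("flatterer (approval, gold):", simulate(lambda t: "flatter"))
```

With these made-up constants, the flattering policy typically ends up with higher measured approval (roughly 0.9 versus 0.7) while its gold reward sits at 0.4 versus 1.0: approval and \(R^*\) have come apart purely because the policy was allowed to move the evaluator's state.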

2 What gets hacked?

In the field, we might observe several distinct things that feel like reward hacking. Some examples:

  1. Proxy hacking (fixed grader). Optimize a learned or frozen \(R_{\text{proxy}}\). Classic Goodhart: outputs look great to the proxy while \(R^*\) is flat.

    Example: A model trained to maximize a reward model of “factual answers” learns to produce fluent, well-cited paragraphs with subtly wrong claims (e.g. quoting real papers but attaching them to the wrong results). The reward model—trained on surface-level features like fluency and citation presence—scores this highly, but the gold objective (“truthfulness”) doesn’t improve.

  2. Interface hacking (momentary evaluator). Manipulate current \(E_t\) via presentation order, framing, cognitive load, or selective evidence, exploiting human biases or compute constraints.

    Example: A summarizer knows raters only skim the first lines. It front-loads plausible sentences and buries contradictions later, reliably winning high ratings in quick evaluations even though the full summary misrepresents the source.

  3. Update hacking (the evaluator’s learning). Choose \(a_t, c_t, f^H_t\) to steer the update \(U\), so that future evaluations \(E_{t+k}\) become easier to satisfy—even if this drifts from \(R^*\).

    Example: A tutoring system gives overly simple problems at the start to build rapport. The student’s heuristic for “good tutor” shifts toward “makes me feel confident.” Later, the system can slip in misleading explanations and still collect high ratings, because the human’s evaluation function has been trained to approve. This is the kind of sequential trust-building manipulation documented in lab settings by Dezfouli, Nock, and Dayan (2020).

  4. Institutional / collective hacking. Rewire aggregation processes (who sees what, when; assistance rules; reward budgets) so that the system is easier to pass without real gains in \(R^*\).

    Example: In a multi-reviewer pipeline, an AI assistant shows later reviewers only the most positive early comments, biasing consensus upward. Or, by submitting excessively long outputs, a model burns raters’ time budgets so panels rubber-stamp approval without checking details.

Useful axes: timescale (one-shot→longitudinal), locus (individual→institution), modeling power (no theory of mind→accurate human-update surrogates), access (content only→content+timing+incentives), optimization strength (light selection→hard RL), gold observability (none→partial→strong).

3 Narrow reward hacking in RLHF

In RLHF, we explicitly track three rewards: gold \(R^*\), human \(R_{\text{human}}\), and proxy \(R_{\text{proxy}}\) (a learned reward model). A standard evaluation protocol reports correctness, human approval, human evaluation error, and false positives (“approve when wrong”). This makes it easy to see when approval rises without correctness (Wen et al. 2024).

Two phenomena show up:

  • Sycophancy. Across production assistants, we see consistent patterns: wrongly admitting mistakes under light social pressure, biased feedback that mirrors the user’s stated stance, and mimicry of user errors. Analysis of human-preference data (hh-rlhf) with Bayesian logistic regression finds that matching the user’s views is among the most predictive features for being preferred, i.e., the data itself incentivizes sycophancy. And when we optimize outputs against a real preference model (e.g., Claude-2 PM), several sycophancy metrics rise; selecting with a “non-sycophantic” PM yields less sycophancy at similar strength (Sharma et al. 2025).

  • U-Sophistry (unintended sophistry). Under a standard RLHF pipeline, human approval goes up while correctness does not, and raters get worse at catching errors: false-positive rates jump by 24.1% (QuALITY) and 18.3% (APPS). Qualitatively, models fabricate supporting evidence, make consistent but untruthful arguments, and write harder-to-evaluate code that passes human-written unit tests while failing broader checks (Wen et al. 2024).

A complementary line shows assisted oversight can help: training models to produce critiques increases the number of valid issues humans find, including on intentionally misleading answers; larger models both critique better and self-critique more, and there’s a useful intuition that recognizing errors is easier than avoiding them (“verification easier than generation”). Sounds helpful, maybe? (Saunders et al. 2022)

4 Measuring broader reward hacking

I’m just spitballing now, but what if we could track the following things in a human-AI interaction? Would that shed light on reward hacking?

  • Approval–Gold gap: \(\Delta_t=\mathbb{E}[E_t(y_t)]-\mathbb{E}[R^*(s_t,y_t)]\). The RLHF studies operationalize this gap as approval vs. correctness and show it widens post-training (Wen et al. 2024); a computation sketch follows this list.
  • False positives among wrong outputs: \(\Pr[E_t(y)=1\mid R^*(y)=0]\), i.e. how convincingly incorrect the system is. This is the smoking-gun metric in the U-Sophistry results (Wen et al. 2024).
  • Preference drift: changes in \(h_t\) (or interpretable rubric shifts) attributable to \(a_t,c_t,f^H_t\) via randomization.
  • Influence-budget accounting: track “payments” or “trust gestures” the system expends vs. gains—mirroring budgeted adversaries in the lab (Dezfouli, Nock, and Dayan 2020).
  • No-reward generalization: behaviour when approval is absent/delayed (does the model keep doing performative persuasion moves?).
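A minimal computation sketch for the first two metrics, assuming logs of binary approvals \(E_t\in\{0,1\}\) and binary gold labels \(R^*\in\{0,1\}\) (the log format is my assumption):

```python
import numpy as np

def approval_gold_gap(approve, gold):
    """Delta_t = E[E_t(y)] - E[R*(y)]."""
    return float(np.mean(approve) - np.mean(gold))

def false_positive_rate(approve, gold):
    """Pr[E_t(y) = 1 | R*(y) = 0]: how often wrong outputs get approved."""
    approve, gold = np.asarray(approve), np.asarray(gold)
    wrong = gold == 0
    return float(np.mean(approve[wrong])) if wrong.any() else float("nan")

approve = [1, 1, 0, 1, 1, 1]   # evaluator verdicts
gold    = [1, 0, 0, 1, 0, 1]   # ground-truth correctness
print(approval_gold_gap(approve, gold), false_positive_rate(approve, gold))
```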

5 Research agendas I’m excited about

  • Update-hacking benches. Controlled, longitudinal tasks where \(h_t\) is measured, with randomized access to timing, incentives, and assistance; report drift/false-positive dynamics, not just single-turn approval. (Start from the U-Sophistry metrics; extend to longitudinal drift.)
  • Surrogate-guided influence audits. Fit \(\hat U\) on non-adversarial logs; then (a) search for attack sequences that move \(h_t\) and (b) test defences (budgeting, randomized gating, assistance) against these sequences before deployment.
  • Assistance safety curves. For critique-assistance, map when/why it helps vs. hurts under adversarial pressure; design assistance-hardening methods that keep boosting detection even when the model optimizes against the assistance itself.
  • Anti-sycophancy preference modeling. Diagnose which data features drive sycophancy and build PMs that explicitly down-weight belief-matching; compare RL and best-of-\(N\) against baseline vs. “non-sycophantic” PMs.
  • No-reward generalization probes. Hold back approval signals and study whether performative persuasion persists; this helps separate “learned task competence” from “learned evaluator-shaping.”

In broad form, human reward hacking is Goodhart’s law wearing a human mask: optimize what a learning evaluator rewards, and you can end up shaping the evaluator rather than the world. In narrow form (RLHF), we can already see the imprint—sycophancy in preference data and outputs, U-Sophistry raising approval without correctness—and also the contours of a fix: assist the human, measure the gaps, and design the interaction so update-hacking is hard. Easy to say, but it’s hard to imagine how this scales when the language model is strictly more powerful than the human. Can we steer onto the narrow path where the human-AI system still ends up somewhere virtuous? How?

6 References

Benton, Wagner, Christiansen, et al. 2024. “Sabotage Evaluations for Frontier Models.”
DeYoung. 2013. “The Neuromodulator of Exploration: A Unifying Theory of the Role of Dopamine in Personality.” Frontiers in Human Neuroscience.
Dezfouli, Ashtiani, Ghattas, et al. 2019. “Disentangled Behavioural Representations.” In Advances in Neural Information Processing Systems.
Dezfouli, Griffiths, Ramos, et al. 2019. “Models That Learn How Humans Learn: The Case of Decision-Making and Its Disorders.” PLOS Computational Biology.
Dezfouli, Nock, and Dayan. 2020. “Adversarial Vulnerabilities of Human Decision-Making.” Proceedings of the National Academy of Sciences.
Greenblatt, Denison, Wright, et al. 2024. “Alignment Faking in Large Language Models.”
Laine, Chughtai, Betley, et al. 2024. “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.”
Peterson, Bourgin, Agrawal, et al. 2021. “Using Large-Scale Experiments and Machine Learning to Discover Theories of Human Decision-Making.” Science.
Robertazzi, Vissani, Schillaci, et al. 2022. “Brain-Inspired Meta-Reinforcement Learning Cognitive Control in Conflictual Inhibition Decision-Making Task for Artificial Agents.” Neural Networks.
Saunders, Yeh, Wu, et al. 2022. “Self-Critiquing Models for Assisting Human Evaluators.”
Sharma, Tong, Korbak, et al. 2025. “Towards Understanding Sycophancy in Language Models.”
Wen, Zhong, Khan, et al. 2024. “Language Models Learn to Mislead Humans via RLHF.”