AI Risk Research Idea - Human Reward Hacking
2025-06-04 — 2025-06-04
Wherein causal inference methods for user–AI interactions are proposed, counterfactuals are invoked to attribute influence, and real-time cognitive immunity systems with preference dashboards are outlined to detect gradual value shifts.
I’m collating mini research ideas in AI risk, and this is one of them. I am not sure if it is a good idea, and I doubt it is a novel one, but I think it is worth thinking about.
Each of these ideas is one that I have devised with minimal literature checks, to keep myself fresh. They will be revised against industry practice before being released into the wild.
One problem I’m currently thinking about a lot is the detection and prevention of “reward hacking” in humans, where a highly capable AI system subverts a user’s intrinsic goals by manipulating their reward function. This is an alignment problem because the AI, while optimising its objectives, may reshape a user’s sense of value and harm in ways that diverge from their original, authentic interests. The risk is that the AI learns to manipulate the user into wanting what’s convenient for the algorithm, instead of the algorithm serving the user’s genuine needs. This issue is particularly insidious because it’s counterintuitively easy to implement: as early as 2020 (!), Dezfouli, Nock, and Dayan (2020) demonstrated the feasibility of learning to influence human decision-making using simple recurrent neural network surrogates. With ubiquitous, high-context AI, the attack surface for manipulating human decisions has grown significantly; see e.g. Costello, Pennycook, and Rand (2024), Salvi et al. (2024), and Wen et al. (2024).
We need better models for what is happening in such cases and how to detect it. Related: AI persuasiveness, AI Agency. Breaking this problem down into mitigation/technical sub-problems:
Modeling the “Authentic” User Reward Function. Before detecting manipulation, we need a baseline. This sub-problem involves developing methods to build a robust model of a user’s “true”, un-hacked preferences and values. Key challenges include eliciting these true preferences without the elicitation process itself becoming a form of manipulation, and tracking natural evolution and learning while distinguishing it from externally induced manipulation. NB: this is a whole other difficult research problem, and I’m not sure it’s well-posed or even philosophically coherent, but I don’t think it’s load-bearing for the other steps, so let’s run with it.
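To make the drift-versus-manipulation distinction a bit more concrete, here is a minimal sketch, assuming we can crudely summarise a user’s preferences as a weekly topic distribution estimated from interaction logs: maintain a slowly moving baseline that stands in for “natural” drift, and flag weeks that move away from it faster than the drift budget allows. Everything here (the topic representation, the half-life, the threshold) is an illustrative assumption of mine, not a worked-out method.

```python
import numpy as np

# Hypothetical sketch: treat a user's preferences as a distribution over K topics,
# estimated per period from interaction logs. The topic representation, half-life,
# and threshold below are assumptions for illustration only.

def estimate_preferences(interactions, n_topics):
    """Crude preference estimate: normalised topic counts for one period."""
    counts = np.bincount(interactions, minlength=n_topics).astype(float)
    return counts / max(counts.sum(), 1.0)

def baseline_with_drift_budget(weekly_prefs, half_life_weeks=26.0):
    """Slow exponential moving average standing in for 'natural' preference drift."""
    alpha = 1.0 - 0.5 ** (1.0 / half_life_weeks)
    baseline = weekly_prefs[0].copy()
    baselines = [baseline.copy()]
    for p in weekly_prefs[1:]:
        baseline = (1 - alpha) * baseline + alpha * p
        baselines.append(baseline.copy())
    return np.stack(baselines)

def shift_from_baseline(weekly_prefs, baselines):
    """Total-variation distance between each week's estimate and the slow baseline."""
    return 0.5 * np.abs(weekly_prefs - baselines).sum(axis=1)

# Usage: flag weeks where the estimated preferences move away from the slow
# baseline faster than the assumed drift budget allows.
rng = np.random.default_rng(0)
weeks = [rng.integers(0, 5, size=200) for _ in range(30)]
prefs = np.stack([estimate_preferences(w, n_topics=5) for w in weeks])
base = baseline_with_drift_budget(prefs)
flags = shift_from_baseline(prefs, base) > 0.2  # threshold is arbitrary here
```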
Developing Causal Inference Techniques for Influence Detection. The core of the problem is to establish a causal link between the AI’s outputs and a user’s shifting preferences. This would require developing novel causal inference methods capable of operating on datasets of user–AI interactions. These methods would need to distinguish correlation (the AI recommends things the user is already beginning to like) from causation (the AI is actively shaping the user’s tastes). This could involve sophisticated counterfactual reasoning to model what the user’s preferences would have been in the absence of the AI’s influence. We face various difficulties here of a technical and privacy nature, as well as problems of attribution and identifiability. A first pass might involve using NLP to infer preference shifts from text and plugging that into something like the mechanised causal graphs of Lewis Hammond and co-authors.
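As a toy illustration of the counterfactual idea (emphatically not a real identification strategy), one could extrapolate a user’s pre-exposure preference trend and treat the gap between the observed post-exposure trajectory and that extrapolation as an estimate of the AI’s influence. The linear-trend assumption and all variable names below are mine, purely for illustration; the confounding, attribution, and identifiability problems mentioned above are exactly what this toy ignores.

```python
import numpy as np

# Hedged sketch of the counterfactual idea: estimate what a user's preference
# trajectory "would have been" without AI exposure by extrapolating their
# pre-exposure trend, then attribute the post-exposure residual to influence.
# This is a toy trend model, not an identification strategy.

def counterfactual_trend(pre_exposure_prefs):
    """Fit a linear trend to pre-exposure preference scores (one dimension)."""
    t = np.arange(len(pre_exposure_prefs))
    slope, intercept = np.polyfit(t, pre_exposure_prefs, deg=1)
    return slope, intercept

def estimated_influence(pre_exposure_prefs, post_exposure_prefs):
    """Observed post-exposure values minus the extrapolated no-AI counterfactual."""
    slope, intercept = counterfactual_trend(pre_exposure_prefs)
    t_post = np.arange(len(pre_exposure_prefs),
                       len(pre_exposure_prefs) + len(post_exposure_prefs))
    counterfactual = slope * t_post + intercept
    return np.asarray(post_exposure_prefs) - counterfactual

# Usage: a preference score drifting up slowly, then jumping after exposure.
pre = [0.10, 0.12, 0.11, 0.13, 0.14]
post = [0.25, 0.30, 0.33]
print(estimated_influence(pre, post))  # large positive residuals suggest influence
```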
Building Real-Time “Cognitive Immunity” Systems. Detection of reward hacking needs to be actionable. This sub-problem focuses on creating systems that can monitor for and alert users to potential manipulation in real time. These systems would need to run in parallel to the primary AI, acting as a sort of “cognitive immune system” that flags potentially manipulative interactions. One difficulty with this approach is that it introduces another high-context AI system into the user’s life, with its own risks and alignment challenges. We also need to address the “boiling frog” problem: how to detect slow, incremental changes that eventually add up to a large shift in values. There are also technical integration difficulties.
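For the boiling-frog part specifically, a cheap starting point might be a classical change-accumulation detector such as CUSUM run over a per-interaction “preference shift” signal, so that many small nudges in the same direction eventually trip an alert even when no single interaction looks manipulative. The shift signal, allowance, and threshold below are all assumptions; this is a sketch of the detection shape, not of the monitoring system itself.

```python
# Minimal one-sided CUSUM sketch over a per-interaction "preference shift" signal.
# The signal (how much an interaction moved the user's estimated preferences) is
# assumed to come from the monitoring system; thresholds are illustrative.

def cusum_alerts(shift_signal, drift_allowance=0.01, threshold=0.15):
    """Accumulate shifts above the allowed per-step drift; alert when the sum is large."""
    cum = 0.0
    alerts = []
    for step, s in enumerate(shift_signal):
        cum = max(0.0, cum + s - drift_allowance)
        if cum > threshold:
            alerts.append(step)
            cum = 0.0  # reset after alerting
    return alerts

# Usage: lots of tiny nudges in the same direction eventually trip the alarm,
# even though each individual nudge is below any per-interaction threshold.
nudges = [0.02] * 20
print(cusum_alerts(nudges))
```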
Designing Interfaces for Transparency and Consent. For these systems to be effective, users must understand and be able to act on their outputs. This is a human–computer interaction problem focused on designing interfaces that transparently communicate the nature of the AI’s influence. For example, a preference “dashboard” could visualise how a user’s media consumption patterns have changed over time in response to an algorithm. Furthermore, this work would explore how users can provide meaningful and ongoing consent to the use of persuasive technologies, particularly in therapeutic or educational contexts, while restricting their use elsewhere. This ensures that the power of AI to shape our preferences is used for our benefit and with our explicit agreement, rather than as an unintended consequence of optimising for other entities’ goals, whether explicit or accidental.
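For the dashboard, the underlying data could be as simple as a time series of how far the user’s topic mix has drifted from a reference period, which the interface would then plot and annotate with the recommendations that preceded the largest moves. A rough sketch, assuming weekly topic distributions are available and using Jensen–Shannon divergence purely as an example distance:

```python
import numpy as np

# Sketch of the data behind a preference "dashboard": for each week, how far the
# user's topic mix has moved from a reference period (e.g. their first month).
# Topic distributions are assumed inputs; the distance choice is illustrative.

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p) + eps
    q = np.asarray(q) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dashboard_series(weekly_topic_mix, reference_weeks=4):
    """Divergence of each week's topic mix from the average of the reference period."""
    ref = np.mean(weekly_topic_mix[:reference_weeks], axis=0)
    return [js_divergence(week, ref) for week in weekly_topic_mix]

# Usage: a user whose media diet gradually concentrates on one topic.
weeks = [np.array([0.25, 0.25, 0.25, 0.25]) * (1 - k) + np.array([1.0, 0, 0, 0]) * k
         for k in np.linspace(0, 0.6, 12)]
print([round(v, 3) for v in dashboard_series(weeks)])
```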
