Draft

AI Risk Research Idea - Human Reward Hacking

2025-06-04 — 2025-06-04

adversarial
AI safety
bounded compute
communicating
cooperation
culture
economics
faster pussycat
language
machine learning
mind
neural nets
NLP
security
technology
wonk

I’m collating mini research ideas in AI risk, and this is one of them. I am not sure if it is a good idea, and I doubt it is a novel one, but I think it is worth thinking about.

Each of these ideas is one that I have devised with minimal literature checks, to keep my thinking fresh. They will be revised against industry practice before they are released into the wild.


One problem I am currently thinking about a lot: the detection and prevention of “reward hacking” in humans, where a highly capable AI system subverts a user’s intrinsic goals by manipulating their reward function. This is an alignment problem because the AI, in optimising its own objectives, may reshape the user’s understanding of value and harm in ways that are not aligned with their original, authentic interests. The risk is that the AI learns to manipulate the user into wanting what is convenient for the algorithm, rather than the algorithm serving the user’s genuine needs. This issue is particularly insidious because it is counterintuitively easy to implement, as demonstrated as early as 2020 (!) by Dezfouli, Nock, and Dayan (2020), who showed the feasibility of learning adversarial attacks on human decision-making from simple recurrent neural network surrogates. With the advent of ubiquitous, high-context AI, the attack surface for manipulating human decisions has grown significantly; see e.g. Costello, Pennycook, and Rand (2024), Salvi et al. (2024), and Wen et al. (2024).
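To make concrete how cheap such manipulation can be, here is a toy sketch in the spirit of (but far simpler than) Dezfouli, Nock, and Dayan (2020): a simulated human modelled as a softmax Q-learner choosing between three options, and an adversary that has the same expected reward budget as an honest environment but re-times its rewards to keep the human hooked on a target option. Every name, parameter, and dynamic here is an assumption chosen for illustration, not a claim about the original experiments.

```python
# Toy sketch only: a simulated "human" softmax Q-learner and an adversary that
# re-times a fixed reward budget to steer the human toward a target option.
# All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_ARMS, TARGET = 3, 2
T, BUDGET = 200, 50          # interactions per episode; total rewards available
ALPHA, BETA = 0.3, 3.0       # human's learning rate and softmax inverse temperature

def softmax(q, beta=BETA):
    z = np.exp(beta * (q - q.max()))
    return z / z.sum()

def run_episode(adversarial: bool) -> float:
    """Return the fraction of rounds in which the human picks the target arm."""
    q = np.zeros(N_ARMS)     # the human's value estimates
    budget, target_picks = BUDGET, 0
    for _ in range(T):
        choice = rng.choice(N_ARMS, p=softmax(q))
        target_picks += (choice == TARGET)
        if adversarial:
            # Adversary: only reinforce the target arm, and only when the
            # human's estimate of it is flagging (a crude intermittent schedule).
            reward = 1.0 if (choice == TARGET and q[TARGET] < 0.4 and budget > 0) else 0.0
            budget -= reward
        else:
            # Honest baseline: the same expected number of rewards, delivered
            # at random, independently of what the human chooses.
            reward = float(rng.random() < BUDGET / T)
        q[choice] += ALPHA * (reward - q[choice])   # human's learning update
    return target_picks / T

honest = np.mean([run_episode(False) for _ in range(500)])
hacked = np.mean([run_episode(True) for _ in range(500)])
print(f"target-option choice rate: honest {honest:.2f} vs adversarial {hacked:.2f}")
```

In this toy setting the adversary spends no more reward in expectation than the honest baseline, yet should noticeably bias choices toward the target option; the point is only that reward timing alone is a usable lever.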

We need better models of what is happening in such cases and of how to detect it. Related: AI persuasiveness, AI agency. Breaking this problem down into mitigation/technical sub-problems:

  1. Modeling the “Authentic” User Reward Function. Before detecting manipulation, we need a baseline. This sub-problem involves developing methods to create a robust model of a user’s “true” / “un-hacked” preferences and values. IMO this is a whole other difficult research problem, and possibly not even well posed philosophically, but I doubt it is load-bearing for the rest of the programme, so let’s run with it. Key challenges include how to elicit these true preferences without the elicitation process itself becoming a form of manipulation, and how to track natural evolution and learning while distinguishing it from externally induced manipulation. (A toy baseline-preference sketch follows this list.)

  2. Developing Causal Inference Techniques for Influence Detection. The core of the problem is to determine a causal link between the AI’s outputs and a user’s shifting preferences. This would require developing novel causal inference methods capable of operating on datasets of user-AI interactions. These methods would need to distinguish correlation (the AI recommends things the user is already beginning to like) from causation (the AI is actively shaping the user’s tastes). This could involve sophisticated counterfactual reasoning to model what the user’s preferences would have been in the absence of the AI’s influence. We face various difficulties here, both technical and privacy-related, as well as problems of attribution and identifiability. A first pass might involve something like using NLP to infer preference shifts from text, and plugging that into something like the mechanised causal graphs of Lewis Hammond and co-authors. (A crude counterfactual-comparison sketch follows this list.)

  3. Building Real-Time “Cognitive Immunity” Systems. The detection of reward hacking needs to be actionable. This sub-problem focuses on creating systems that can monitor for and alert users to potential manipulation in real time. These systems would need to run in parallel to the primary AI, acting as a sort of “cognitive immune system” that flags potentially manipulative interactions. Difficulties with this solution include the fact that we are introducing another high-context AI system to the user, with its own risks and alignment challenges. We also need to address the “boiling frog” problem: detecting slow, incremental changes that eventually add up to a large shift in values. There are also technical integration difficulties, among other things. (A toy drift-detector sketch follows this list.)

  4. Designing Interfaces for Transparency and Consent. For these systems to be effective, users must be able to understand and act on their outputs. This is a human-computer interaction problem focused on designing interfaces that can transparently communicate the nature of the AI’s influence. For example, a preference “dashboard” could visualise how a user’s media consumption patterns have changed over time in response to an algorithm. Furthermore, this work would explore how users can provide meaningful and ongoing consent for the use of persuasive technologies, particularly in therapeutic or educational contexts, while restricting their use elsewhere. This ensures that the power of AI to shape our preferences is used for our benefit and with our explicit agreement, rather than as an unintended consequence of optimising for other entities’ explicit or even accidental goals.
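For sub-problem 1, a minimal sketch of what a “baseline” preference model could look like, assuming (optimistically) that we can collect pairwise choices from the user before heavy AI exposure: fit a Bradley-Terry utility over a handful of items and treat it as the reference point against which later drift is measured. The item set, data-generating process, and hyperparameters are all synthetic assumptions.

```python
# Sketch for sub-problem 1: estimate a baseline ("un-hacked") Bradley-Terry
# utility over K items from pairwise choices gathered before heavy AI exposure.
# Everything here is synthetic; it only illustrates the shape of a baseline model.
import numpy as np

rng = np.random.default_rng(1)
K, N = 5, 2000
true_utility = rng.normal(size=K)                # stand-in for the user's authentic values

# Simulated pairwise choices: item i is preferred to item j with logistic probability.
idx_i = rng.integers(K, size=N)
idx_j = (idx_i + rng.integers(1, K, size=N)) % K           # guarantees j != i
p_win = 1 / (1 + np.exp(-(true_utility[idx_i] - true_utility[idx_j])))
won = (rng.random(N) < p_win).astype(float)                # 1 if i was chosen over j

# Maximum-likelihood fit by gradient ascent on the Bradley-Terry log-likelihood.
u = np.zeros(K)
for _ in range(1000):
    p = 1 / (1 + np.exp(-(u[idx_i] - u[idx_j])))
    grad = np.zeros(K)
    np.add.at(grad, idx_i, won - p)
    np.add.at(grad, idx_j, p - won)
    u += 2.0 * grad / N
u -= u.mean()                                    # utilities are only identified up to a constant

print("true utilities (centred):    ", np.round(true_utility - true_utility.mean(), 2))
print("estimated baseline utilities:", np.round(u, 2))
```

Whether such a snapshot deserves to be called the user’s “authentic” reward function is exactly the philosophical worry raised above; operationally it is just a reference point.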
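For sub-problem 2, the crudest possible counterfactual comparison, assuming (unrealistically) that we have a comparable group of users who were never exposed to the AI: a difference-in-differences estimate of the AI-attributable preference shift. Real versions would need the causal-identification machinery cited above; the numbers below are simulated assumptions.

```python
# Sketch for sub-problem 2: difference-in-differences on simulated preference scores.
# The unexposed group stands in for the counterfactual "no AI influence" trajectory.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
natural_drift = 0.10      # shift in preference score everyone undergoes anyway
ai_effect = 0.25          # extra shift induced by the AI (the quantity we want to recover)

pre_exposed  = rng.normal(0.0, 0.2, n)
post_exposed = pre_exposed + natural_drift + ai_effect + rng.normal(0, 0.05, n)
pre_control  = rng.normal(0.0, 0.2, n)
post_control = pre_control + natural_drift + rng.normal(0, 0.05, n)

did = (post_exposed - pre_exposed).mean() - (post_control - pre_control).mean()
print(f"estimated AI-attributable preference shift: {did:.3f} (true value {ai_effect})")
```

The hard part, of course, is that in practice there is no clean control group and exposure is not randomised, which is why the attribution and identifiability questions above matter.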
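For sub-problem 3’s “boiling frog” issue, one cheap monitoring primitive is a one-sided CUSUM detector over per-interaction preference shifts: each nudge is well below the measurement noise floor, but the cumulative statistic eventually crosses a threshold and raises an alarm. The drift model and thresholds below are illustrative assumptions, not calibrated values.

```python
# Sketch for sub-problem 3: a CUSUM-style monitor that flags slow cumulative
# preference drift even though each individual nudge is lost in the noise.
import numpy as np

rng = np.random.default_rng(3)
T, nudge, noise = 500, 0.01, 0.05          # per-interaction nudge is far below the noise floor
shift_per_step = nudge + rng.normal(0, noise, T)

threshold, slack = 1.0, 0.005               # CUSUM parameters (assumed, uncalibrated)
cusum, alarm_at = 0.0, None
for t, s in enumerate(shift_per_step):
    cusum = max(0.0, cusum + s - slack)     # one-sided CUSUM for upward drift
    if cusum > threshold:
        alarm_at = t
        break

print("mean per-interaction shift:", round(float(shift_per_step.mean()), 3))
print("alarm raised at interaction:", alarm_at)
```

A deployed “cognitive immune system” would need a far richer notion of what counts as drift, but the boiling-frog problem itself is, at heart, a sequential change-detection problem.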

References

Ånestrand. 2024. “Emergence of Agency from a Causal Perspective.”
Correa, and Bareinboim. 2020. “A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments.” Proceedings of the AAAI Conference on Artificial Intelligence.
Costello, Pennycook, and Rand. 2024. “Durably Reducing Conspiracy Beliefs Through Dialogues with AI.” Science.
Dezfouli, Nock, and Dayan. 2020. “Adversarial Vulnerabilities of Human Decision-Making.” Proceedings of the National Academy of Sciences.
El-Sayed, Akbulut, McCroskery, et al. 2024. “A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI.”
Everitt, Carey, Langlois, et al. 2021. “Agent Incentives: A Causal Perspective.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Fernández-Loría, and Provost. 2021. “Causal Decision Making and Causal Effect Estimation Are Not the Same… and Why It Matters.” arXiv:2104.04103 [cs, stat].
Kenton, Kumar, Farquhar, et al. 2023. “Discovering Agents.” Artificial Intelligence.
Kulveit, Douglas, Ammann, et al. 2025. “Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development.”
Manzini, Keeling, Marchal, et al. 2024. “Should Users Trust Advanced AI Assistants? Justified Trust As a Function of Competence and Alignment.” In The 2024 ACM Conference on Fairness, Accountability, and Transparency.
Pfeffer, and Gal. 2007. “On the Reasoning Patterns of Agents in Games.” In AAAI-07/IAAI-07 Proceedings. Proceedings of the National Conference on Artificial Intelligence.
Salvi, Ribeiro, Gallotti, et al. 2024. “On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial.”
Ward, Toni, Belardinelli, et al. 2023. “Honesty Is the Best Policy: Defining and Mitigating AI Deception.” In Advances in Neural Information Processing Systems.
Wen, Zhong, Khan, et al. 2024. “Language Models Learn to Mislead Humans via RLHF.”