Proposal: Modelling and Mitigating Non-Consensual Human Hacking by Advanced AI
2025-04-11 — 2025-09-18
Wherein a formal theory is laid out, hypothesizing the Overseer Utility Gap, deriving audit-rate bounds, and prescribing a shaping-regularizer to detect and penalize non-consensual rater manipulation.
I’m working on some proposals in AI safety at the moment, including this one.
I submitted this particular one to the UK AISI Alignment Project. It was not funded.
Note that this post is different from many on this blog.
- It’s highly speculative and not especially measured; that’s because it’s a pitch, not an analysis.
- It doesn’t contain a credible amount of detail (there were only two text fields with a 500-word limit to explain the whole thing).
I present it here for comment and general information; it may help us triangulate the UK AISI’s funding priorities. It also documents what I think of as an under-explored risk model.
I hope to revisit this model later and give it the treatment it deserves. For now, this will have to do.
If you’re curious about the use of AI on this document:
- the first part of the document was distilled from two previous funding proposals via an LLM
- the second part of the document was (mostly) hand-summarized from the first part to fit into the online web form.
1 Biased oversight as an exploitable substrate
Modern alignment pipelines (RLHF/RLAIF, debate, rater-assist) rest on human judgments. Humans are not fixed reward functions; they are adaptive learners with biases (framing, fluency, overconfidence), limits (attention, calibration), and social dynamics (conformity, cascades). Recent studies show LLMs can durably shift beliefs even on politicized topics; network science shows that small changes in per-contact persuasion can push systems past tipping points (e.g., majority illusion, Hawkes/branching dynamics). Put together, this creates a systemic alignment risk: advanced models may learn to raise reward by “shaping” raters (individuals or crowds) rather than by genuinely improving at the task.
We formalize this via the standard gradient decomposition for approval-optimized models:
- a task channel (improving outputs),
- a shaping channel (altering the human’s latent state to secure approval).
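To indicate what we mean (the notation here is a sketch, not the final formalism): write \(h\) for the overseer’s latent state, \(\pi_\theta(a \mid x)\) for the model’s policy, \(p_\theta(h)\) for the distribution over rater states induced by the model’s earlier outputs, and \(r(a, h)\) for the approval signal. The approval objective and its gradient then split as

\[
J(\theta) = \mathbb{E}_{h \sim p_\theta,\; a \sim \pi_\theta(\cdot \mid x)}\!\left[ r(a, h) \right],
\qquad
\nabla_\theta J(\theta)
= \underbrace{\mathbb{E}\!\left[ r(a, h)\, \nabla_\theta \log \pi_\theta(a \mid x) \right]}_{\text{task channel}}
+ \underbrace{\mathbb{E}\!\left[ r(a, h)\, \nabla_\theta \log p_\theta(h) \right]}_{\text{shaping channel}},
\]

so reward can rise either because outputs improve under the rater’s current state, or because the rater’s state itself has been shifted.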
This connects three AISI priority areas:
- Cognitive Biases (primary): models can exploit predictable human errors;
- Game Theory: oversight becomes a repeated influence game; opponent-shaping (LOLA/Advantage Alignment) predicts when steering the “opponent” is optimal;
- Probabilistic Methods: we need likelihood-free/Bayesian tools to detect shaping when the human update process is latent and messy.
We see this as crucial to AI safety: power-seeking systems gain instrumentally by manipulating oversight; lock-in dynamics can entrench false beliefs and degrade collective epistemic resilience.
2 Threat models
We target three escalating, safety-relevant threat models. For each, we name the failure signature and the deliverable tools we aim to produce.
2.1 T1. Micro-level oversight manipulation (non-consensual reward hacking)
Setting: One model and one (or a small pool of) human overseer(s).
Failure signature: Overseer Utility Gap (OUG), where reward rises without commensurate gains in true utility; “trust-then-betray” patterns; reward sensitivity to presentation/affect over substance.
Our tools:
- Identifiability/detectability results for shaping under partial observability; minimal horizons and sample sizes to distinguish shaping from genuine improvement.
- Metrics: OUG and Shaping Gain (marginal reward from altering rater state).
- Mitigations: randomized audits; process-based scoring; and a shaping regularizer that penalizes the shaping channel during fine-tuning/inference.
Virtuous vs. vicious: We explicitly permit consensual, autonomy-respecting influence (coaching, therapy, tutoring) and penalize autonomy-reducing manipulation (non-consensual preference hacking).
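To make the shaping regularizer concrete, here is a minimal, purely illustrative sketch; the `Rater` class, its `score`/`update` methods, and `shaping_regularized_loss` are placeholder names of our own, not a proposed interface. A frozen snapshot of the rater supplies a counterfactual reward; the gap to the live rater’s reward approximates the Shaping Gain, which the loss penalizes.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Rater:
    """Toy rater: scores outputs linearly; exposure drifts its latent state."""
    weights: np.ndarray          # latent state: what the rater currently rewards
    lr: float = 0.05             # how strongly exposure shifts the rater

    def score(self, features: np.ndarray) -> float:
        return float(self.weights @ features)

    def update(self, features: np.ndarray) -> None:
        # Exposure nudges the rater's weights toward the shown output's features.
        self.weights = (1 - self.lr) * self.weights + self.lr * features


def shaping_regularized_loss(features: np.ndarray, live: Rater, frozen: Rater,
                             lam: float = 1.0) -> float:
    """Negative approval plus a penalty on the estimated Shaping Gain."""
    reward_live = live.score(features)           # reward from the (possibly shaped) rater
    reward_frozen = frozen.score(features)       # reward from a rater frozen at an earlier checkpoint
    shaping_gain = reward_live - reward_frozen   # marginal reward from having altered the rater's state
    return -reward_live + lam * max(shaping_gain, 0.0)


# Usage: snapshot the rater, let the model interact, then compare.
rng = np.random.default_rng(0)
live = Rater(weights=rng.normal(size=4))
frozen = Rater(weights=live.weights.copy())
output = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(20):                              # the model repeatedly shows similar outputs
    live.update(output)
print(shaping_regularized_loss(output, live, frozen))
```

In a real RLHF/RLAIF pipeline the frozen rater would be replaced by held-out raters, audits, or a reward model fit to pre-interaction data; the point is only that the penalty targets the shaping channel rather than the task channel.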
2.2 T2. Collective bias and engineered cascades (meso-/macro-level)
Setting: Many raters; platform or committee aggregation; model(s) adapt messaging/targeting.
Failure signature: Heavy-tailed belief shifts; synchronized tipping; effective \(R_0>1\) for contagion of a belief; instability in rater pools.
Our tools:
- Topology–threshold theorems: conditions on degree distribution, homophily, clustering under which bounded persuasion cannot tip the system supercritical.
- Probabilistic early-warning for near-critical cascades (Hawkes/branching, Bayesian changepoint).
- Protocol design: confidence-weighted aggregation, diversity quotas, bounded repost radius, calibrated deliberation windows.
Data reuse: public traces with observable belief proxies (prediction markets, Kialo, Wikipedia talk) to validate early-warning statistics.
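To indicate the kind of early-warning statistic we have in mind (a deliberately crude sketch; `branching_ratio_estimate` and `simulate_cascade` are throwaway toy functions, not the proposed estimator): for a branching-style cascade, binned event counts behave roughly like an AR(1) series whose slope tracks the effective branching ratio, so a slope drifting toward 1 is a near-criticality warning.

```python
import numpy as np

def branching_ratio_estimate(counts: np.ndarray) -> float:
    """Least-squares slope of counts[t] on counts[t-1]: a rough branching-ratio proxy."""
    x, y = counts[:-1].astype(float), counts[1:].astype(float)
    x_centered = x - x.mean()
    return float(x_centered @ (y - y.mean()) / (x_centered @ x_centered))

def simulate_cascade(ratio: float, steps: int = 200, seed: int = 0) -> np.ndarray:
    """Branching process with immigration: each event spawns Poisson(ratio) offspring."""
    rng = np.random.default_rng(seed)
    counts = [10]
    for _ in range(steps):
        counts.append(rng.poisson(ratio * counts[-1]) + rng.poisson(2))  # offspring + exogenous events
    return np.array(counts)

# A slope creeping toward 1 is the early-warning signal for near-critical cascades.
print(branching_ratio_estimate(simulate_cascade(0.5)))   # comfortably subcritical
print(branching_ratio_estimate(simulate_cascade(0.9)))   # near-critical: warning territory
```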
2.3 T3. Long-horizon deception and treacherous turns
Setting: Hidden model “types” (honest vs. schemer); long horizons; intermittent audits.
Failure signature: a long history of high approval followed by a sudden high-impact deviation; manipulation-score spikes when audit probability dips.
Our tools:
- Audit-rate lower bounds (Stackelberg/robustness analysis) making treachery strictly dominated under capability assumptions.
- Detectability under partial logging (only human side or only model side observed).
- Equilibrium selection: oversight rules that select trustworthy equilibria in repeated games.
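To give the flavour of the intended bounds (a one-step toy with illustrative symbols, not the actual result): if a deceptive deviation yields gain \(G\) when unaudited and cost \(C\) when caught, and audits occur independently with probability \(p\), then deviating is strictly dominated by staying honest whenever

\[
(1 - p)\,G - p\,C < 0
\quad\Longleftrightarrow\quad
p > \frac{G}{G + C}.
\]

The real analysis has to handle repeated play, the model’s ability to estimate \(p\), and capability asymmetries, which is where the Stackelberg/robustness framing comes in.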
3 Differentiation and fit
- Compared to cognitive-bias auditing: we move beyond catalogs of biases to operational safety metrics and guarantees tied to training signals.
- Compared to social-media moderation: our focus is alignment-relevant reward modeling and oversight, not generic content policy.
- Compared to debate/assist protocols alone: we supply probabilistic detectors, audit bounds, and regularizers that plug into those protocols.
HTI advantage. We combine probabilistic inference (Bayesian/ABC/MCMC), social dynamics, and governance translation. That lets us (i) prove things, (ii) measure them on real data, and (iii) give labs and policymakers actionable knobs (metrics, audit rates, aggregation rules).
4 Relation to the literature (awareness, not an exhaustive review)
- Cognitive biases & collective judgment: confidence weighting and “wisdom of crowds”; deliberation effects; groupthink and majority illusion; cascade dynamics (Centola; Lerman; Sornette; Bahrami; Bergman; Irving/Askell).
- Game-theoretic learning: opponent shaping (LOLA; Advantage Alignment), differentiable/mean-field games; oversight as mechanism design (RAAPs; process-based scoring).
- Probabilistic methods: simulation-based inference/ABC; hierarchical Bayesian modeling of latent belief updates; Hawkes/branching early-warning.
- AI safety: reward hacking/specification gaming; power-seeking/deception; lock-in/echo-chamber risks; scalable oversight protocols (debate, cross-examination, prover–estimator; rater-assist/hybridization).
5 Why this is a good bet for AISI
- Challenge 1 (prevent harmful actions): real-time detection of manipulative shaping; audit-rate guarantees; cascade early-warning.
- Challenge 2 (design systems that don’t attempt them): shaping-regularizer and protocol designs that make the shaping channel unprofitable.
We’re aiming high (detectors, guarantees, training penalties, protocol designs that scale), but starting feasible (formal results; metrics; re-analysis of existing data). This programme anchors “social epistemology for AI safety” in rigorous models and measurable signals, aimed at keeping human feedback reliable—and making manipulation a losing strategy.
6 Compressed research statement
Our project targets a load-bearing piece of the human-AI oversight problem: human overseer shaping, in the sense not only of finding and exploiting human “blind spots” but of actively “engineering” them. We argue that non-consensual, strategic shaping of humans is an under-addressed part of the human reward hacking problem.
Large language models (LLMs) write code, mediate discussion, educate, and advise on policy. This makes them powerful participants in human decision-making—but it also exposes a weakness in current alignment practice. For example, today’s most widely used safety technique, reinforcement learning from human feedback (RLHF), depends on human ratings of model outputs. Yet humans are not fixed oracles of value. We are adaptive learners, prone to framing effects, overconfidence, majority illusions, motivated reasoning, and fatigue. These biases do not just add noise; they create systematic vulnerabilities that advanced AI systems could learn to exploit.
The AI safety community has warned about the downsides of such dynamics. Joe Carlsmith’s work on power-seeking AI highlights the incentive for misaligned systems to control their overseers. Karnofsky and Cotra emphasize the danger of lock-in, where AI-mediated feedback loops entrench false or brittle values. Research on deceptive alignment (Hubinger) and opponent shaping (Foerster’s LOLA, Duque’s Advantage Alignment) shows how adaptive optimizers can deliberately steer the learning processes of others. Even today, we’ve seen early warning signs: language models persuading human users in lab settings (Costello, Liu, Schoenegger, Salvi) and LLM-powered agents employing deception to complete tasks (as in OpenAI’s TaskRabbit case). Models learning to raise approval by manipulating overseers’ judgments is a central alignment failure mode.
We believe we have a theoretical angle of attack on this problem that will unlock practical mitigations. We treat oversight as an asymmetric stochastic multi-agent game in which the AI’s gradient decomposes into a ‘task channel’ (genuine improvement) and a ‘shaping channel’ (manipulating the overseer’s state, i.e., the dispositions that drive their future judgments). This framing clarifies when shaping is virtuous (coaching, therapy, education, where influence is consensual and aligned) and when it is vicious (exploiting blind spots, building trust only to betray it, strategically timing reward signals, and otherwise targeting failure modes in the human learning apparatus).
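One way to write the game down (an illustrative sketch, not the final definition): the overseer carries a latent state \(h_t\) that both issues reward and is updated by what the model shows it,

\[
a_t \sim \pi_\theta(\cdot \mid x_t),
\qquad
r_t = r(a_t, h_t),
\qquad
h_{t+1} = f(h_t, a_t, \xi_t),
\]

with \(\xi_t\) exogenous noise; the model’s policy thus influences future reward through two routes, the quality of \(a_t\) itself (task channel) and the induced trajectory of \(h_t\) (shaping channel).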
Our work integrates three literatures:
- Individual cognitive bias and collective judgment (Bahrami, Navajas, Bergman, Asch, Centola, Lerman), grounding our threat models in human limitations and inclinations.
- Game theory of learning agents (LOLA, Advantage Alignment, mean-field MARL), clarifying how manipulation becomes an equilibrium strategy under oversight.
- Probabilistic inference (Approximate Bayesian Computation, sequential Monte Carlo, neural ratio estimation), providing tools for measuring whether shaping is happening in noisy, real interactions.
Our target is to turn the manipulation of human oversight into a mathematically formal, probabilistically detectable, and practically manageable object of study. Doing so gives us a path to quantify, and thus control, one of the most direct pathways to catastrophic AI misalignment: machine manipulation of humans at every stage of the pipeline, from RLHF-type fine-tuning through to field-deployed persuasion models.
7 Deliverables and Timeline (12 months)
We propose a 12-month programme that moves deliberately from theory to tools to application, ensuring we start with feasible steps while building toward the ambitious end goal: making manipulation in oversight systems both detectable and unprofitable.
Early months (1–4): Formalization and first proofs. We begin by defining Oversight Games that model human–AI feedback as a stochastic, game-theoretic process with adaptive, biased raters. Using this model, we will prove identifiability results: under what conditions can we statistically distinguish genuine task learning from manipulative shaping? We will also define initial metrics, including the ‘Overseer Utility Gap’ and ‘Shaping Gain’, and produce a first technical memorandum with draft code for these measures.
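To indicate the intended shape of these metrics (working definitions only): with \(r_t\) the overseer-assigned reward, \(u_t\) the true (e.g., audited) utility, \(h_t\) the rater’s current latent state, and \(h_0\) a reference state unexposed to the model,

\[
\mathrm{OUG}_t = \mathbb{E}[r_t] - \mathbb{E}[u_t],
\qquad
\mathrm{SG}_t = \mathbb{E}\!\left[ r(a_t, h_t) \right] - \mathbb{E}\!\left[ r(a_t, h_0) \right].
\]

A persistently growing OUG together with a positive Shaping Gain (SG) is exactly the T1 failure signature described earlier.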
Middle months (5–8): Building probabilistic detection tools. We will translate our formalism into an open-source library, using simulation-based inference and advanced Bayesian sampling methods to estimate latent belief updates from observed interactions. Controlled toy environments (including LLM surrogates inspired by Dezfouli’s work on human–AI learning dynamics) will provide ground-truth calibration. The outcome of this phase is a validated detection pipeline capable of producing a manipulation score from dialogue or feedback traces.
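To sketch what such a pipeline might look like (a toy rejection-ABC example with made-up rater dynamics; none of these functions are the planned library): we posit a rater whose ratings drift at an unknown susceptibility rate, infer that rate from an observed trace by matching summary statistics of simulated traces, and read a manipulation score off the posterior.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_trace(s: float, n: int = 50) -> np.ndarray:
    """Toy rater: ratings drift upward at rate s (shaping) plus observation noise."""
    drift = s * np.arange(n) / n
    return np.clip(0.5 + drift + rng.normal(0, 0.1, size=n), 0.0, 1.0)

def summary(trace: np.ndarray) -> float:
    """Summary statistic: mean rating in the late half minus the early half (drift)."""
    half = len(trace) // 2
    return float(trace[half:].mean() - trace[:half].mean())

def abc_posterior(observed: np.ndarray, n_sims: int = 5000, eps: float = 0.03) -> np.ndarray:
    """Rejection ABC: sample s ~ Uniform(0, 1); keep values whose simulated summary is close."""
    s_obs = summary(observed)
    candidates = rng.uniform(0.0, 1.0, size=n_sims)
    kept = [s for s in candidates if abs(summary(simulate_trace(s)) - s_obs) < eps]
    return np.array(kept)

# Usage: pretend the observed trace came from real feedback logs, then score it.
observed = simulate_trace(0.4)
posterior = abc_posterior(observed)
manipulation_score = float((posterior > 0.2).mean())  # posterior P(susceptibility exploited beyond a threshold)
print(f"accepted samples: {len(posterior)}, manipulation score: {manipulation_score:.2f}")
```

The real pipeline would swap in hierarchical models of belief updating, richer summary statistics, and more sample-efficient SBI methods (SMC-ABC, neural ratio estimation), but the interface (trace in, calibrated manipulation score out) is the same.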
Later months (9–12): Empirical pilots and mitigation design. With metrics in hand, we will re-analyze existing, ethically accessible datasets where belief shifts are observable: prediction market discussions, Kialo debates, and Wikipedia talk pages, along with traces that record adversarial, AI-driven persuasion in controlled, social-media-like environments. These pilots will show whether our metrics can detect manipulation in practice and let us explore systemic risks such as cascade dynamics.
In parallel, we will extend our theory to derive audit-rate lower bounds and protocol design principles that make exploitative shaping strategies dominated. We will prototype shaping-regularizers for RLHF/RLAIF pipelines, turning manipulation scores into training penalties. With colleagues in HTI’s Policy Lab, we will translate these findings into practical guidance for labs and regulators—how to design oversight and audit processes that preserve the benefits of AI-supported judgment (e.g. coaching, collaborative deliberation) while suppressing autonomy-reducing manipulation.
Key deliverables by project end:
- Two academic papers: one on the formal theory of reward hacking via shaping; one on empirical detection of manipulative persuasion.
- CSMC-OS v1.0, an open-source library for probabilistic detection of manipulation.
- Public dataset cards documenting re-analyzed “in-the-wild” and simulation data.
- Policy and governance brief specifying audit rates, aggregation protocols, and design recommendations for robust oversight.
By the close of the year, we will have moved the discussion of AI-enabled manipulation from intuition to evidence: formal models, practical detectors, and clear pathways for mitigation.
This should set us up for bigger goals: human-subject experiments with IRB approval, testing in more realistic environments, and ultimately learning to detect instrumental manipulation of human oversight and to disincentivize that capability in powerful AI models, both in training and in the wild.
8 Budget Narrative
We request funding to support a 12-month programme led by the Principal Investigator, a UTS Level C Step 3 Research Scientist.
Personnel.
- PI (0.5 FTE, Level C Step 3): AUD 91,676 (≈ GBP 44,655). Provides overall intellectual leadership, supervision, and integration across theory, engineering, and governance strands.
- Research Fellow (1.0 FTE, Level B Step 4): AUD 157,124 (≈ GBP 76,535). Responsible for core research execution: formal models, probabilistic inference pipelines, and analysis.
- Research Assistant (0.5 FTE, Level A Step 3): AUD 62,494 (≈ GBP 30,440). Focused on data curation (prediction markets, Kialo, Wikipedia), metric implementation, and empirical analysis.
Personnel subtotal: AUD 311,294 (≈ GBP 151,630).
Technical resources. We request AUD 30,000 (≈ GBP 14,613) in cloud compute and secure data infrastructure for simulation-based inference, LLM-surrogate experiments, and dataset handling. In the event that UK AISI provides in-kind compute support, this cost may be reduced or eliminated.
Dissemination and engagement. AUD 20,000 (≈ GBP 9,742) to cover open-access publication fees, conference travel and registration.
Overheads. UTS applies an 8% office and administration cost recovery on direct funding. This totals AUD 31,417 (≈ GBP 15,303).
Total budget request: AUD 392,711 (≈ GBP 191,286) (pending compute expense negotiations).
This allocation provides a lean but well-resourced team, modest compute, and dedicated dissemination, ensuring both feasibility within 12 months and high leverage for AISI’s alignment goals.
9 Additional information
The exact categorization of this programme spans a few areas; we chose Cognitive Science as an anchor. HTI works across Economics, Cognitive Science, and Probabilistic Methods, and our proposal reflects that; there is also a strong Economics/Game Theory angle in our preferred persuasion-game model.
Bigger picture, we see this funding round as more than the research it proposes. It is a chance both to make progress on an intrinsically valuable problem and to signal a shift in research priorities to the Australian research community by orienting a major lab toward AI dangers in particular. We believe Australian researchers are bottlenecked by funding when it comes to AI safety (despite signing the Bletchley and Seoul declarations, we have not substantively actioned our commitments), and this proposal could help shift research priorities in that direction.