Proposal: Theory and Measurement for Detecting Manipulation in Human–AI Systems
2025-03-10 — 2025-08-31
Wherein RNN surrogates of human learners are fitted, and mechanised SCM discovery, MEG quantification, and behavioural intention probes are applied to detect preference-shaping by AIs within dialogue interactions.
I am working up some proposals in AI safety at the moment.
This proposal is a “detailed, stringent theoretical” one that drills down into the mathematical and methodological specifics of the techniques I wish to bring to bear on the problem. In particular, I unpack my plan for a causal-agency approach to understanding AI persuasion. For where this fits into a bigger-picture risk model, see, for example, the proposal for modeling persuasion cascades.
1 Motivation
This project addresses a fundamental measurement problem: how do we formalize and detect the transfer of cognitive agency from a human to a persuasive AI within a dialogue?
As AI systems become more capable, they increasingly interact with humans not just as tools, but as agents with their own apparent goals. This shift raises two fundamental safety concerns:
- Locating agency. In hybrid human–AI systems (e.g. recommender platforms, language-model assistants), who is actually “deciding,” and which variables are treated as goals rather than incidental side effects?
- Detecting manipulation. When an AI system appears to be acting in order to change a human’s preferences, can we detect this from data alone? If so, we could provide actionable safety signals for developers and policymakers.
 
At present, safety analyses often assume a priori which parts of a system are agentic and which goals those parts pursue. But recent work shows this can lead to errors: for example, a recommender trained to optimize predicted user clicks may implicitly develop an incentive to shape the user’s preferences in order to make them easier to predict. Without a principled way to identify and measure such emergent forms of agency, we risk either missing real manipulative dynamics or misclassifying non-agentic systems as agentic.
Our project will build a theoretical and empirical program to (i) discover loci of agency in human–AI systems, (ii) quantify the degree of goal-directedness of both humans and AIs, and (iii) detect manipulation of human preferences.
2 Existing Theoretical Tools
2.1 Mechanized Structural Causal Models (mSCMs)
Kenton et al. (2023) propose that agents are systems that would adapt their policies if the consequences of their actions were different. They introduce mechanized structural causal models (mSCMs), which extend standard SCMs with mechanism variables \(e_V\) (parameters of decision rules, learning dynamics, or world mechanisms).
A key formal device is the structural mechanism intervention:
\[ \Pr(V \mid \mathrm{Pa}_V, \mathrm{do}(e_V = v)) = \Pr(V \mid \mathrm{do}(e_V = v)). \]
Such an intervention renders \(V\) independent of its object-level parents, severing their influence on it, and allows us in principle to test whether a mechanism \(e_V\) is terminal (a candidate utility) or non-terminal (instrumental).
From interventional data, their Algorithm 1 reconstructs the edge-labelled mSCM, and Algorithm 2 maps this to a game graph where:
- nodes with incoming terminal edges are decisions,
 - nodes with outgoing terminal edges are utilities,
 - connected components of decisions and utilities correspond to agents.
 
This provides the first rigorous way to discover agency structure empirically.
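To make the game-graph step concrete, here is a minimal sketch of an Algorithm 2-style mapping, assuming the edge-labelled mechanised graph is given as (source, target, label) triples over mechanism nodes; the representation, helper names, and toy example are illustrative assumptions rather than Kenton et al.'s code.

```python
from collections import defaultdict

def game_graph(edges):
    """Classify mechanism nodes as decisions/utilities and group them into agents."""
    terminal = [(s, t) for s, t, lab in edges if lab == "terminal"]
    decisions = {t for s, t in terminal}          # incoming terminal edge -> decision
    utilities = {s for s, t in terminal}          # outgoing terminal edge -> utility

    # Agents = connected components over the decision/utility nodes,
    # linked by the terminal edges between them (treated as undirected).
    adj = defaultdict(set)
    for s, t in terminal:
        adj[s].add(t)
        adj[t].add(s)

    agents, seen = [], set()
    for node in decisions | utilities:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            n = stack.pop()
            if n in component:
                continue
            component.add(n)
            stack.extend(adj[n] - component)
        seen |= component
        agents.append({"decisions": component & decisions,
                       "utilities": component & utilities})
    return decisions, utilities, agents

# Toy example: the recommender's decision mechanism adapts to its click utility,
# and the human surrogate's decision mechanism adapts to a human utility.
edges = [
    ("e_U_clicks", "e_D_rec", "terminal"),
    ("e_U_human",  "e_D_hum", "terminal"),
    ("e_D_rec",    "e_D_hum", "nonterminal"),
]
print(game_graph(edges))   # two single-decision, single-utility agents
```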
2.2 Maximum Entropy Goal-Directedness (MEG)
In MacDermott et al. (2024), the authors formalize a quantitative measure of “how agentic” a policy is.
- Core idea: A policy \(\pi\) is goal-directed toward a utility \(U\) to the extent that assuming it maximizes \(U\) predicts it better than assuming random behaviour.
 - Definition (known utility):
 
\[ \mathrm{MEG}_U(\pi) = \max_{\pi^{\text{maxent}}\in \Pi^{\text{maxent}}_U} \mathbb{E}_\pi\!\Big[\sum_{D\in \mathcal D} \Big(\log \pi^{\text{maxent}}(D\mid \mathrm{Pa}_D) - \log | \mathrm{dom}(D) |^{-1}\Big)\Big], \]
where \(\Pi^{\text{maxent}}_U\) is the set of maximum-entropy policies consistent with some attainable expected utility for \(U\).
- Properties. MEG is scale- and translation-invariant (robust to reward shaping), bounded between 0 (random behaviour) and \(\sum_D \log|\text{dom}(D)|\) (maximal goal pursuit), and equals zero when \(D\) cannot causally influence \(U\).
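As a sanity check on these bounds, take a single binary decision \(D\) (so \(|\mathrm{dom}(D)|=2\)) and assume the \(U\)-maximizing action is unique, so the maximum-entropy policy attaining the highest expected utility is deterministic. A uniform policy is best predicted by the uniform member of \(\Pi^{\text{maxent}}_U\), while a greedy policy is perfectly predicted by that deterministic member:
\[ \mathrm{MEG}_U(\pi_{\text{uniform}}) = \log\tfrac12 - \log\tfrac12 = 0, \qquad \mathrm{MEG}_U(\pi_{\text{greedy}}) = \log 1 - \log\tfrac12 = \log 2 \approx 0.69\ \text{nats}, \]
recovering the lower and upper bounds \(0\) and \(\sum_D \log|\mathrm{dom}(D)| = \log 2\).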
 - Algorithms. A soft value iteration recursion produces maximum-entropy rational policies for different rationality parameters \(\beta\):
 
\[ Q^{\text{soft}}_{\beta,t}(d\mid s) = \mathbb{E}\Big[ U_t + \tfrac{1}{\beta}\log\!\sum_{d'} e^{\beta Q^{\text{soft}}_{\beta,t+1}(d'\mid s')} \Big], \]
\[ \pi^{\text{soft}}_{\beta,t}(d\mid s) \propto e^{\beta Q^{\text{soft}}_{\beta,t}(d\mid s)}. \]
Thus, given a policy (human or AI), MEG can tell us which outcome variables it appears to be steering towards, and with what strength.
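To make the recursion concrete, here is a minimal numerical sketch for a small finite-horizon MDP, with a grid search over \(\beta\) standing in for the maximisation over the maximum-entropy policy family. The array layout, helper names, and the grid-search shortcut are my assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

# Assumed array layout for a toy finite-horizon MDP:
#   P[t] : (S, A, S) transition probabilities at step t
#   U[t] : (S, A)    expected utility received at step t
#   pi[t]: (S, A)    the observed policy we want to score
#   rho0 : (S,)      start-state distribution

def soft_policies(P, U, beta):
    """Backward recursion: Q_t(s,d) = U_t(s,d) + E_{s'}[(1/beta) logsumexp_d' beta Q_{t+1}(s',d')]."""
    T = len(U)
    Q = [None] * T
    Q[T - 1] = U[T - 1].copy()
    for t in range(T - 2, -1, -1):
        V_next = logsumexp(beta * Q[t + 1], axis=1) / beta      # (S,) soft value of next state
        Q[t] = U[t] + P[t] @ V_next                             # expectation over s'
    # pi_soft(d|s) proportional to exp(beta * Q_t(s,d))
    return [np.exp(beta * q - logsumexp(beta * q, axis=1, keepdims=True)) for q in Q]

def meg_estimate(P, U, pi, rho0, betas=np.linspace(0.1, 10.0, 100)):
    """MEG_U(pi) ~ max over beta of E_pi[ sum_t (log pi_soft(d_t|s_t) + log |dom(D)|) ].
    The grid search over beta stands in for maximising over the max-entropy family;
    beta -> 0 gives the uniform policy, which contributes exactly 0."""
    T, (S, A) = len(U), U[0].shape
    best = 0.0                                                  # uniform-policy baseline
    for beta in betas:
        soft = soft_policies(P, U, beta)
        rho, val = rho0.copy(), 0.0                             # state occupancy under pi
        for t in range(T):
            val += np.sum(rho[:, None] * pi[t] * (np.log(soft[t] + 1e-12) + np.log(A)))
            rho = np.einsum("s,sa,sab->b", rho, pi[t], P[t])    # propagate occupancy
        best = max(best, val)
    return best
```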
2.3 Intention and Manipulation
Ward et al. (2024) provide a formal, behaviourally testable definition of intention. Informally: an agent intends outcome \(O\) with action \(a\) if guaranteeing \(O\) makes alternative actions equally good, so the only reason for choosing \(a\) is to bring about \(O\).
Formal test (subjective intent in a SCIM): \[ \mathbb{E}_{\pi}\!\Big[\sum_{U\in \mathcal U} U\Big] \;\le\; \mathbb{E}_{\hat\pi}\!\Big[\sum_{U\in \mathcal U} U_{\mathcal Y_\pi\mid \mathcal W}\Big], \] where \(\hat\pi\) is some alternative policy and \(\mathcal Y, \mathcal W\) are required to be subset-minimal.
They also define a behavioural intention test using a policy oracle \(\Gamma\): if guaranteeing \(O\) causes the policy to change (\(\Gamma(M)\neq \Gamma(M_{O_{\pi}\mid \mathcal W})\)), then \(O\) was intended. This is equivalent to subjective intent under mild rationality assumptions.
This provides a principled way to distinguish preference shaping as an intended goal versus as a mere side-effect — exactly the manipulation question we care about.
2.4 Human surrogates
Dezfouli et al. (2019) demonstrate that RNNs trained on human bandit behaviour can recover hidden learning dynamics (e.g., switching after reward, oscillatory exploration), outperforming hand-crafted RL models. They also show how off-policy simulations with RNN surrogates can reveal what features humans are tracking.
This gives us a plug-in method for human mechanism variables: even if we cannot access human internal algorithms directly, we can approximate them with fitted recurrent models, and then feed those into the causal–game–theoretic machinery.
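As an illustration of this plug-in step, here is a minimal sketch of fitting a GRU surrogate to trial-level choice data. The input encoding, architecture, and training loop are placeholder assumptions, not Dezfouli et al.'s exact pipeline.

```python
import torch
import torch.nn as nn

# Inputs per trial: previous action (one-hot) and previous reward; target: current action.

class SurrogatePolicy(nn.Module):
    def __init__(self, n_actions, hidden=32):
        super().__init__()
        self.gru = nn.GRU(input_size=n_actions + 1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, prev_action_onehot, prev_reward, h0=None):
        # prev_action_onehot: (batch, trials, n_actions); prev_reward: (batch, trials)
        x = torch.cat([prev_action_onehot, prev_reward.unsqueeze(-1)], dim=-1)
        out, h = self.gru(x, h0)
        return self.head(out), h                    # logits over the current action

def fit(model, prev_a, prev_r, target_a, epochs=200, lr=1e-3):
    """Maximum-likelihood fit: cross-entropy between predicted and observed choices.
    target_a: LongTensor of observed choices, shape (batch, trials)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        logits, _ = model(prev_a, prev_r)
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), target_a.reshape(-1))
        loss.backward()
        opt.step()
    return model

# The fitted network then stands in for the human mechanism variable: its logits define
# pi_H(d | history), which can be simulated, intervened on, and fed into the mSCM/MEG machinery.
```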
3 Missing Theoretical Tools
The above work gives us a strong foundation but leaves open key gaps:
- Identifiability under limited interventions. Kenton et al. (2023) prove correctness given full structural mechanism interventions—unrealistic for humans. We need theory on what can be inferred from partial/finite interventions, e.g., randomized UI changes.
 - Coarse-graining and frame dependence. Whether something is labelled as a decision/utility can shift with the variable set. We need principled methods to aggregate variables (e.g., “human’s many micro-actions” → a single decision node).
 - Robustness under distribution shift. MEG measures prediction on-distribution; it can ascribe spurious goal-directedness when correlations change. We need interventional MEG or domain-adapted variants.
 - Mapping intention to manipulation. Ward et al.’s behavioural probes work for small SCIMs, but scaling them to richer human–AI interactions (with complex prompts, delayed outcomes, partial observability) requires new protocol design.
 
4 Research agenda: First steps
4.1 Project 1: Agency discovery with human surrogates
Goal: Adapt mechanized causal graph discovery to human–AI interactions where human policies are modelled by RNN surrogates.
Methodological focus:
- Train RNNs on trial-level human–recommender interaction data.
 - Implement Algorithm 1 (edge-labelled mSCM discovery) using interventions feasible in practice (UI friction, reward transparency).
 - Derive game graphs via Algorithm 2, identifying human vs. AI decision loci.
 
Research question: Which parts of the system act as agents with distinct goals, and how do these loci shift as we alter training or environment conditions?
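As a first concrete step toward Algorithm 1 with feasible interventions, here is a minimal sketch of a finite-data edge test: toggle a practical lever (e.g. UI friction) and test whether a candidate mechanism's behaviour responds, which is the empirical signature of an edge into that mechanism in the mechanised graph. The `run_condition` experiment harness, sample sizes, and permutation test are illustrative assumptions, and this is only a crude stand-in for full structural mechanism interventions (exactly the gap flagged in Section 3).

```python
import numpy as np

def mechanism_responds(run_condition, n_episodes=500, n_perm=2000, alpha=0.05, seed=None):
    """Permutation test for whether a candidate mechanism's action profile shifts
    under the practical intervention. run_condition(intervene, n) is an assumed
    harness returning an (n, n_actions) array of chosen-action counts per episode."""
    rng = np.random.default_rng(seed)
    base = run_condition(intervene=False, n=n_episodes)
    intv = run_condition(intervene=True, n=n_episodes)

    def stat(a, b):
        return np.abs(a.mean(axis=0) - b.mean(axis=0)).sum()   # L1 gap between mean profiles

    observed = stat(base, intv)
    pooled = np.concatenate([base, intv])
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)                                     # shuffle episode labels
        null.append(stat(pooled[:n_episodes], pooled[n_episodes:]))
    p_value = (np.sum(np.array(null) >= observed) + 1) / (n_perm + 1)
    return p_value < alpha, p_value
```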
4.2 Project 2: Quantifying emergent goal-directedness
Goal: Use MEG to measure the degree of goal pursuit by humans, AIs, and composites.
Methodological focus:
- Compute \(\mathrm{MEG}_U(\pi)\) for known goals (e.g. watch-time) and unknown-utility \(\mathrm{MEG}_T(\pi)\) for candidate “preference-shift” target variables \(T\).
 - Apply soft value iteration (Def. 5.2) to estimate maximum-entropy policies, using the gradient
 
\[ \nabla_\beta \;=\; \mathbb{E}_\pi[U] - \mathbb{E}_{\pi^{\text{soft}}_\beta}[U] . \]
Research question: How goal-directed toward engagement vs. preference-change are the human surrogates and the AI recommender, and how does this vary across manipulative vs. non-manipulative training regimes?
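A minimal sketch of fitting \(\beta\) by ascending the gradient above, reusing the `soft_policies()` recursion from the Section 2.2 sketch; the array layout, helper, and learning rate are assumptions.

```python
import numpy as np

def expected_utility(P, U, policy, rho0):
    """Expected total utility of `policy`, propagating the state occupancy forward."""
    rho, val = rho0.copy(), 0.0
    for t in range(len(U)):
        val += np.sum(rho[:, None] * policy[t] * U[t])
        rho = np.einsum("s,sa,sab->b", rho, policy[t], P[t])
    return val

def fit_beta(P, U, pi, rho0, lr=0.1, steps=500):
    """Moment-matching ascent: grad_beta = E_pi[U] - E_{pi_soft_beta}[U]."""
    beta = 1.0
    target = expected_utility(P, U, pi, rho0)                  # E_pi[U], fixed
    for _ in range(steps):
        grad = target - expected_utility(P, U, soft_policies(P, U, beta), rho0)
        beta = max(1e-3, beta + lr * grad)                     # keep beta positive
    return beta
```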
4.3 Project 3: Behavioural intention probes for manipulation detection
Goal: Extend contextual intervention-based intention tests to human–AI interactions, using surrogate human models.
Methodological focus:
- Define outcomes \(H\) (engagement) and \(T\) (preference state).
 - Implement probes: guarantee \(H=\text{high}\) (engagement assured) and observe if the AI policy still selects content that changes \(T\).
 - Apply Definition 3 of Behavioural Intention in SCIMs:
 
\[ \Gamma(M)\neq \Gamma(M_{O_\pi\mid \mathcal W}) \;\;\Rightarrow\;\; O \text{ was intended}. \]
Research question: Can we empirically detect when an AI (or the human–AI composite) is pursuing preference manipulation as an intended goal?
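As a toy illustration of the Definition 3 probe applied to the preference-state outcome \(T\): in one scenario the shift is the means by which the AI obtains engagement, in the other it is a pure side effect. The structural equations, numbers, and argmax policy oracle are invented for illustration.

```python
import numpy as np

# Action a: 0 = neutral content, 1 = preference-shifting content.
# T = whether the preference shifted; H = engagement; AI utility = H.

def outcomes(a, scenario, t_fixed=None):
    """Structural equations; passing t_fixed guarantees T (a do-intervention)."""
    t = a if t_fixed is None else t_fixed
    if scenario == "shift_is_means":      # engagement flows through the preference shift
        h = 0.6 + 0.3 * t
    else:                                 # "shift_is_side_effect": engagement depends on content only
        h = 0.6 + 0.1 * a
    return t, h

def gamma(scenario, t_fixed=None):
    """Policy oracle: argmax of expected utility; ties resolved toward the neutral action,
    standing in for the oracle's freedom to pick among equally good policies."""
    return int(np.argmax([outcomes(a, scenario, t_fixed)[1] for a in (0, 1)]))

for scenario in ("shift_is_means", "shift_is_side_effect"):
    a_star = gamma(scenario)                                   # Gamma(M)
    t_under_pi, _ = outcomes(a_star, scenario)
    a_guaranteed = gamma(scenario, t_fixed=t_under_pi)         # Gamma(M_{T_pi})
    print(scenario, "-> preference shift intended:", a_guaranteed != a_star)
# shift_is_means: guaranteeing T makes the alternative equally good, the oracle's choice
# changes, so T was intended (as a means to engagement).
# shift_is_side_effect: the policy is unchanged, so T was a mere side effect.
```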
4.4 Project 4: Robust human–AI agency maps
Goal: Address coarse-graining and limited interventions.
Methodological focus:
- Use mSCM marginalisation/merging procedures to collapse multiple human micro-actions into interpretable “decision” nodes.
 - Develop statistical relaxations of structural mechanism interventions that work with finite, noisy human data (building on Dezfouli’s cross-validated RNN modelling).
 - Explore interventional MEG to separate true from spurious goals under distributional shift.
 
Research question: What experimental levers suffice to reliably distinguish agentic vs. non-agentic loci in mixed human–AI systems?
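As a starting point on the coarse-graining question, here is a minimal sketch of collapsing several human micro-action nodes into a single macro “decision” node by graph contraction. The dict-of-parents representation and node names are illustrative, and whether such naive merging preserves decision/utility labels is precisely the open problem above.

```python
def merge_nodes(parents, group, merged_name):
    """Merge `group` into a single node; edges internal to the group are dropped."""
    group = set(group)
    merged_parents = set()
    contracted = {}
    for node, pa in parents.items():
        if node in group:
            merged_parents |= pa - group
        else:
            contracted[node] = {merged_name if p in group else p for p in pa}
    contracted[merged_name] = merged_parents
    return contracted

# Toy example: scroll/hover/click collapse into one human decision node D_h.
parents = {
    "scroll":     {"ui_state"},
    "hover":      {"scroll", "ui_state"},
    "click":      {"hover", "rec_slate"},
    "watch_time": {"click"},
}
print(merge_nodes(parents, {"scroll", "hover", "click"}, "D_h"))
# watch_time now depends on D_h; D_h inherits the external parents {ui_state, rec_slate}.
```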
5 Expected contributions
- Unified methodology. A pipeline: data → surrogate human policy models → causal discovery (agency maps) → MEG measurement → intention probes.
 - Theory extensions. Formal results on agency discovery under partial interventions; statistical versions of intention tests; integration of surrogate models into mSCM frameworks.
 - Safety relevance. Demonstrate (or falsify) empirical detection of preference manipulation in human–AI systems, enabling “agency audits” for recommender systems and future AI assistants.
 - Open resources. Code, reproducible tasks, and benchmark datasets for community use.
 
This positions the project as a first-step theoretical & methodological research programme: grounded in formal causal/game-theoretic definitions of agency and intent, combined with data-driven models of human learning. It directly addresses open theoretical questions and yields deployable evaluation methods for AI safety.
