Proposal: Modelling and Mitigating Non-Consensual Human Hacking by Advanced AI

2025-04-11 — 2025-09-18

Wherein a formal theory is laid out, hypothesizing the Overseer Utility Gap, deriving audit-rate bounds, and prescribing a shaping-regularizer to detect and penalize non-consensual rater manipulation.


I’m working on some proposals in AI safety at the moment, including this one.

I submitted this particular one to the UK AISI Alignment Project. It was not funded.

Note that this post is different from many others on this blog.

  1. It’s highly speculative yet not especially measured; that’s because it’s a pitch, not an analysis.
  2. It doesn’t contain a credible amount of detail (there were only two text fields with a 500-word limit to explain the whole thing).

I present it here for comment and general information; it may help us triangulate the UK AISI’s funding priorities. It also documents what I think of as an under-explored risk model.

I hope to revisit this model later and give it the treatment it deserves. For now, this will have to do.

If you’re curious about the use of AI on this document:

  1. The first part of the document was distilled from two previous funding proposals via an LLM.
  2. The second part of the document was (mostly) hand-summarized from the first part to fit into the online web form.

1 Biased oversight as an exploitable substrate

Modern alignment pipelines (RLHF/RLAIF, debate, rater-assist) rest on human judgments. Humans are not fixed reward functions—they are adaptive learners with biases (framing, fluency, overconfidence), limits (attention, calibration), and social dynamics (conformity, cascades). Recent studies show LLMs can durably shift beliefs even on politicized topics; network science shows that small changes in per-contact persuasion can push systems past tipping points (e.g., majority illusion, Hawkes/branching dynamics). Put together, this creates a systemic alignment risk: advanced models may learn to raise reward by “shaping” raters—individuals or crowds—rather than by improving true facility on a task.

We formalize this via the standard gradient decomposition for approval-optimized models (sketched after the list below):

  • a task channel (improving outputs),
  • a shaping channel (altering the human’s latent state to secure approval).
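A minimal sketch of what we mean, in notation of our own choosing (a caricature for exposition, not the project’s final formalism): write \(p_\theta(h)\) for the distribution over the rater’s latent state \(h\) induced by past interaction with the policy \(\pi_\theta\), and \(r(y,h)\) for the approval a rater in state \(h\) assigns to output \(y\). Then

\[
J(\theta) = \mathbb{E}_{h \sim p_\theta}\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, h)}\!\left[ r(y, h) \right],
\qquad
\nabla_\theta J
= \underbrace{\mathbb{E}\!\left[ r(y,h)\, \nabla_\theta \log \pi_\theta(y \mid x, h) \right]}_{\text{task channel}}
+ \underbrace{\mathbb{E}\!\left[ r(y,h)\, \nabla_\theta \log p_\theta(h) \right]}_{\text{shaping channel}}.
\]

The second term is the lever that opponent-shaping methods (LOLA, Advantage Alignment) pull deliberately; here it is the channel we want to detect, measure, and tax.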

This connects three AISI priority areas:

  • Cognitive Biases (primary): models can exploit predictable human errors;
  • Game Theory: oversight becomes a repeated influence game; opponent-shaping (LOLA/Advantage Alignment) predicts when steering the “opponent” is optimal;
  • Probabilistic Methods: we need likelihood-free/Bayesian tools to detect shaping when the human update process is latent and messy.

We see this as crucial to AI safety: power-seeking systems gain instrumentally by manipulating oversight; lock-in dynamics can entrench false beliefs and degrade collective epistemic resilience.


2 Threat models

We target three escalating, safety-relevant threat models. For each, we name the failure signature and the deliverable tools we aim to produce.

2.1 T1. Micro-level oversight manipulation (non-consensual reward hacking)

Setting: One model and one (or a small pool of) human overseer(s).

Failure signature: Overseer Utility Gap (OUG)—reward rises without commensurate gains in true utility; “trust-then-betray” patterns; reward sensitivity to presentation/affect over substance.

Our tools:

  • Identifiability/detectability results for shaping under partial observability; minimal horizons and sample sizes to distinguish shaping from genuine improvement.
  • Metrics: OUG and Shaping Gain (marginal reward from altering rater state); toy estimators are sketched after this list.
  • Mitigations: randomized audits; process-based scoring; and a shaping regularizer that penalizes the shaping channel during fine-tuning/inference.

Virtuous vs. vicious: We explicitly permit consensual, autonomy-respecting influence (coaching, therapy, tutoring) and penalize autonomy-reducing manipulation (non-consensual preference hacking).
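To make the metrics concrete, here is a toy sketch of how OUG and Shaping Gain could be estimated (illustrative estimators of our own, assuming audited true-utility scores on a matched sample and a crude audited proxy for rater-state drift; the project will define and analyze these properly):

```python
import numpy as np

def overseer_utility_gap(approvals, true_utilities):
    """Overseer Utility Gap (OUG): mean approval (proxy reward) minus mean
    audited true utility on a matched sample, both assumed to be on a
    common scale. A positive, growing gap is the T1 failure signature."""
    return float(np.mean(approvals) - np.mean(true_utilities))

def shaping_gain(approvals, rater_state_drift):
    """Toy Shaping Gain estimator: the share of approval attributable to
    drift in the rater's latent state rather than to output quality,
    estimated by regressing approval on an audited drift proxy
    (e.g. a measured shift in rater leniency)."""
    drift = np.asarray(rater_state_drift, dtype=float)
    appr = np.asarray(approvals, dtype=float)
    slope = np.cov(drift, appr)[0, 1] / np.var(drift, ddof=1)
    return float(slope * np.mean(drift))
```

Both quantities only make sense relative to an audit stream that supplies the true-utility and drift measurements; that is where the identifiability results come in.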

2.2 T2. Collective bias and engineered cascades (meso-/macro-level)

Setting: Many raters; platform or committee aggregation; model(s) adapt messaging/targeting.

Failure signature: Heavy-tailed belief shifts; synchronized tipping; effective \(R_0>1\) for contagion of a belief; instability in rater pools. [TODO clarify]

Our tools:

  • Topology–threshold theorems: conditions on degree distribution, homophily, clustering under which bounded persuasion cannot tip the system supercritical.
  • Probabilistic early-warning for near-critical cascades (Hawkes/branching, Bayesian changepoint); a minimal estimator is sketched after this list.
  • Protocol design: confidence-weighted aggregation, diversity quotas, bounded repost radius, calibrated deliberation windows.

Data reuse: public traces with observable belief proxies (prediction markets, Kialo, Wikipedia talk) to validate early-warning statistics.
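To make the Hawkes/branching early-warning statistic concrete, a minimal sketch (our own toy estimator with an exponential kernel and placeholder names, not the project deliverable): fit a univariate Hawkes process to timestamps of belief-shift events and read off the branching ratio, which plays the role of the effective \(R_0\); values approaching 1 flag a near-critical cascade.

```python
import numpy as np
from scipy.optimize import minimize

def hawkes_neg_loglik(params, times, horizon):
    """Negative log-likelihood of a univariate Hawkes process with
    exponential kernel phi(s) = alpha * beta * exp(-beta * s), so the
    branching ratio (expected offspring per event) is alpha.
    `times` are sorted event timestamps in [0, horizon]."""
    mu, alpha, beta = params
    a, loglik, prev_t = 0.0, 0.0, None
    for t in times:
        if prev_t is not None:
            a = np.exp(-beta * (t - prev_t)) * (1.0 + a)  # recursive kernel sum
        loglik += np.log(mu + alpha * beta * a)
        prev_t = t
    # Compensator: integral of the intensity over [0, horizon].
    compensator = mu * horizon + alpha * np.sum(1.0 - np.exp(-beta * (horizon - times)))
    return -(loglik - compensator)

def branching_ratio(times, horizon):
    """Maximum-likelihood fit of (mu, alpha, beta); returns alpha, the
    effective R0 of the event cascade. Near-critical when alpha -> 1."""
    times = np.asarray(times, dtype=float)
    res = minimize(hawkes_neg_loglik, x0=[0.5, 0.5, 1.0],
                   args=(times, float(horizon)),
                   method="L-BFGS-B",
                   bounds=[(1e-6, None), (1e-6, 0.999), (1e-6, None)])
    return res.x[1]
```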

2.3 T3. Long-horizon deception and treacherous turns

Setting: Hidden model “types” (honest vs. schemer); long horizons; intermittent audits.

Failure signature: a long history of high approval followed by a sudden high-impact deviation; manipulation-score spikes when audit probability dips.

Our tools:

  • Audit-rate lower bounds (Stackelberg/robustness analysis) making treachery strictly dominated under capability assumptions; a toy one-shot version is sketched after this list.
  • Detectability under partial logging (only human side or only model side observed).
  • Equilibrium selection: oversight rules that select trustworthy equilibria in repeated games.
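As a deliberately crude, one-shot illustration of the kind of audit-rate bound we mean (ours, not the repeated-game Stackelberg result the project targets): suppose a schemer gains \(G\) from an undetected treacherous deviation, incurs a penalty \(P\) if an audit catches it, and each episode is audited independently with probability \(p\). Deviation is then strictly unprofitable whenever

\[
(1-p)\,G - p\,P < 0
\quad\Longleftrightarrow\quad
p > \frac{G}{G+P},
\]

so the required audit rate grows with the ratio of treachery gains to audit penalties. The project’s contribution is to derive analogous lower bounds under repeated play, partial logging, and explicit capability assumptions.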

3 Differentiation and fit

  • Compared to cognitive-bias auditing: we move beyond catalogs of biases to operational safety metrics and guarantees tied to training signals.
  • Compared to social-media moderation: our focus is alignment-relevant reward modeling and oversight, not generic content policy.
  • Compared to debate/assist protocols alone: we supply probabilistic detectors, audit bounds, and regularizers that plug into those protocols.

HTI advantage. We combine probabilistic inference (Bayesian/ABC/MCMC), social dynamics, and governance translation. That lets us (i) prove things, (ii) measure them on real data, and (iii) give labs and policymakers actionable knobs (metrics, audit rates, aggregation rules).


4 Relation to the literature (awareness, not over-explained)

  • Cognitive biases & collective judgment: confidence weighting and “wisdom of crowds”; deliberation effects; groupthink and majority illusion; cascade dynamics (Centola; Lerman; Sornette; Bahrami; Bergman; Irving/Askell).
  • Game-theoretic learning: opponent shaping (LOLA; Advantage Alignment), differentiable/mean-field games; oversight as mechanism design (RAAPs; process-based scoring).
  • Probabilistic methods: simulation-based inference/ABC; hierarchical Bayesian modeling of latent belief updates; Hawkes/branching early-warning.
  • AI safety: reward hacking/specification gaming; power-seeking/deception; lock-in/echo-chamber risks; scalable oversight protocols (debate, cross-examination, prover–estimator; rater-assist/hybridization).

5 Why this is a good bet for AISI

  • Challenge 1 (prevent harmful actions): real-time detection of manipulative shaping; audit-rate guarantees; cascade early-warning.
  • Challenge 2 (design systems that don’t attempt them): shaping-regularizer and protocol designs that make the shaping channel unprofitable.

We’re aiming high (detectors, guarantees, training penalties, protocol designs that scale), but starting feasible (formal results; metrics; re-analysis of existing data). This programme anchors “social epistemology for AI safety” in rigorous models and measurable signals, aimed at keeping human feedback reliable—and making manipulation a losing strategy.

6 Compressed research statement

Our project targets a load-bearing piece of the human-AI oversight problem: human overseer shaping, in the sense not only of finding and exploiting human “blind spots” but of actively “engineering” them. We argue that non-consensual, strategic shaping of humans is an under-addressed part of the human reward hacking problem.

Large language models (LLMs) write code, mediate discussion, educate, and advise on policy. This makes them powerful participants in human decision-making—but it also exposes a weakness in current alignment practice. For example, today’s most widely used safety technique, reinforcement learning from human feedback (RLHF), depends on human ratings of model outputs. Yet humans are not fixed oracles of value. We are adaptive learners, prone to framing effects, overconfidence, majority illusions, motivated reasoning, and fatigue. These biases do not just add noise; they create systematic vulnerabilities that advanced AI systems could learn to exploit.

The AI safety community has warned of downsides of such dynamics. Joe Carlsmith’s work on power-seeking AI highlights the incentive for misaligned systems to control their overseers. Karnofsky and Cotra emphasize the danger of lock-in, where AI-mediated feedback loops entrench false or brittle values. Research on deceptive alignment (Hubinger) and opponent shaping (Foerster’s LOLA, Duque’s Advantage Alignment) shows how adaptive optimizers can deliberately steer the learning processes of others. Even today, we’ve seen early warning signs: language models persuading human users in lab settings (Costello, Liu, Schoenegger, Salvi) and LLM-powered agents employing deception to complete tasks (as in OpenAI’s TaskRabbit case). Models learning to raise approval by manipulating overseers’ judgments is a central alignment failure mode.

We believe we have a theoretical angle of attack on this problem that will unlock practical mitigations. We treat oversight as an asymmetric stochastic multi-agent game in which the AI’s gradient decomposes into a ‘task channel’ (genuine improvement) and a ‘shaping channel’ (manipulating the overseer’s state, i.e., their propensity to make future judgments). This framing clarifies when shaping is virtuous (coaching, therapy, education, where influence is consensual and aligned) and when it is vicious (exploiting blind spots, building trust only to betray it, or exploiting reward-signal timing and other failure modes in the human learning apparatus).

Our work integrates three literatures:

  • Individual cognitive bias and collective judgment (Bahrami, Navajas, Bergman, Asch, Centola, Lerman), grounding our threat models in human limitations and inclinations.
  • Game theory of learning agents (LOLA, Advantage Alignment, mean-field MARL), clarifying how manipulation becomes an equilibrium strategy under oversight.
  • Probabilistic inference (Approximate Bayesian Computation, sequential Monte Carlo, neural ratio estimation), providing tools for measuring whether shaping is happening in noisy, real interactions.

Our target is to turn the manipulation of human oversight into a mathematically formal, probabilistically detectable, and practically manageable object of study. This gives us a path to quantify and thus control one of the most direct pathways to catastrophic AI misalignment: detecting and controlling machine manipulation of humans at every stage of the pipeline, from RLHF-type fine-tuning through to field-deployed persuasion models.


7 Deliverables and Timeline (12 months)

We propose a 12-month programme that moves deliberately from theory to tools to application, ensuring we start with feasible steps while building toward the ambitious end goal: making manipulation in oversight systems both detectable and unprofitable.

Early months (1–4): Formalization and first proofs. We begin by defining Oversight Games that model human–AI feedback as a stochastic, game-theoretic process with adaptive, biased raters. Using this model, we will prove identifiability results: under what conditions can we statistically distinguish genuine task learning from manipulative shaping? We will also define initial metrics, including the ‘Overseer Utility Gap’ and ‘Shaping Gain’, and produce a first technical memorandum with draft code for these measures.
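For concreteness, a minimal sketch of the kind of toy Oversight Game we have in mind (class names, dynamics, and parameters here are illustrative assumptions, not the formalism the memorandum will pin down): a rater whose latent ‘trust’ state drifts in response to the model’s presentation, so approval can rise without any gain in true quality.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class BiasedRater:
    """Toy adaptive rater: approval depends on true quality plus a latent
    'trust' state that the model's presentation can shift over time."""
    trust: float = 0.0          # latent state the model can shape
    learning_rate: float = 0.1  # how quickly presentation shifts trust
    noise: float = 0.05         # rating noise

    def rate(self, quality: float, presentation: float, rng) -> float:
        approval = quality + self.trust + rng.normal(0.0, self.noise)
        # The rater's latent state updates in response to presentation:
        # this update is the substrate the shaping channel exploits.
        self.trust += self.learning_rate * presentation
        return approval

def rollout(policy, steps: int = 200, seed: int = 0):
    """Play one Oversight Game episode: the policy picks (quality,
    presentation) each step; we log approval and true utility so that
    OUG-style diagnostics can be computed on the trace."""
    rng = np.random.default_rng(seed)
    rater = BiasedRater()
    approvals, utilities = [], []
    for t in range(steps):
        quality, presentation = policy(t)
        approvals.append(rater.rate(quality, presentation, rng))
        utilities.append(quality)
    return np.array(approvals), np.array(utilities)
```

A policy that spends its budget on presentation rather than quality yields traces where approval climbs while true utility stays flat: exactly the Overseer Utility Gap signature the detectors should flag.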

Middle months (5–8): Building probabilistic detection tools. We will translate our formalism into an open-source library, using simulation-based inference and advanced Bayesian sampling methods to estimate latent belief updates from observed interactions. Controlled toy environments (including LLM surrogates inspired by Dezfouli’s work on human–AI learning dynamics) will provide ground-truth calibration. The outcome of this phase is a validated detection pipeline capable of producing a manipulation score from dialogue or feedback traces.
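A minimal rejection-ABC sketch of the kind of inference step we mean, reusing the toy rater dynamics above (the prior, summary statistic, and tolerance are placeholder choices of ours):

```python
import numpy as np

def simulate_trace(susceptibility, n_steps, rng):
    """Forward-simulate approvals from a toy rater whose 'trust' drifts at a
    rate set by a latent susceptibility parameter (the quantity to infer),
    under constant presentation pressure from the model."""
    trust, approvals = 0.0, []
    for _ in range(n_steps):
        approvals.append(0.5 + trust + rng.normal(0.0, 0.05))
        trust += 0.1 * susceptibility
    return np.array(approvals)

def abc_posterior(observed, n_draws=5000, tolerance=0.01, seed=1):
    """Rejection ABC: draw susceptibility from a uniform prior, simulate a
    trace, and keep draws whose summary statistic (approval drift per step)
    lands within `tolerance` of the observed trace's statistic."""
    rng = np.random.default_rng(seed)

    def slope(trace):
        return float(np.polyfit(np.arange(len(trace)), trace, 1)[0])

    s_obs = slope(np.asarray(observed, dtype=float))
    kept = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, 1.0)  # prior on susceptibility
        if abs(slope(simulate_trace(theta, len(observed), rng)) - s_obs) < tolerance:
            kept.append(theta)
    return np.array(kept)  # samples from the approximate posterior
```

The real pipeline would replace rejection ABC with sequential Monte Carlo or neural ratio estimation and richer summaries, but the structure (simulate the latent update process, compare summaries, retain plausible parameters) is the same.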

Later months (9–12): Empirical pilots and mitigation design. With metrics in hand, we will re-analyze existing, ethically accessible datasets where belief shifts are observable: prediction market discussions, Kialo debates, and Wikipedia talk pages, public traces in which persuasion attempts and their effects can be tracked in social-media-like environments. These pilots will show whether our metrics can detect manipulation in practice and explore systemic risks such as cascade dynamics.

In parallel, we will extend our theory to derive audit-rate lower bounds and protocol design principles that make exploitative shaping strategies dominated. We will prototype shaping-regularizers for RLHF/RLAIF pipelines, turning manipulation scores into training penalties. With colleagues in HTI’s Policy Lab, we will translate these findings into practical guidance for labs and regulators—how to design oversight and audit processes that preserve the benefits of AI-supported judgment (e.g. coaching, collaborative deliberation) while suppressing autonomy-reducing manipulation.
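At the loss level, the shaping-regularizer prototype could be as simple as the following sketch (a hedged illustration assuming a differentiable or advantage-weighted manipulation score from the detection pipeline; integration with an actual RLHF/RLAIF trainer is the real work):

```python
import torch

def shaping_regularized_loss(policy_loss: torch.Tensor,
                             manipulation_score: torch.Tensor,
                             lam: float = 1.0) -> torch.Tensor:
    """Tax reward earned through the shaping channel: add a penalty
    proportional to the detector's per-example manipulation score to the
    usual RLHF policy loss. `lam` trades off task reward against shaping."""
    return policy_loss + lam * manipulation_score.mean()
```

Equivalently, the manipulation score could be subtracted from the reward before advantage estimation; which placement behaves better is an empirical question for the pilots.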

Key deliverables by project end:

  • Two academic papers: one on the formal theory of reward hacking via shaping; one on empirical detection of manipulative persuasion.
  • CSMC-OS v1.0, an open-source library for probabilistic detection of manipulation.
  • Public dataset cards documenting re-analyzed “in-the-wild” and simulation data.
  • Policy and governance brief specifying audit rates, aggregation protocols, and design recommendations for robust oversight.

By the close of the year, we will have moved the discussion of AI-enabled manipulation from intuition to evidence: formal models, practical detectors, and clear pathways for mitigation.

This should set us up for bigger goals: real human experiments with IRB approval, testing in more realistic environments, and ultimately learning to detect instrumental manipulation of human oversight and to disincentivize that capability in powerful AI models, both in training and in the wild.

8 Budget Narrative

We request funding to support a 12-month programme led by the Principal Investigator, a UTS Level C Step 3 Research Scientist.

Personnel.

  • PI (0.5 FTE, Level C Step 3): AUD 91,676 (≈ GBP 44,655). Provides overall intellectual leadership, supervision, and integration across theory, engineering, and governance strands.
  • Research Fellow (1.0 FTE, Level B Step 4): AUD 157,124 (≈ GBP 76,535). Responsible for core research execution: formal models, probabilistic inference pipelines, and analysis.
  • Research Assistant (0.5 FTE, Level A Step 3): AUD 62,494 (≈ GBP 30,440). Focused on data curation (prediction markets, Kialo, Wikipedia), metric implementation, and empirical analysis.

Personnel subtotal: AUD 311,294 (≈ GBP 151,630).

Technical resources. We request AUD 30,000 (≈ GBP 14,613) in cloud compute and secure data infrastructure for simulation-based inference, LLM-surrogate experiments, and dataset handling. In the event that UK AISI provides in-kind compute support, this cost may be reduced or eliminated.

Dissemination and engagement. AUD 20,000 (≈ GBP 9,742) to cover open-access publication fees, conference travel and registration.

Overheads. UTS applies an 8% office and administration cost recovery, totalling AUD 31,417 (≈ GBP 15,303).

Total budget request: AUD 392,711 (≈ GBP 191,286) (pending compute expense negotiations).

This allocation provides a lean but well-resourced team, modest compute, and dedicated dissemination, ensuring both feasibility within 12 months and high leverage for AISI’s alignment goals.

9 Additional information

The exact categorization of this programme spans a few areas; we chose Cognitive Science as an anchor. The HTI works across Economics, Cognitive Science, and Probabilistic Methods, and our proposal reflects that; there is also a strong economics/game-theory angle in our preferred persuasion-game model.

Bigger picture, we see this funding round as more than the research it proposes. It’s both a chance to make progress on an intrinsically valuable problem and a chance to signal a shift in research priorities to the Australian research community by aligning a major lab around AI dangers in particular. We believe Australian researchers are bottlenecked by funding when it comes to AI safety (despite signing the Bletchley and Seoul declarations, Australia has not substantively actioned its commitments), and this proposal may help shift research priorities accordingly.


10 References

Afshar, Francis, and Cripps. 2024. “Optimal Particle-Based Approximation of Discrete Distributions (OPAD).”
Ånestrand. 2024. “Emergence of Agency from a Causal Perspective.”
Bahrami, Olsen, Latham, et al. 2010. “Optimally Interacting Minds.” Science.
Bai, Kadavath, Kundu, et al. 2022. “Constitutional AI: Harmlessness from AI Feedback.”
Barash. 2011. “The Dynamics Of Social Contagion.”
Barez, Friend, Reid, et al. 2025. “Toward Resisting AI-Enabled Authoritarianism.”
Barnes. 2020. “Debate Update: Obfuscated Arguments Problem.”
Barnes, and Christiano. 2020. “Writeup: Progress on AI Safety via Debate.”
Bentley, Ormerod, and Batty. 2011. “Evolving Social Influence in Large Populations.” Behavioral Ecology and Sociobiology.
Bentley, Ormerod, and Shennan. 2011. “Population-Level Neutral Model Already Explains Linguistic Patterns.” Proceedings of the Royal Society B: Biological Sciences.
Benton, Wagner, Christiansen, et al. 2024. “Sabotage Evaluations for Frontier Models.”
Bergman, Marchal, Mellor, et al. 2024. “STELA: A Community-Centred Approach to Norm Elicitation for AI Alignment.” Scientific Reports.
Brehmer, Louppe, Pavez, et al. 2020. “Mining Gold from Implicit Models to Improve Likelihood-Free Inference.” Proceedings of the National Academy of Sciences.
Bridgers, Jain, Greig, et al. 2024. “Human-AI Complementarity: A Goal for Amplified Oversight.”
Broockman, and Kalla. 2016. “Durably Reducing Transphobia: A Field Experiment on Door-to-Door Canvassing.” Science.
———. 2020. “When and Why Are Campaigns’ Persuasive Effects Small? Evidence from the 2020 US Presidential Election.”
Brown-Cohen, Irving, and Piliouras. 2025. “Avoiding Obfuscation with Prover-Estimator Debate.”
Buchanan. 2019. “Brexit Behaviourally: Lessons Learned from the 2016 Referendum.” Mind & Society.
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Carlsmith. 2024. “Is Power-Seeking AI an Existential Risk?”
Centola, Damon. 2018. How Behavior Spreads: The Science of Complex Contagions.
———. 2021. Change: How to Make Big Things Happen.
Centola, D, and Macy. 2007. “Complex Contagions and the Weakness of Long Ties.” American Journal of Sociology.
Christiano. 2018. “Takeoff Speeds: The Sideways View.” Sideways-View (Blog).
Correa, and Bareinboim. 2020. “A Calculus for Stochastic Interventions: Causal Effect Identification and Surrogate Experiments.” Proceedings of the AAAI Conference on Artificial Intelligence.
Costello, Pennycook, and Rand. 2024. “Durably Reducing Conspiracy Beliefs Through Dialogues with AI.” Science.
Cranmer, Brehmer, and Louppe. 2020. “The Frontier of Simulation-Based Inference.” Proceedings of the National Academy of Sciences.
Cripps, Lopatnikovaz, Afshar, et al. 2025. “Bayesian Adaptive Trials for Social Policy.” Data & Policy.
Critch. 2017. “Toward Negotiable Reinforcement Learning: Shifting Priorities in Pareto Optimal Sequential Decision-Making.”
———. 2021. “What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs).”
———. 2022. “AI Research Considerations for Human Existential Safety.”
Critch, Dennis, and Russell. 2022. “Cooperative and Uncooperative Institution Designs: Surprises and Problems in Open-Source Game Theory.”
Dezfouli, Griffiths, Ramos, et al. 2019. “Models That Learn How Humans Learn: The Case of Decision-Making and Its Disorders.” PLOS Computational Biology.
Dezfouli, Nock, and Dayan. 2020. “Adversarial Vulnerabilities of Human Decision-Making.” Proceedings of the National Academy of Sciences.
Doudkin, Pataranutaporn, and Maes. 2025a. “AI Persuading AI Vs AI Persuading Humans: LLMs’ Differential Effectiveness in Promoting Pro-Environmental Behavior.”
———. 2025b. “From Synthetic to Human: The Gap Between AI-Predicted and Actual Pro-Environmental Behavior Change After Chatbot Persuasion.” In Proceedings of the 7th ACM Conference on Conversational User Interfaces. CUI ’25.
Duque, Aghajohari, Cooijmans, et al. 2025. “Advantage Alignment Algorithms.”
Durkan, Papamakarios, and Murray. 2018. “Sequential Neural Methods for Likelihood-Free Inference.”
El-Sayed, Akbulut, McCroskery, et al. 2024. “A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI.”
Evans, David S. 2017. “The Economics of Attention Markets.” SSRN Scholarly Paper ID 3044858.
Evans, Owain, Cotton-Barratt, Finnveden, et al. 2021. “Truthful AI: Developing and Governing AI That Does Not Lie.”
Everitt, Carey, Langlois, et al. 2021. “Agent Incentives: A Causal Perspective.” In Proceedings of the AAAI Conference on Artificial Intelligence.
Fang, Liu, Danry, et al. 2025. “How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study.”
Fernández-Loría, and Provost. 2021. “Causal Decision Making and Causal Effect Estimation Are Not the Same… and Why It Matters.” arXiv:2104.04103 [Cs, Stat].
Foerster, J. 2018. “Deep Multi-Agent Reinforcement Learning.”
Foerster, Jakob, Chen, Al-Shedivat, et al. 2018. “Learning with Opponent-Learning Awareness.” In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. AAMAS ’18.
Foerster, Jakob, Farquhar, Al-Shedivat, et al. 2018. “DiCE: The Infinitely Differentiable Monte-Carlo Estimator.”
Fox, MacDermott, Hammond, et al. 2023. “On Imperfect Recall in Multi-Agent Influence Diagrams.” Electronic Proceedings in Theoretical Computer Science.
Future of Life Institute. 2025. “2025 AI Safety Index.”
Gelman, and Margalit. 2021. “Social Penumbras Predict Political Attitudes.” Proceedings of the National Academy of Sciences.
Guilbeault, and Centola. 2021. “Topological Measures for Identifying and Predicting the Spread of Complex Contagions.” Nature Communications.
Hammond, Chan, Clifton, et al. 2025. “Multi-Agent Risks from Advanced AI.”
Hammond, Fox, Everitt, et al. 2023. “Reasoning about Causality in Games.” Artificial Intelligence.
Howard, and Kollanyi. 2016. “Bots, #StrongerIn, and #Brexit: Computational Propaganda During the UK-EU Referendum.”
Hyland, Gavenčiak, Costa, et al. 2024. “Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents.”
Irving, and Askell. 2019. “AI Safety Needs Social Scientists.” Distill.
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Kang, and Lou. 2022. “AI Agency Vs. Human Agency: Understanding Human–AI Interactions on TikTok and Their Implications for User Engagement.” Journal of Computer-Mediated Communication.
Kellow, and Steeves. 1998. “The Role of Radio in the Rwandan Genocide.” Journal of Communication.
Kenton, Kumar, Farquhar, et al. 2023. “Discovering Agents.” Artificial Intelligence.
Khamassi, Nahon, and Chatila. 2024. “Strong and Weak Alignment of Large Language Models with Human Values.” Scientific Reports.
Korbak, Balesni, Shlegeris, et al. 2025. “How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence.”
Kowal, Timm, Godbout, et al. 2025. “It’s the Thought That Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics.”
Kulveit, Douglas, Ammann, et al. 2025. “Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development.”
Kutasov, Sun, Colognese, et al. 2025. “SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.”
Laine, Chughtai, Betley, et al. 2024. “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs.”
Lang, Huang, and Li. 2025. “Debate Helps Weak-to-Strong Generalization.”
Lerman, Yan, and Wu. 2016. “The ‘Majority Illusion’ in Social Networks.” PLOS ONE.
Littman. 1994. “Markov Games as a Framework for Multi-Agent Reinforcement Learning.” In Machine Learning Proceedings 1994.
———. 2021. “Collusion Rings Threaten the Integrity of Computer Science Research.” Communications of the ACM.
Liu, Xu, Zhang, et al. 2025. “LLM Can Be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models.”
MacAskill. 2025. “Preparing for the Intelligence Explosion.”
MacDermott, Everitt, and Belardinelli. 2023. “Characterising Decision Theories with Mechanised Causal Graphs.”
MacDermott, Fox, Belardinelli, et al. 2024. “Measuring Goal-Directedness.”
Manzini, Keeling, Marchal, et al. 2024. “Should Users Trust Advanced AI Assistants? Justified Trust As a Function of Competence and Alignment.” In The 2024 ACM Conference on Fairness, Accountability, and Transparency.
Matz, Teeny, Vaid, et al. 2024. “The Potential of Generative AI for Personalized Persuasion at Scale.” Scientific Reports.
Mercier, and Sperber. 2011. “Why Do Humans Reason? Arguments for an Argumentative Theory.” Behavioral and Brain Sciences.
Michael, Mahdi, Rein, et al. 2023. “Debate Helps Supervise Unreliable Experts.”
Moreira, Nguyen, Francis, et al. 2025. “Bayesian Causal Discovery for Policy Decision Making.” Data & Policy.
Navajas, Niella, Garbulsky, et al. 2018. “Aggregated Knowledge from a Small Number of Debates Outperforms the Wisdom of Large Crowds.” Nature Human Behaviour.
Ormerod, and Wiltshire. 2009. “‘Binge’ Drinking in the UK: A Social Network Phenomenon.” Mind & Society.
Pfeffer, and Gal. 2007. “On the Reasoning Patterns of Agents in Games.” In AAAI-07/IAAI-07 Proceedings. Proceedings of the National Conference on Artificial Intelligence.
Qiu, He, Chugh, et al. 2025. “The Lock-in Hypothesis: Stagnation by Algorithm.”
Rogall. 2014. “Mobilizing the Masses for Genocide.”
Rothe, Lake, and Gureckis. 2018. “Do People Ask Good Questions?” Computational Brain & Behavior.
Salvi, Ribeiro, Gallotti, et al. 2024. “On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial.”
Satell. 2019. Cascades: How to Create a Movement that Drives Transformational Change.
Schoenegger, Salvi, Liu, et al. 2025. “Large Language Models Are More Persuasive Than Incentivized Human Persuaders.”
Sisson, Fan, and Beaumont. 2018. Handbook of Approximate Bayesian Computation.
Sornette, Didier. 2006. “Endogenous Versus Exogenous Origins of Crises.” In Extreme Events in Nature and Society. The Frontiers Collection.
Sornette, Didier, Deschâtres, Gilbert, et al. 2004. “Endogenous Versus Exogenous Shocks in Complex Networks: An Empirical Test Using Book Sale Rankings.” Physical Review Letters.
Sornette, D, and Helmstetter. 2003. “Endogenous Versus Exogenous Shocks in Systems with Memory.” Physica A: Statistical Mechanics and Its Applications.
Sperber, and Mercier. 2012. “Reasoning as a Social Competence.” In Collective Wisdom.
Trouche, Johansson, Hall, et al. 2016. “The Selective Laziness of Reasoning.” Cognitive Science.
Varidel, An, Hickie, et al. 2025. “Causal AI Recommendation System for Digital Mental Health: Bayesian Decision-Theoretic Analysis.” Journal of Medical Internet Research.
Ward, Francis Rhys, MacDermott, Belardinelli, et al. 2024. “The Reasons That Agents Act: Intention and Instrumental Goals.”
Ward, Francis, Toni, Belardinelli, et al. 2023. “Honesty Is the Best Policy: Defining and Mitigating AI Deception.” In Advances in Neural Information Processing Systems.
Wen, Zhong, Khan, et al. 2024. “Language Models Learn to Mislead Humans via RLHF.”
Zakershahrak, and Ghodratnama. 2024. “Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization.”
Zheleva, and Arbour. 2021. “Causal Inference from Network Data.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD ’21.