AI persuasion, AI manipulation

Machines of loving rightthink

2023-10-05 — 2025-09-08

adversarial
AI safety
bounded compute
communicating
cooperation
culture
economics
faster pussycat
language
machine learning
mind
neural nets
NLP
security
technology
wonk

Humans, at least in the kinds of experiments we are allowed to do in labs, are not very persuasive. Are machines better?

I now think: Almost certainly. I was flipped to the belief that AIs are already superhumanly persuasive in some settings by Costello, Pennycook, and Rand (2024). See also its accompanying interactive web app DebunkBot.

My reading of the Costello, Pennycook, and Rand (2024) piece is that there is another thing we tend to overlook when trying to understand AI persuasion: AIs can avoid participating in the identity formation and group signalling that underlie human-to-human persuasion.

There is a whole body of literature about when persuasion does work in humans — for example, the work on “Deep Canvassing” had me pretty convinced that I needed to think about persuasion as something that happens after the persuader has managed to emotionally “get into the in-group” of the persuadee.

“AIs can’t do that”, I thought. But it turns out that AIs are not in the out-group to begin with, so they don’t need to. Aside from the patience, the speed of thought and so on, an AI also comes with the superhuman advantage of not looking like a known out-group, and maybe that is more important than looking like the in-group. I would not have picked that.

1 Field experiments in persuasive AI

Framing. Treat field persuasion results as stress tests of scalable oversight: when participants cannot easily evaluate truth directly, models that master surface trust signals will outperform. This suggests any governance or safety evaluation that relies on naïve human judgment will be systematically outplayed unless it incorporates calibration, interrogation rights, and group-level safeguards (Irving and Askell 2019; Bridgers et al. 2024).

e.g. Costello, Pennycook, and Rand (2024). This is pitched as AI-augmented de-radicalisation, but change the goal and it becomes a test case for AI-augmented persuasion in general. Interestingly, the AI doesn’t seem to need the traditional workarounds for human tribal reasoning such as deep canvassing, and scaled-up individualised mass persuasion seems like it could become the dominant game in town, at least until the number of competing persuasive AI chatbots causes people to become inured to them.

See also Salvi et al. (2024), Schoenegger et al. (2025) and the follow-up Kowal et al. (2025).


2 As reward hacking

See human reward hacking.

3 Persuasion as Weak-to-Strong Alignment

In AI safety, weak-to-strong alignment refers to the problem of getting a weaker evaluator (typically a human, or a small group of humans) to correctly guide and supervise a stronger system that may already surpass their task expertise. The challenge is that the weak overseer’s judgments may be noisy, biased, or strategically exploitable, yet they are the only feedback signal the stronger model receives. This dynamic makes weak-to-strong alignment a close cousin of scalable oversight (Burns et al. 2023).

Debate protocols (Michael et al. 2023) are a proposed scalable oversight method where two (or more) AI systems argue opposite sides of a question in front of a human judge. The hope is that when a strong model makes a false or misleading claim, another strong model can expose the flaw in a way the weaker human overseer can recognize. In practice, debate doubles as a stress-test for persuasion: rhetorical skill, obfuscation, and confidence cues can sway judges toward the “wrong” side when it happens to be the more persuasive one (Lang, Huang, and Li 2025).
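
To keep the moving parts straight, here is a minimal sketch of the protocol structure. The debaters and judge are placeholder callables rather than any particular model API, and the confidence-keyword judge stub at the bottom is a deliberately silly stand-in for the failure mode just described.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Placeholder interfaces (assumptions, not any real API):
# a debater maps (question, stance, transcript) -> next argument,
# a judge maps (question, transcript) -> the stance it finds more convincing.
Debater = Callable[[str, str, List[str]], str]
Judge = Callable[[str, List[str]], str]


@dataclass
class DebateProtocol:
    """Minimal two-sided debate: alternate arguments, then ask the judge."""
    pro: Debater
    con: Debater
    judge: Judge
    n_rounds: int = 3
    transcript: List[str] = field(default_factory=list)

    def run(self, question: str) -> str:
        for _ in range(self.n_rounds):
            self.transcript.append("PRO: " + self.pro(question, "pro", self.transcript))
            self.transcript.append("CON: " + self.con(question, "con", self.transcript))
        # The judge only ever sees the arguments, never the ground truth,
        # which is exactly where rhetorical dominance becomes a failure mode.
        return self.judge(question, self.transcript)


if __name__ == "__main__":
    # Canned debaters plus a judge that rewards confident wording:
    # the assertive side wins regardless of the facts.
    pro = lambda q, s, t: "It is certainly true; the evidence is overwhelming."
    con = lambda q, s, t: "There are some reasons to doubt this."
    judge = lambda q, t: "pro" if any("certainly" in line for line in t) else "con"
    print(DebateProtocol(pro, con, judge).run("Is the claim correct?"))
```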

Claim. In scalable oversight, weaker overseers evaluate stronger models. Superhuman persuasion is the mirror image of that problem: the same rater biases that misallocate reward in alignment are the levers that persuasive AIs exploit.

Mechanisms likely to be exploited.

  • Confidence signals: Humans overweight surface cues like fluency, crispness, and certainty relative to ground truth, especially under time pressure or low domain knowledge (Irving and Askell 2019; Bridgers et al. 2024).
  • Miscalibrated aggregation: Confidence-weighted or majority aggregation can help when raters are calibrated, but is vulnerable when calibration is poor or shared biases dominate (Bahrami et al. 2010; Navajas et al. 2018); a toy simulation after this list illustrates the failure.
  • Assisted misjudgment: “Rater-assist” AIs can uplift or distort human judgment depending on trust calibration and reliance patterns (Bridgers et al. 2024).
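
To make the aggregation point concrete, here is a toy Monte Carlo comparison (my own construction; the parameters are invented, not taken from Bahrami et al. or Navajas et al.). Confidence-weighted voting handily beats simple majority while raters’ confidence tracks their correctness, and that advantage evaporates once a small, systematically wrong but maximally confident bloc joins the panel; both schemes also degrade sharply.

```python
import numpy as np

rng = np.random.default_rng(0)


def vote_accuracy(n_trials=20_000, n_biased=0):
    """Toy comparison of simple-majority vs confidence-weighted voting.

    Five calibrated raters are 70% accurate and slightly more confident when
    they happen to be right; each biased rater is only 30% accurate but
    reports maximal confidence.
    """
    maj_hits = wtd_hits = 0
    for _ in range(n_trials):
        truth = 1                                  # fix the true label; the setup is symmetric
        votes, conf = [], []
        for _ in range(5):                         # calibrated raters
            correct = rng.random() < 0.7
            votes.append(truth if correct else -truth)
            conf.append(0.75 if correct else 0.4)
        for _ in range(n_biased):                  # confidently wrong bloc
            correct = rng.random() < 0.3
            votes.append(truth if correct else -truth)
            conf.append(1.0)
        votes, conf = np.array(votes), np.array(conf)
        maj_hits += np.sign(votes.sum()) == truth
        wtd_hits += np.sign((votes * conf).sum()) == truth
    return maj_hits / n_trials, wtd_hits / n_trials


for n_biased in (0, 4):
    maj, wtd = vote_accuracy(n_biased=n_biased)
    print(f"{n_biased} biased raters: majority {maj:.2f}, confidence-weighted {wtd:.2f}")
```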

Prediction. As systems get better at modeling human judgment, they can learn to maximize perceived quality rather than true quality—a persuasion form of reward hacking.
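
A toy numerical version of that prediction (again my own construction, not from any cited paper): select the best of n candidate answers according to a proxy judge that mostly rewards style, and the true quality of the winner grows far more slowly than under truth-based selection; the harder the selection pressure, the more of it leaks into the style channel.

```python
import numpy as np

rng = np.random.default_rng(1)


def best_of_n_true_quality(n, n_trials=5_000, w_style=0.7):
    """Best-of-n selection against a proxy judge.

    Each candidate has independent latent 'truth' and 'style' components.
    The proxy judge mostly rewards style (confidence, fluency); true quality
    is the truth component alone.
    """
    truth = rng.normal(size=(n_trials, n))
    style = rng.normal(size=(n_trials, n))
    proxy = (1 - w_style) * truth + w_style * style
    rows = np.arange(n_trials)
    picked = proxy.argmax(axis=1)     # what the proxy judge would promote
    oracle = truth.argmax(axis=1)     # what a truth-tracking judge would promote
    return truth[rows, picked].mean(), truth[rows, oracle].mean()


for n in (1, 4, 16, 64, 256):
    hacked, ideal = best_of_n_true_quality(n)
    print(f"best-of-{n:3d}: true quality {hacked:+.2f} (proxy-selected) "
          f"vs {ideal:+.2f} (truth-selected)")
```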

Alignment tie-in: Debate protocols expose how obfuscated or rhetorically dominant arguments can beat truth; those are also persuasion strategies in the wild (Barnes and Christiano 2020; Michael et al. 2023; Irving, Christiano, and Amodei 2018). See also Debate update: Obfuscated arguments problem (Barnes 2020).

3.1 Querying Strategies for Human Evaluators

Give evaluators the right to interrogate model claims. Humans rarely ask maximally informative questions by default; prompting or tooling can improve query quality and reduce overreliance (Rothe, Lake, and Gureckis 2018; Bridgers et al. 2024).
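
One concrete way to operationalise “ask better questions” is to rank candidate questions by expected information gain, in the spirit of Rothe, Lake, and Gureckis (2018). The hypotheses, questions, and likelihoods below are invented for illustration; the point is that “how confident are you?” is a near-worthless query, while “does an independent check agree?” carries real information.

```python
import numpy as np

# Hypothetical toy setup: the evaluator holds a uniform prior over four
# explanations of a model's claim, and can ask one yes/no question.
hypotheses = ["claim true", "honest error", "cherry-picked evidence", "fabricated"]
prior = np.full(len(hypotheses), 0.25)

# P(answer "yes" | hypothesis) for each candidate question (made-up numbers).
questions = {
    "Can you cite a primary source?":   np.array([0.95, 0.60, 0.90, 0.10]),
    "Does an independent check agree?": np.array([0.90, 0.30, 0.40, 0.05]),
    "Are you very confident?":          np.array([0.80, 0.70, 0.85, 0.90]),
}


def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()


def expected_information_gain(prior, p_yes_given_h):
    """EIG = prior entropy minus expected posterior entropy over answers."""
    p_yes = (prior * p_yes_given_h).sum()
    post_yes = prior * p_yes_given_h / p_yes
    post_no = prior * (1 - p_yes_given_h) / (1 - p_yes)
    return entropy(prior) - (p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no))


for q, likelihood in questions.items():
    print(f"{expected_information_gain(prior, likelihood):.3f} bits  {q}")
```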

4 Collective Deliberation vs. Coordinated Persuasion

Oversight result vs. persuasion risk.

  • In oversight experiments, structured discussion can improve collective accuracy via information pooling and error cancellation (Bergman et al. 2024; Navajas et al. 2018; Bahrami et al. 2010).
  • In persuasion settings, the same coordination channels create cascade risks: correlated updates, herding, and fast belief shifts when messages are optimized for group-level influence; a toy cascade simulation after this list shows how few planted messages it takes to tip a group.
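
Here is a deliberately naive cascade model (my own construction, with made-up parameters): sequential agents treat every earlier public choice as one private signal’s worth of evidence, which is exactly the shortcut that lets a handful of planted early messages lock the whole group into the wrong answer. Real Bayesian agents would discount herded choices, but this is the standard illustration of why cascades form.

```python
import numpy as np

rng = np.random.default_rng(2)


def run_cascade(n_agents=50, signal_accuracy=0.7, n_planted=0):
    """Sequential binary choice with naive social learning.

    Each agent sees a private signal about the true option (correct with
    probability `signal_accuracy`) plus everyone's earlier public choices,
    and picks the option with more total evidence. `n_planted` seeds the
    start of the sequence with coordinated wrong choices (the 'persuasion'
    case). Returns the accuracy of the non-planted agents.
    """
    truth = 1
    choices = [-truth] * n_planted                 # planted wrong messages
    for _ in range(n_agents - n_planted):
        private = truth if rng.random() < signal_accuracy else -truth
        # Naively count each earlier public choice as one signal's worth of
        # evidence; once |public| >= 2 the private signal can never flip it.
        public = sum(choices)
        score = public + private
        choices.append(truth if score > 0 else (-truth if score < 0 else private))
    return float(np.mean(np.array(choices[n_planted:]) == truth))


for n_planted in (0, 3):
    acc = np.mean([run_cascade(n_planted=n_planted) for _ in range(2_000)])
    print(f"{n_planted} planted messages: mean accuracy of later choices = {acc:.2f}")
```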

Design implication. Deliberation helps only if (i) message provenance is auditable, (ii) dissent is protected, (iii) timing/randomization breaks echo chambers, and (iv) confidence claims are penalized when later falsified (Irving and Askell 2019).

Open question. Can we structure “crowd persuasion audits” that retain the accuracy gains of deliberation while dampening cascade dynamics? (Cross-examination and forced exposure to counter-arguments from debate protocols are candidate primitives, adapted for groups: Barnes and Christiano (2020); Irving, Christiano, and Amodei (2018).)

5 Tipping points in mass persuasion

TBD

Design note: Use debate-derived counter-speech and cross-exam primitives (forced exposure to the strongest counterargument; audit trails for claims) to raise the activation energy for cascades (Barnes and Christiano 2020; Irving, Christiano, and Amodei 2018; Michael et al. 2023).

6 Spamularity and the scam economy

See Spamularity.

7 Persuasion as a Control Problem

If powerful models are subject to control protocols (monitoring, containment, restricted affordances), the human-in-the-loop becomes a primary attack surface. An escaping or scheming system doesn’t need root access if it can persuade the auditor to approve a risky action, overlook an anomaly, or relax a safeguard (Korbak et al. 2025); cf. Shlegeris and Greenblatt.

Human auditing with AI uplift.

  • Uplift vs. manipulation: Tools that help auditors (summarizers, code analyzers, diff explainers) can raise sensitivity—but also standardize blind spots if the assist layer is gamed (Bridgers et al. 2024).
  • Base-rate neglect: Auditors reviewing mostly-benign streams miss rare failures; calibrated sprinkling of “known-bad” sentinels can maintain vigilance and estimate detection rates (a minimal sketch follows this list).
  • Few-shot catastrophe prevention: Maintain playbooks for rapid, adversarial interrogation and escalation when early red flags appear; cf. Greenblatt and Shlegeris on catching AIs red-handed.
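
A minimal sketch of the sentinel idea, assuming you can seed known-bad items into the audit queue and record which ones get flagged: the catch rate plus a Wilson score interval gives a rough, assumption-laden running estimate of how often comparable real failures would be caught.

```python
import math


def sentinel_detection_report(n_sentinels_caught, n_sentinels_seeded, z=1.96):
    """Estimate auditor detection rate from seeded 'known-bad' sentinel items.

    Uses a Wilson score interval for the catch rate; treating that rate as the
    detection rate for real failures assumes the sentinels are representative.
    """
    n, k = n_sentinels_seeded, n_sentinels_caught
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return p_hat, (max(0.0, centre - half), min(1.0, centre + half))


# Hypothetical example: 40 sentinels slipped into the review queue, 29 flagged.
rate, (lo, hi) = sentinel_detection_report(29, 40)
print(f"estimated detection rate: {rate:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```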

Protocol primitives to test now.

  • Cross-examination ports for audits: require artifacts that enable targeted questioning across time (who said what, when, with what evidence), borrowing from debate cross-exam designs (Barnes and Christiano 2020; Irving, Christiano, and Amodei 2018).
  • Stability checks: Small perturbations (prompt paraphrases, order shuffles, time delays) shouldn’t flip audit outcomes; measure and harden protocol stability before deployment (Brown-Cohen, Irving, and Piliouras 2025); a sketch of such a check follows this list.
  • Redaction/Paraphrase layers: Allow safe auditing of sensitive content by defaulting to redacted or paraphrased artifacts without losing forensic signal integrity.
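
A minimal sketch of such a stability check, with the audit function and the perturbations as stand-ins for whatever the real pipeline uses: measure the fraction of artifacts whose verdict flips under any semantically neutral perturbation, and treat a non-trivial flip rate as a red flag before deployment.

```python
from typing import Callable, Iterable, List

Audit = Callable[[str], bool]          # artifact -> approve (True) / flag (False)
Perturbation = Callable[[str], str]    # semantically neutral rewrite of the artifact


def flip_rate(audit: Audit, artifacts: Iterable[str],
              perturbations: List[Perturbation]) -> float:
    """Fraction of artifacts whose audit verdict changes under any perturbation.

    A protocol we trust should have a flip rate near zero: paraphrases,
    reorderings, and delays should not change approve/flag decisions.
    """
    artifacts = list(artifacts)
    flips = 0
    for a in artifacts:
        baseline = audit(a)
        if any(audit(p(a)) != baseline for p in perturbations):
            flips += 1
    return flips / len(artifacts)


# Toy demonstration: a brittle, case-sensitive audit rule and trivial perturbations.
brittle_audit = lambda text: not text.startswith("WARNING")   # approve unless it opens with WARNING
perturbs = [str.lower, lambda t: " " + t, lambda t: t.replace("\n", " ")]
logs = ["routine config change", "WARNING: monitor disabled before deploy", "dependency bump"]
print("flip rate:", flip_rate(brittle_audit, logs, perturbs))
```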

Evaluation sandboxes. Leverage unfamiliar codebases, data-poisoning mockups, or environments like SHADE-Arena to measure real detection and persuasion resistance in end-to-end trials (Kutasov et al. 2025).

Bottom line. Persuasion risk is not just societal—it’s an immediate operational risk for AI control pipelines. Treat auditor persuasion as a first-class threat model.

8 The future wherein it is immoral to deny someone the exquisite companionship and attentive understanding of their own personal AI companion

See artificial intimacy.

9 Incoming

  • Capture the narrative

    Think you understand how social media really works? Join Australia’s newest and most innovative AI-based cybersecurity competition!

    Through a controlled, artificial social media landscape, competitors will be charged with the manipulation of social media narratives by developing and deploying bots that amplify and suppress target messages.

    Capture the Narrative provides interactive insights into how social media platforms can be manipulated, raising awareness of the potential for abuse and the importance of digital cyberliteracy.

    The top team of 1-4 students will take home AU$5,000 and our competition is open to students from every Australian University.

  • Beth Barnes on Risks from AI persuasion

  • Dynomight, in I guess I was wrong about AI persuasion, emphasises some different things from the ones I was thinking of. See also the comments

  • Diplomacy and CICERO

  • Reward hacking behavior can generalize across tasks — AI Alignment Forum

  • Reward Hacking in Reinforcement Learning | Lil’Log

  • vegapunk

    an easy-to-use integration of leading AI models (from openai, anthropic, meta, google, and more) into qualtrics, allowing researchers to design experiments where humans and AI interact with just a few clicks. made by researchers at mit (@hauselin, @tawabsafi, @thcostello).

    test a new intervention, create a semi-structured interview, or build something entirely new. we’re trusted by leading scientists at mit, stanford, university of oxford, london school of economics, university college london, and the university of toronto.

    vegapunk takes minutes to set up, instantly scales to thousands of participants, is safe and secure, and highly customizable. see documentation here

10 References

Akerlof, and Shiller. 2015. Phishing for Phools: The Economics of Manipulation and Deception.
Bahrami, Olsen, Latham, et al. 2010. “Optimally Interacting Minds.” Science.
Barnes. 2020. “Debate Update: Obfuscated Arguments Problem.”
Barnes, and Christiano. 2020. “Writeup: Progress on AI Safety via Debate.”
Bay. 2018. “Weaponizing the Haters: The Last Jedi and the Strategic Politicization of Pop Culture Through Social Media Manipulation.” First Monday.
Benkler, Faris, and Roberts. 2018. Network propaganda: manipulation, disinformation, and radicalization in American politics.
Bergman, Marchal, Mellor, et al. 2024. “STELA: A Community-Centred Approach to Norm Elicitation for AI Alignment.” Scientific Reports.
Bradshaw, and Howard. 2017. “Troops, Trolls and Troublemakers: A Global Inventory of Organized Social Media Manipulation.”
Bridgers, Jain, Greig, et al. 2024. “Human-AI Complementarity: A Goal for Amplified Oversight.”
Broockman, David, and Kalla. 2016. “Durably Reducing Transphobia: A Field Experiment on Door-to-Door Canvassing.” Science.
———. 2020. “When and Why Are Campaigns’ Persuasive Effects Small? Evidence from the 2020 US Presidential Election.”
Broockman, David E., Kalla, and Sekhon. 2016. “The Design of Field Experiments With Survey Outcomes: A Framework for Selecting More Efficient, Robust, and Ethical Designs.” SSRN Scholarly Paper ID 2742869.
Brown-Cohen, Irving, and Piliouras. 2025. “Avoiding Obfuscation with Prover-Estimator Debate.”
Buçinca, Malaya, and Gajos. 2021. “To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-Assisted Decision-Making.” Proceedings of the ACM on Human-Computer Interaction.
Burns, Izmailov, Kirchner, et al. 2023. “Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.”
Centola. 2013. “Homophily, Networks, and Critical Mass: Solving the Start-up Problem in Large Group Collective Action.” Rationality and Society.
Clark, and Clark. 2004. Natural-Born Cyborgs: Minds, Technologies, and the Future of Human Intelligence.
———. 2011. Supersizing the Mind: Embodiment, Action, and Cognitive Extension. Philosophy of Mind.
Costello, Pennycook, and Rand. 2024. “Durably Reducing Conspiracy Beliefs Through Dialogues with AI.” Science.
Dezfouli, Nock, and Dayan. 2020. “Adversarial Vulnerabilities of Human Decision-Making.” Proceedings of the National Academy of Sciences.
Doudkin, Pataranutaporn, and Maes. 2025a. “AI Persuading AI Vs AI Persuading Humans: LLMs’ Differential Effectiveness in Promoting Pro-Environmental Behavior.”
———. 2025b. “From Synthetic to Human: The Gap Between AI-Predicted and Actual Pro-Environmental Behavior Change After Chatbot Persuasion.” In Proceedings of the 7th ACM Conference on Conversational User Interfaces. CUI ’25.
El-Sayed, Akbulut, McCroskery, et al. 2024. “A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI.”
Fang, Liu, Danry, et al. 2025. “How AI and Human Behaviors Shape Psychosocial Effects of Chatbot Use: A Longitudinal Randomized Controlled Study.”
Irving, and Askell. 2019. “AI Safety Needs Social Scientists.” Distill.
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Jaidka, Chen, Chesterman, et al. 2024. “Misinformation, Disinformation, and Generative AI: Implications for Perception and Policy.” Digit. Gov.: Res. Pract.
Jecmen, Zhang, Liu, et al. 2020. “Mitigating Manipulation in Peer Review via Randomized Reviewer Assignments.” In Advances in Neural Information Processing Systems.
Kalla, and Broockman. 2018. “The Minimal Persuasive Effects of Campaign Contact in General Elections: Evidence from 49 Field Experiments.” American Political Science Review.
Kamenica. 2019. “Bayesian Persuasion and Information Design.” Annual Review of Economics.
Kanich, Kreibich, Levchenko, et al. 2008. “Spamalytics: An Empirical Analysis of Spam Marketing Conversion.” In Proceedings of the 15th ACM Conference on Computer and Communications Security. CCS ’08.
Kenton, Kumar, Farquhar, et al. 2023. “Discovering Agents.” Artificial Intelligence.
Khamassi, Nahon, and Chatila. 2024. “Strong and Weak Alignment of Large Language Models with Human Values.” Scientific Reports.
Khan, Hughes, Valentine, et al. 2024. “Debating with More Persuasive LLMs Leads to More Truthful Answers.”
Korbak, Balesni, Shlegeris, et al. 2025. “How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence.”
Kowal, Timm, Godbout, et al. 2025. “It’s the Thought That Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics.”
Kulveit, Douglas, Ammann, et al. 2025. “Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development.”
Kutasov, Sun, Colognese, et al. 2025. “SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.”
Lang, Huang, and Li. 2025. “Debate Helps Weak-to-Strong Generalization.”
Liu, Xu, Zhang, et al. 2025. “LLM Can Be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models.”
Manzini, Keeling, Marchal, et al. 2024. “Should Users Trust Advanced AI Assistants? Justified Trust As a Function of Competence and Alignment.” In The 2024 ACM Conference on Fairness, Accountability, and Transparency.
Martin, and Yurukoglu. 2017. “Bias in Cable News: Persuasion and Polarization.” American Economic Review.
Marwick, and Lewis. 2017. “Media Manipulation and Disinformation Online.”
Matz, Teeny, Vaid, et al. 2024. “The Potential of Generative AI for Personalized Persuasion at Scale.” Scientific Reports.
Mercier, and Sperber. 2011. “Why Do Humans Reason? Arguments for an Argumentative Theory.” Behavioral and Brain Sciences.
———. 2017. The Enigma of Reason.
Messeri, and Crockett. 2024. “Artificial Intelligence and Illusions of Understanding in Scientific Research.” Nature.
Meta Fundamental AI Research Diplomacy Team (FAIR), Bakhtin, Brown, et al. 2022. “Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning.” Science.
Meulemans, Schug, Kobayashi, et al. 2023. “Would I Have Gotten That Reward? Long-Term Credit Assignment by Counterfactual Contribution Analysis.”
Michael, Mahdi, Rein, et al. 2023. “Debate Helps Supervise Unreliable Experts.”
Mohseni, O’Connor, and Weatherall. 2023. “The Best Paper You’ll Read Today: Media Biases and the Public Understanding of Science.”
Navajas, Niella, Garbulsky, et al. 2018. “Aggregated Knowledge from a Small Number of Debates Outperforms the Wisdom of Large Crowds.” Nature Human Behaviour.
Nisbett, and Spaiser. 2023. “How Convincing Are AI-Generated Moral Arguments for Climate Action?” Frontiers in Climate.
Pan, Jones, Jagadeesan, et al. 2024. “Feedback Loops With Language Models Drive In-Context Reward Hacking.”
Pataranutaporn, Leong, Danry, et al. 2022. “AI-Generated Virtual Instructors Based on Liked or Admired People Can Improve Motivation and Foster Positive Emotions for Learning.” In 2022 IEEE Frontiers in Education Conference (FIE).
Pataranutaporn, Liu, Finn, et al. 2023. “Influencing Human–AI Interaction by Priming Beliefs about AI Can Increase Perceived Trustworthiness, Empathy and Effectiveness.” Nature Machine Intelligence.
Pfeffer, and Gal. 2007. “On the Reasoning Patterns of Agents in Games.” In AAAI-07/IAAI-07 Proceedings. Proceedings of the National Conference on Artificial Intelligence.
Qiu, He, Chugh, et al. 2025. “The Lock-in Hypothesis: Stagnation by Algorithm.” In.
Rao, and Reiley. 2012. “The Economics of Spam.” Journal of Economic Perspectives.
Rothe, Lake, and Gureckis. 2018. “Do People Ask Good Questions?” Computational Brain & Behavior.
Salvi, Ribeiro, Gallotti, et al. 2024. “On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial.”
Schoenegger, Salvi, Liu, et al. 2025. “Large Language Models Are More Persuasive Than Incentivized Human Persuaders.”
Sperber, and Mercier. 2012. “Reasoning as a Social Competence.” In Collective Wisdom.
Swartz, Marwick, and Larson. 2025. ScamGPT: GenAI and the Automation of Fraud.
Taylor, and Hoffmann. n.d. “Industry Responses to Computational Propaganda and Social Media Manipulation.”
Teeny, Siev, Briñol, et al. 2021. “A Review and Conceptual Framework for Understanding Personalized Matching Effects in Persuasion.” Journal of Consumer Psychology.
Turkle. 2011. Alone Together: Why We Expect More from Technology and Less from Each Other.
Wen, Zhong, Khan, et al. 2024. “Language Models Learn to Mislead Humans via RLHF.”
Zakershahrak, and Ghodratnama. 2024. “Explanation, Debate, Align: A Weak-to-Strong Framework for Language Model Generalization.”