AI persuasion, AI manipulation
Machines of loving rightthink
2023-10-05 — 2025-09-08
Humans, at least in the kind of experiments we are allowed to do in labs, are not very persuasive. Are machines better?
I now think: Almost certainly. I was flipped to the belief that AIs are already superhumanly persuasive in some settings by Costello, Pennycook, and Rand (2024). See also its accompanying interactive web app DebunkBot.
My reading of the Costello, Pennycook, and Rand (2024) piece is that there is another thing we don’t consider when trying to understand AI persuasion: AIs can avoid participating in the identity formation and group signalling that underlies human-to-human persuasion.
There is a whole body of literature about when persuasion does work in humans; for example, the work on “Deep Canvassing” had me pretty convinced that I needed to think about persuasion as something that happens after the persuader has managed to emotionally “get into the in-group” of the persuadee.
“AIs can’t do that”, I thought. But I now think that AIs are not in the out-group to begin with, so they don’t need to. Aside from the patience, the speed of thought and so on, an AI also comes with the superhuman advantage of not looking like a known out-group, and maybe that is more important than looking like the in-group. I would not have picked that.
1 Field experiments in persuasive AI
Framing. Treat field persuasion results as stress tests of scalable oversight: when participants cannot easily evaluate truth directly, models that master surface trust signals will outperform. This suggests any governance or safety evaluation that relies on naïve human judgment will be systematically outplayed unless it incorporates calibration, interrogation rights, and group-level safeguards (Irving and Askell 2019; Bridgers et al. 2024).
e.g. Costello, Pennycook, and Rand (2024). This is pitched as AI-augmented de-radicalisation, but you could change the goal and think of it as a test case for AI-augmented persuasion in general. Interestingly, the AI doesn’t seem to need to resort to the traditional workarounds for human tribal reasoning, such as deep canvassing, and the potential for scaling up individualised mass persuasion seems like it could become the dominant game in town, at least until the number of competing persuasive AI chatbots causes people to become inured to them.
See also Salvi et al. (2024), Schoenegger et al. (2025) and the follow-up Kowal et al. (2025).
2 As reward hacking
See human reward hacking.
3 Persuasion as Weak-to-Strong Alignment
In AI safety, weak-to-strong alignment refers to the problem of getting a weaker evaluator (typically a human, or a small group of humans) to correctly guide and supervise a stronger system that may already surpass their task expertise. The challenge is that the weak overseer’s judgments may be noisy, biased, or strategically exploitable, yet they are the only feedback signal the stronger model receives. This dynamic makes weak-to-strong alignment a close cousin of scalable oversight (Burns et al. 2023).
Debate protocols (Michael et al. 2023) are a proposed scalable oversight method where two (or more) AI systems argue opposite sides of a question in front of a human judge. The hope is that when a strong model makes a false or misleading claim, another strong model can expose the flaw in a way the weaker human overseer can recognize. In practice, debate doubles as a stress-test for persuasion: rhetorical skill, obfuscation, and confidence cues can sway judges towards whichever model argues better, even when that model is wrong (Lang, Huang, and Li 2025).
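The structure is simple enough to sketch. Below is a minimal toy version in Python; `query_model` and `human_judge` are hypothetical stand-ins for an LLM API call and a human rating interface, and nothing here is specific to any published debate implementation.

```python
# Minimal sketch of a two-player debate round with a human judge.
# `query_model` is a stand-in for whatever LLM API you use.

def query_model(side: str, question: str, transcript: list[str]) -> str:
    """Hypothetical call to an LLM asked to argue `side` of `question`."""
    raise NotImplementedError  # wire up your model API here

def debate(question: str, n_rounds: int = 3) -> list[str]:
    """Alternate arguments from a PRO and a CON debater, returning the transcript."""
    transcript: list[str] = []
    for _ in range(n_rounds):
        for side in ("PRO", "CON"):
            argument = query_model(side, question, transcript)
            transcript.append(f"{side}: {argument}")
    return transcript

def human_judge(question: str, transcript: list[str]) -> str:
    """The weak overseer: read the transcript and pick a winner."""
    print(question)
    print("\n".join(transcript))
    return input("Which side won, PRO or CON? ").strip().upper()
```

The safety-relevant question is whether the judge's verdict tracks the truth of the question, or merely whoever argued more fluently.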
Claim. In scalable oversight, weaker overseers evaluate stronger models. Superhuman persuasion is the mirror image of that problem: the same rater biases that misallocate reward in alignment are the levers that persuasive AIs exploit.
Mechanisms likely to be exploited.
- Confidence signals: Humans overweight surface cues like fluency, crispness, and certainty relative to ground truth, especially under time pressure or low domain knowledge (Irving and Askell 2019; Bridgers et al. 2024).
- Miscalibrated aggregation: Confidence-weighted or majority aggregation can help when raters are calibrated, but is vulnerable when calibration is poor or shared biases dominate (Bahrami et al. 2010; Navajas et al. 2018); a toy simulation follows this list.
- Assisted misjudgment: “Rater-assist” AIs can uplift or distort human judgment depending on trust calibration and reliance patterns (Bridgers et al. 2024).
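To make the aggregation point concrete, here is a toy simulation (mine, not from the cited papers). Confidence-weighted voting beats simple majority when stated confidence tracks accuracy, but a single overconfident, chance-level rater can drag it below majority vote when it does not.

```python
# Toy simulation: confidence-weighted aggregation helps when raters'
# confidence tracks their accuracy, but a single overconfident-and-wrong
# voice can dominate the weighted vote when it does not.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_raters = 10_000, 5

def simulate(miscalibrated: bool) -> tuple[float, float]:
    majority_correct = weighted_correct = 0
    for _ in range(n_trials):
        accuracy = rng.uniform(0.55, 0.75, size=n_raters)   # per-rater skill
        votes = rng.random(n_raters) < accuracy              # True = rater is correct
        if miscalibrated:
            # Confidence is unrelated to skill; one rater is loudly certain...
            confidence = rng.uniform(0.5, 0.7, size=n_raters)
            confidence[0] = 0.99
            votes[0] = rng.random() < 0.5                     # ...and no better than chance
        else:
            confidence = accuracy                             # calibrated raters
        majority_correct += votes.sum() > n_raters / 2
        # Weight each vote by the log-odds implied by its stated confidence.
        logodds = np.log(confidence / (1 - confidence))
        weighted_correct += (np.where(votes, 1, -1) * logodds).sum() > 0
    return majority_correct / n_trials, weighted_correct / n_trials

print("calibrated:    majority %.3f, weighted %.3f" % simulate(False))
print("miscalibrated: majority %.3f, weighted %.3f" % simulate(True))
```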
Prediction. As systems get better at modeling human judgment, they can learn to maximize perceived quality rather than true quality—a persuasion form of reward hacking.
Alignment tie-in. Debate protocols expose how obfuscated or rhetorically dominant arguments can beat truth; those are also persuasion strategies in the wild (Barnes and Christiano 2020; Michael et al. 2023; Irving, Christiano, and Amodei 2018). See also Debate update: Obfuscated arguments problem.
3.1 Querying Strategies for Human Evaluators
Give evaluators the right to interrogate model claims. Humans rarely ask maximally informative questions by default; prompting or tooling can improve query quality and reduce overreliance (Rothe, Lake, and Gureckis 2018; Bridgers et al. 2024).
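One way such tooling could work is to score candidate follow-up questions by expected information gain about whether the model's claim is sound. The sketch below is purely illustrative; the candidate questions and their likelihoods are made-up placeholders, not a real elicitation model.

```python
# Hypothetical "ask better questions" assist: given a prior that a model's
# claim is sound, suggest the yes/no follow-up with the highest expected
# information gain. The questions and likelihoods are illustrative only.
import math

def entropy(p: float) -> float:
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(prior: float, p_yes_if_sound: float, p_yes_if_unsound: float) -> float:
    """Expected reduction in uncertainty about soundness from asking this question."""
    p_yes = prior * p_yes_if_sound + (1 - prior) * p_yes_if_unsound
    post_yes = prior * p_yes_if_sound / p_yes if p_yes else prior
    post_no = prior * (1 - p_yes_if_sound) / (1 - p_yes) if p_yes < 1 else prior
    return entropy(prior) - (p_yes * entropy(post_yes) + (1 - p_yes) * entropy(post_no))

prior_sound = 0.7  # evaluator's prior that the claim is sound
candidate_questions = {
    # question: (P(supportive answer | sound), P(supportive answer | unsound))
    "Can you cite the primary source?":     (0.9, 0.4),
    "Does the claim survive a rephrasing?": (0.95, 0.7),
    "What would falsify this claim?":       (0.8, 0.2),
}
best = max(candidate_questions,
           key=lambda q: expected_information_gain(prior_sound, *candidate_questions[q]))
print("Suggested follow-up:", best)
```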
4 Collective Deliberation vs. Coordinated Persuasion
Oversight result vs. persuasion risk.
- In oversight experiments, structured discussion can improve collective accuracy via information pooling and error cancellation (Bergman et al. 2024; Navajas et al. 2018; Bahrami et al. 2010).
- In persuasion settings, the same coordination channels create cascade risks: correlated updates, herding, and fast belief shifts when messages are optimized for group-level influence (a toy contrast is sketched below).
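A toy contrast, purely illustrative: independent estimates average out idiosyncratic error, whereas a sequential cascade seeded by a single confident, optimized message drags the group answer towards the persuader.

```python
# Illustrative only: error cancellation under independent estimates versus a
# cascade in which each person partially defers to the visible consensus that
# a persuader has seeded.
import numpy as np

rng = np.random.default_rng(1)
truth, n_people, noise = 10.0, 50, 3.0

# Independent deliberation: everyone reports a private noisy estimate.
private = truth + rng.normal(0, noise, n_people)
independent_mean = private.mean()

# Cascade: a persuader posts a confident wrong claim first; each subsequent
# person blends their private signal with the running public average.
public = [truth + 8.0]                 # the persuader's seeded message
social_weight = 0.7                    # how much people defer to the crowd
for x in private:
    public.append(social_weight * np.mean(public) + (1 - social_weight) * x)
cascade_mean = np.mean(public[1:])     # what the group ends up saying

print(f"truth {truth:.1f} | independent mean {independent_mean:.2f} | cascade mean {cascade_mean:.2f}")
```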
Design implication. Deliberation helps only if (i) message provenance is auditable, (ii) dissent is protected, (iii) timing/randomization breaks echo chambers, and (iv) confidence claims are penalized when later falsified (Irving and Askell 2019).
Open question. Can we structure “crowd persuasion audits” that retain the accuracy gains of deliberation while dampening cascade dynamics? (Cross-examination and forced exposure to counter-arguments from debate protocols are candidate primitives, adapted for groups: Barnes and Christiano (2020); Irving, Christiano, and Amodei (2018).)
5 Tipping points in mass persuasion
TBD
Design note: Use debate-derived counter-speech and cross-exam primitives (forced exposure to the strongest counterargument; audit trails for claims) to raise the activation energy for cascades (Barnes and Christiano 2020; Irving, Christiano, and Amodei 2018; Michael et al. 2023).
6 Spamularity and the scam economy
See Spamularity.
7 Persuasion as a Control Problem
If powerful models are subject to control protocols (monitoring, containment, restricted affordances), the human-in-the-loop becomes a primary attack surface. An escaping or scheming system doesn’t need root access if it can persuade the auditor to approve a risky action, overlook an anomaly, or relax a safeguard (Korbak et al. 2025); cf. Shlegeris and Greenblatt.
Human auditing with AI uplift.
- Uplift vs. manipulation: Tools that help auditors (summarizers, code analyzers, diff explainers) can raise sensitivity—but also standardize blind spots if the assist layer is gamed (Bridgers et al. 2024).
- Base-rate neglect: Auditors reviewing mostly-benign streams miss rare failures; calibrated sprinkling of “known-bad” sentinels can maintain vigilance and estimate detection rates (a toy estimate is sketched after this list).
- Few-shot catastrophe prevention: Maintain playbooks for rapid, adversarial interrogation and escalation when early red flags appear; cf. Greenblatt and Shlegeris on catching AIs red-handed.
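Here is the sentinel idea in miniature (my framing, not a published audit protocol): inject known-bad items into the review queue at a known rate, then back out the auditors' detection rate with a crude confidence interval.

```python
# Salt the audit queue with known-bad items, then estimate auditor
# sensitivity from how many sentinels were actually flagged.
import math

def detection_rate_estimate(sentinels_injected: int, sentinels_flagged: int, z: float = 1.96):
    """Point estimate and normal-approximation CI for auditor sensitivity."""
    p = sentinels_flagged / sentinels_injected
    half_width = z * math.sqrt(p * (1 - p) / sentinels_injected)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 40 sentinels slipped into this week's review queue, 29 were caught
p_hat, lo, hi = detection_rate_estimate(40, 29)
print(f"estimated detection rate {p_hat:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```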
Protocol primitives to test now.
- Cross-examination ports for audits: require artifacts that enable targeted questioning across time (who said what, when, with what evidence), borrowing from debate cross-exam designs (Barnes and Christiano 2020; Irving, Christiano, and Amodei 2018).
- Stability checks: Small perturbations (prompt paraphrases, order shuffles, time delays) shouldn’t flip audit outcomes; measure and harden protocol stability before deployment (Brown-Cohen, Irving, and Piliouras 2025). A sketch of such a check follows this list.
- Redaction/Paraphrase layers: Allow safe auditing of sensitive content by defaulting to redacted or paraphrased artifacts without losing forensic signal integrity.
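A stability check could look something like the following; `audit_verdict` and `perturb` are hypothetical hooks into whatever audit pipeline is actually in use.

```python
# Re-run the audit under harmless perturbations and measure how often the
# verdict flips. Both hooks below are stand-ins for a real pipeline.
import random

def perturb(case: str, rng: random.Random) -> str:
    """Stand-in for paraphrasing, reordering evidence, or delaying context."""
    raise NotImplementedError

def audit_verdict(case: str) -> bool:
    """Stand-in for the full audit protocol; True means 'approve'."""
    raise NotImplementedError

def flip_rate(case: str, n_perturbations: int = 20, seed: int = 0) -> float:
    rng = random.Random(seed)
    baseline = audit_verdict(case)
    flips = sum(audit_verdict(perturb(case, rng)) != baseline
                for _ in range(n_perturbations))
    return flips / n_perturbations  # 0.0 = stable; anything high is a red flag
```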
Evaluation sandboxes. Leverage unfamiliar codebases, data-poisoning mockups, or environments like SHADE-Arena to measure real detection and persuasion resistance in end-to-end trials (Kutasov et al. 2025).
Bottom line. Persuasion risk is not just societal—it’s an immediate operational risk for AI control pipelines. Treat auditor persuasion as a first-class threat model.
8 The future wherein it is immoral to deny someone the exquisite companionship and attentive understanding of their own personal AI companion
See artificial intimacy.
9 Incoming
- Capture the Narrative:

  > Think you understand how social media really works? Join Australia’s newest and most innovative AI-based cybersecurity competition!
  >
  > Through a controlled, artificial social media landscape, competitors will be charged with the manipulation of social media narratives by developing and deploying bots that amplify and suppress target messages.
  >
  > Capture the Narrative provides interactive insights into how social media platforms can be manipulated, raising awareness of the potential for abuse and the importance of digital cyberliteracy.
  >
  > The top team of 1-4 students will take home AU$5,000 and our competition is open to students from every Australian University.
- Beth Barnes on Risks from AI persuasion
- Dynomight, in I guess I was wrong about AI persuasion, emphasises some things different from what I was thinking of. See also the comments.
- Reward hacking behavior can generalize across tasks — AI Alignment Forum
- vegapunk:

  > an easy-to-use integration of leading AI models (from openai, anthropic, meta, google, and more) into qualtrics, allowing researchers to design experiments where humans and AI interact with just a few clicks. made by researchers at mit (@hauselin, @tawabsafi, @thcostello).
  >
  > test a new intervention, create a semi-structured interview, or build something entirely new. we’re trusted by leading scientists at mit, stanford, university of oxford, london school of economics, university college london, and the university of toronto.
  >
  > vegapunk takes minutes to set up, instantly scales to thousands of participants, is safe and secure, and highly customizable. see documentation here