Proposal: Mass Epistemic Risk from AI

2025-04-11 — 2025-08-31

Wherein the societal contagion threshold is shown to be tunable by AI persuasion, and agent‑based attacker–defender simulations are proposed to map protocol trade‑offs for resilience.

adversarial
economics
faster pussycat
innovation
language
machine learning
mind
neural nets
NLP
proposal
security
tail risk
technology

I am working up some proposals in AI safety at the moment, including this one.

This proposal is the “big picture” one, which should explain to a less-technical audience why I think AI persuasion is worth keeping an eye on, and in particular the kinds of problems whose importance, in my experience, people underestimate. I argue that the kind of “weak” persuasion we already see AIs doing is dangerous right now, because it opens us up to whiplash effects, cascading social change and all sorts of “heavy-tailed”, i.e. rare-but-nasty, behaviour as we pass various thresholds of AI capability, and that this will worsen as the technology advances.

This is the first draft of a proposal that did not get accepted for funding, and as such I have not had time to make it into a better proposal yet. Consider it a sketch of the type of idea I’m thinking about here, rather than a finished product.

Just as people inside the AI safety field frequently lambast outsiders for underestimating the significance of exponential or superexponential growth in AI capabilities, despite copious evidence of its importance, I think that AI specialists are prone to underestimate exponential or superexponential shifts in mass human behaviour, despite equally copious evidence of their importance.

1 Updating Our Model of Systemic Social Risk

The mathematics of cascades, the psychology of persuasion, and the concept of heavy-tailed risks are familiar territory. We have internalized the lessons of rational bubbles, information cascades, and the power-law distributions that govern many natural and social phenomena. However, I argue that a critical threat model has been neglected because of a failure to connect two rapidly advancing but disparate fields: the quantitative modeling of social contagion and the empirical study of AI-driven persuasion.

My central thesis is this: The human social network may not be far from a critical state, perpetually poised for a phase transition. Recent breakthroughs in AI persuasion represent a tunable parameter in the equations governing that system. While many of us have intuitively understood this, the latest data (Costello, Pennycook, and Rand 2024; Liu et al. 2025; Schoenegger et al. 2025; Salvi et al. 2024) let us move from intuition to a quantitative argument. Even a small, AI-driven increase in the persuadability coefficient of individual nodes can non-linearly increase the probability of a self-sustaining, system-wide belief cascade. This proposal outlines a research agenda to model, predict, and mitigate this underappreciated, heavy-tailed risk.

In this proposal we are less worried about a gradual increase in political polarization; that framing, while important, underplays the extreme tail risks. We are, rather, concerned with the dynamics that lead to events like the Beverly Hills Supper Club fire, but at the scale of a civilization. In that event, a multitude of weak, individual signals of “this is probably fine” were amplified by social proof into collective, fatal inaction. The system was sub-critical, but fragile. We now face the prospect of an exogenous catalyst capable of pushing millions of such sub-critical systems into a super-critical state simultaneously.

2 From Individual Persuasion to Systemic Phase Transition

Let’s walk through the inferential steps. We suspect readers are familiar with each step in isolation but may not have appreciated the compounding effect that new evidence suggests we face in the field.1

2.1 The Physics of Contagion May Be Endogenously Explosive

The spread of everything from memes to financial panic is well-described by self-exciting Hawkes processes. As Didier Sornette and others have modeled, in such a process, each event increases the probability of subsequent events, creating a feedback loop. This is a mathematical description of reality that naturally produces power-law distributions of cascade sizes (Didier Sornette 2006). The key takeaway is that the potential for massive, system-spanning events is an intrinsic property of our interconnected social structure. The system is always loaded with potential energy, waiting for a sufficient trigger.
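
To make the self-excitation mechanism concrete, here is a minimal simulation sketch (my own illustration, not drawn from the cited work) of a univariate Hawkes process via Ogata’s thinning algorithm. The parameter names and values are arbitrary; the point is that as the branching ratio approaches 1, activity becomes dominated by endogenous bursts rather than the exogenous baseline.

```python
# Minimal Hawkes process simulation via Ogata's thinning (illustrative only).
import math
import random

def simulate_hawkes(mu=0.1, alpha=0.8, beta=1.0, horizon=200.0, seed=1):
    """Events arrive at baseline rate `mu`; each event adds a kernel
    alpha * beta * exp(-beta * dt) to the intensity.  The branching ratio
    (expected offspring per event) is `alpha`: sub-critical for alpha < 1,
    explosive for alpha >= 1."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while t < horizon:
        # Intensity only decays between events, so its current value is a
        # valid upper bound for thinning.
        lam_bar = mu + sum(alpha * beta * math.exp(-beta * (t - s)) for s in events)
        t += rng.expovariate(lam_bar)                # propose next candidate time
        if t >= horizon:
            break
        lam_t = mu + sum(alpha * beta * math.exp(-beta * (t - s)) for s in events)
        if rng.random() <= lam_t / lam_bar:          # accept with prob lambda(t)/lam_bar
            events.append(t)
    return events

for a in (0.2, 0.8, 0.95):
    print(a, len(simulate_hawkes(alpha=a)))          # event counts rise sharply as alpha -> 1
```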

2.2 Network Topology Creates Perceptual Distortions That Prime the System for Manipulation

We all know that network structure (i.e., the pattern of interpersonal connections) matters for humans; we rarely reason through what this means for social dynamics, however. Lerman’s work on the “Majority Illusion” demonstrates that, due to network homophily and the variance in node degree, a belief held by a small minority can be perceived as a majority view by most nodes in the network (Lerman, Yan, and Wu 2016). This is not necessarily a cognitive bias per se; it is a mathematical artifact of the network’s topology (or, if you prefer, a cognitive bias towards treating correlated data as i.i.d.). Adversarially, an AI with a god’s-eye view of the social graph can treat this as an optimization problem: identify the minimal set of nodes to influence to create a maximal Majority Illusion, effectively manufacturing the perception of a tipping point in order to bootstrap a real one.
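
As a toy illustration of how cheaply such a distortion can arise, the sketch below (my own, assuming networkx is available; the graph model and seeding rule are arbitrary choices) activates only the highest-degree few percent of nodes in a scale-free network and counts how many non-adopters nevertheless see the belief as held by at least half of their neighbours.

```python
# Toy Majority Illusion demo: a small, well-placed minority looks locally common.
import networkx as nx

G = nx.barabasi_albert_graph(n=1000, m=3, seed=0)   # heavy-tailed degree distribution

# "Adversarial" seeding: activate only the 5% highest-degree nodes.
by_degree = sorted(G.nodes, key=lambda v: G.degree(v), reverse=True)
active = set(by_degree[: len(G) // 20])

def sees_majority(v):
    """True if at least half of v's neighbours hold the active belief."""
    nbrs = list(G.neighbors(v))
    return bool(nbrs) and sum(u in active for u in nbrs) >= len(nbrs) / 2

fooled = sum(sees_majority(v) for v in G.nodes if v not in active)
print(f"actual prevalence: {len(active) / len(G):.0%}")
print(f"non-adopters who perceive a local majority: {fooled / (len(G) - len(active)):.0%}")
```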

2.3 If There Is a “Persuasion Coefficient”, It Is Demonstrably Malleable by Non-Human Agents

This is the new capability that compels us to update our models. For a long time, we could assume that the probability of persuading a human on a deeply held belief was low and constrained by human limitations (e.g., in-group/out-group psychology, time, patience). This assumption is now obsolete.

Costello, Pennycook, and Rand (2024) is, in my opinion, a large upward update to the likelihood of tail events in our threat model. It provides empirical evidence that an AI, by virtue of being a patient, non-judgmental, non-human entity, can bypass the standard identity-defence mechanisms that make human-to-human persuasion so difficult, even on politicized topics; that is, superhuman capability in this area has already arrived.

2.4 Synthesis

Let’s formalize this with a very simple model. In a simple branching process model of a cascade (a special case of a Hawkes process), the system’s behaviour is governed by the reproduction number \(R_0\): the average number of new “infections” (i.e., belief adoptions) caused by a single infected individual.

For a network where each node has an average of \(k\) contacts (the average degree), and the probability of transmitting the belief during a single contact is \(p\), the reproduction number is given by:

\[R_0 = kp\]

The system’s behaviour exhibits a critical threshold at \(R_0 = 1\):

  • If \(R_0 < 1\) (sub-critical): Any cascade is guaranteed to die out. The expected total size, \(E[S]\), of a cascade starting from a single node is finite and given by \(E[S] = 1 / (1 - R_0)\).
  • If \(R_0 > 1\) (super-critical): There is a non-zero probability, let’s call it \(\rho\), that the cascade will never die out, instead growing to encompass a macroscopic fraction of the network. This probability of an “infinite” cascade is \(\rho = 1 - q\), where \(q\) is the probability of the cascade’s eventual extinction. The value of \(q\) is the smallest non-negative solution to \(q = G(q)\), where \(G(s)\) is the probability generating function of the offspring distribution (a numerical sketch of both regimes follows below).
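
Here is a small numerical sketch of both regimes, assuming (purely for illustration; the proposal does not commit to this choice) a Poisson offspring distribution, for which \(G(s) = \exp(R_0 (s - 1))\).

```python
# Expected cascade size (sub-critical) and survival probability (super-critical)
# for an illustrative Poisson offspring distribution.
import math

def expected_cascade_size(R0):
    """Expected total cascade size from one seed; valid for R0 < 1."""
    return 1.0 / (1.0 - R0)

def survival_probability(R0, iters=200):
    """rho = 1 - q, where q is the smallest fixed point of q = G(q).
    Fixed-point iteration from q = 0 converges to the extinction probability."""
    q = 0.0
    for _ in range(iters):
        q = math.exp(R0 * (q - 1.0))
    return 1.0 - q

print(expected_cascade_size(0.95))    # ~20: sub-critical, cascades fizzle
print(expected_cascade_size(0.999))   # ~1000: still finite, but 50x larger
print(survival_probability(1.05))     # > 0: super-critical, some cascades never die
```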

My interpretation of the Costello et al. study is that AI persuasion provides a lever to directly manipulate the contact transmission probability, \(p\). The catastrophic error for a policymaker is to assume that the effect of changing \(p\) on the final outcome is linear. The mathematics of branching processes shows this is false in the most dramatic way possible.

Consider the behaviour near the critical point \(R_0 = 1\):

  • If the system’s organic \(R_0 = kp = 0.95\), the expected cascade size is \(E[S] = 1 / (1 - 0.95)\), i.e. 20 people. The system is stable.
  • Now imagine an AI persuasion campaign increases the transmission probability \(p\) by roughly 5%, pushing \(R_0\) to 0.999. The new expected cascade size is \(E[S] = 1 / (1 - 0.999)\), i.e. 1000 people. A tiny change in the input parameter has produced a 50× increase in the expected output.
  • If that same AI pushes \(p\) slightly further, so that \(R_0\) becomes 1.05, the system crosses the critical threshold. The expected size of any cascade is now infinite (the formula diverges), and a qualitatively new phenomenon emerges: the non-zero probability \(\rho\) of a system-spanning, macroscopic cascade.

This small change doesn’t just make cascades somewhat bigger; it pushes the entire system across the critical threshold, into the regime of explosive, heavy-tailed outcomes.
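
A rough Monte Carlo sketch of the same point, again under the illustrative assumption of Poisson offspring: the move from \(R_0 = 0.95\) to \(R_0 = 1.05\) shifts the whole cascade-size distribution, not just its mean (super-critical runs are truncated at an arbitrary cap so they remain finite).

```python
# Monte Carlo branching-process cascades near the critical point (illustrative).
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for small lam."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def cascade_size(R0, rng, cap=20_000):
    """Total progeny (including the seed) of a cascade with Poisson(R0) offspring."""
    size = frontier = 1
    while frontier and size < cap:
        frontier = sum(poisson(R0, rng) for _ in range(frontier))
        size += frontier
    return min(size, cap)

rng = random.Random(0)
for R0 in (0.95, 0.999, 1.05):
    sizes = sorted(cascade_size(R0, rng) for _ in range(1000))
    mean = sum(sizes) / len(sizes)
    print(f"R0={R0}: mean size ~{mean:.0f}, 99th percentile ~{sizes[990]}")
```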

The new empirical results in AI persuasion demonstrate that the parameter \(p\) is tunable. This means the reproduction number \(R_0\) is now an “attack variable”. We are moving from a world where massive social shifts were rare, stochastic “black swan” events governed by a near-critical \(R_0\), to a world where an adversary can deterministically push the system into a super-critical state to engineer such an event.

Our current AI safety paradigm is insufficient because it is agent-centric. We are trying to align the AI, but we are ignoring the emergent instabilities of the substrate on which it will operate. Even a perfectly aligned, benevolent AI tasked with “reducing misinformation” could, by applying its superhuman persuasive abilities, inadvertently trigger a catastrophic cascade that destabilizes society.

3 Initial Research Agendas for Epistemic Security

The preceding analysis opens a vast research landscape. Key open questions include:

  • The Topology-Threshold Problem: What is the precise functional relationship between the structural properties of a real-world social network (e.g., its degree distribution, community structure, homophily) and its critical threshold \(R_0\) for cascades?
  • The Signal-in-Noise Problem: Can AI-catalyzed cascades be reliably detected against the backdrop of immense organic social noise? What are the minimal, sufficient statistical signals for early detection?
  • The Intervention Trade-off Problem: What are the second-order effects of any potential mitigation? How do we design interventions that dampen malign cascades without simultaneously crippling the spread of beneficial social movements or scientific truths?

We propose two initial, complementary research projects designed to produce tangible insights and build a foundation for the broader topic of epistemic security. These are not exhaustive, but they represent a credible starting point.

3.1 Causal Tomography of Persuasion

This project addresses a fundamental measurement problem: how do we formalize and detect the transfer of cognitive agency from a human to a persuasive AI? I have expanded this proposal out at length as A Theory and Measurement Framework for Detecting Agency and Manipulation in Human–AI Systems.

3.2 Adversarial Simulation for Protocol Design

This project tackles the problem of systemic resilience. Instead of focusing on the individual, it focuses on the environment, seeking to discover communication protocols that are inherently resistant to manipulation.

  • Problem: We lack a principled way to design social networks that are robust against engineered cascades. The design space is too vast, and real-world experimentation is too dangerous.
  • Proposed Method: We will construct a high-fidelity agent-based model of a social network, creating a simulated “epistemic environment”. Within this simulation, we will use reinforcement learning to train two competing agents (a minimal structural sketch of one round of this game follows the list below):
  1. An “Attacker” (Persuader AI): Its goal is to maximize the size and speed of belief cascades, using strategies like engineering Majority Illusions.
  2. A “Defender” (Network Architect AI): Its goal is to minimize malign cascades without resorting to centralized content removal. Its action space is limited to subtle, protocol-level changes: altering feed algorithms to favour source diversity, introducing small amounts of friction for high-velocity content, or providing metadata about a meme’s novelty and origin.
  • Concrete Objective: By having these agents compete over millions of simulations, we will not find a single “perfect” protocol. Instead, we will map the Pareto frontier of the design space, revealing the fundamental trade-offs between a network’s resilience to manipulation and its efficiency in spreading legitimate information. This provides a principled, evidence-based playbook for building safer information ecosystems.
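
The sketch below shows the shape of one round of the attacker–defender game. Everything in it (the class name, parameters, the single “friction” knob, the reward terms) is a placeholder I have invented for illustration; the actual project would use a calibrated agent-based model and reinforcement learning over much richer action spaces, not a fixed parameter sweep.

```python
# Structural sketch of the attacker-defender loop (all names/values illustrative).
import random

class EpistemicEnvironment:
    """Toy belief-contagion environment on a random contact structure."""

    def __init__(self, n=500, k=8, p_transmit=0.13, seed=0):
        self.rng = random.Random(seed)
        self.n, self.p_transmit = n, p_transmit
        # Each node gets k random contacts; a crude stand-in for a real social graph.
        self.contacts = {v: self.rng.sample(range(n), k) for v in range(n)}

    def run_cascade(self, seeds, friction):
        """Simulate one cascade. `friction` in [0, 1] scales down transmission,
        standing in for protocol-level interventions (rate limits, source diversity)."""
        p = self.p_transmit * (1.0 - friction)
        adopted, frontier = set(seeds), list(seeds)
        while frontier:
            v = frontier.pop()
            for u in self.contacts[v]:
                if u not in adopted and self.rng.random() < p:
                    adopted.add(u)
                    frontier.append(u)
        return len(adopted)

env = EpistemicEnvironment()
attacker_seeds = env.rng.sample(range(env.n), 5)   # attacker's action: choose seed nodes

for friction in (0.0, 0.2, 0.4):                   # defender's action: tune protocol friction
    size = env.run_cascade(attacker_seeds, friction)
    attacker_reward = size                         # attacker wants large cascades
    defender_reward = -size - 100 * friction       # defender also pays for lost legitimate reach
    print(f"friction={friction}: cascade size={size}, defender reward={defender_reward}")
```

The defender’s reward deliberately penalises both cascade size and friction, so sweeping (or, in the full project, learning) the trade-off traces out exactly the kind of resilience-versus-efficiency frontier described above.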

4 Conclusion

The risk from AI persuasion has been systematically underestimated because we have failed to connect the dots between the latent, explosive potential of our social networks and the now-demonstrated ability of AI to act as a catalyst. This is no longer a problem of psychology; it is a problem of statistical physics. The integrity of our shared epistemic commons is a systemic safety issue. The projects outlined above represent concrete first steps toward the rigorous, quantitative science of epistemic security required to navigate the coming storm. We must begin the work of modeling and securing our collective intellect before an engineered cascade pushes us past a point of no return.

5 References

Bahrami, Olsen, Latham, et al. 2010. “Optimally Interacting Minds.” Science.
Barash. 2011. “The Dynamics Of Social Contagion.”
Barez, Friend, Reid, et al. 2025. “Toward Resisting AI-Enabled Authoritarianism.”
Barnes. 2020. “Debate Update: Obfuscated Arguments Problem.”
Barnes, and Christiano. 2020. “Writeup: Progress on AI Safety via Debate.”
Bentley, Ormerod, and Batty. 2011. “Evolving Social Influence in Large Populations.” Behavioral Ecology and Sociobiology.
Bentley, Ormerod, and Shennan. 2011. “Population-Level Neutral Model Already Explains Linguistic Patterns.” Proceedings of the Royal Society B: Biological Sciences.
Bergman, Marchal, Mellor, et al. 2024. “STELA: A Community-Centred Approach to Norm Elicitation for AI Alignment.” Scientific Reports.
Bridgers, Jain, Greig, et al. 2024. “Human-AI Complementarity: A Goal for Amplified Oversight.”
Broockman, and Kalla. 2016. “Durably Reducing Transphobia: A Field Experiment on Door-to-Door Canvassing.” Science.
Brown-Cohen, Irving, and Piliouras. 2025. “Avoiding Obfuscation with Prover-Estimator Debate.”
Buchanan. 2019. “Brexit Behaviourally: Lessons Learned from the 2016 Referendum.” Mind & Society.
Centola, Damon. 2018. How Behavior Spreads: The Science of Complex Contagions.
———. 2021. Change: How to Make Big Things Happen.
Centola, D, and Macy. 2007. “Complex Contagions and the Weakness of Long Ties.” American Journal of Sociology.
Costello, Pennycook, and Rand. 2024. “Durably Reducing Conspiracy Beliefs Through Dialogues with AI.” Science.
Dezfouli, Griffiths, Ramos, et al. 2019. “Models That Learn How Humans Learn: The Case of Decision-Making and Its Disorders.” PLOS Computational Biology.
Dezfouli, Nock, and Dayan. 2020. “Adversarial Vulnerabilities of Human Decision-Making.” Proceedings of the National Academy of Sciences.
Doudkin, Pataranutaporn, and Maes. 2025. “AI Persuading AI Vs AI Persuading Humans: LLMs’ Differential Effectiveness in Promoting Pro-Environmental Behavior.”
Evans. 2017. “The Economics of Attention Markets.” SSRN Scholarly Paper ID 3044858.
Gelman, and Margalit. 2021. “Social Penumbras Predict Political Attitudes.” Proceedings of the National Academy of Sciences.
Howard, and Kollanyi. 2016. “Bots, #StrongerIn, and #Brexit: Computational Propaganda During the UK-EU Referendum.”
Hyland, Gavenčiak, Costa, et al. 2024. “Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents.” In.
Irving, and Askell. 2019. “AI Safety Needs Social Scientists.” Distill.
Irving, Christiano, and Amodei. 2018. “AI Safety via Debate.”
Kellow, and Steeves. 1998. “The Role of Radio in the Rwandan Genocide.” Journal of Communication.
Korbak, Balesni, Shlegeris, et al. 2025. “How to Evaluate Control Measures for LLM Agents? A Trajectory from Today to Superintelligence.”
Kowal, Timm, Godbout, et al. 2025. “It’s the Thought That Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics.”
Kulveit, Douglas, Ammann, et al. 2025. “Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development.”
Kutasov, Sun, Colognese, et al. 2025. “SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents.”
Lerman, Yan, and Wu. 2016. “The ‘Majority Illusion’ in Social Networks.” PLOS ONE.
Littman. 2021. “Collusion Rings Threaten the Integrity of Computer Science Research.” Communications of the ACM.
Liu, Xu, Zhang, et al. 2025. “LLM Can Be a Dangerous Persuader: Empirical Study of Persuasion Safety in Large Language Models.”
Matz, Teeny, Vaid, et al. 2024. “The Potential of Generative AI for Personalized Persuasion at Scale.” Scientific Reports.
Michael, Mahdi, Rein, et al. 2023. “Debate Helps Supervise Unreliable Experts.”
Navajas, Niella, Garbulsky, et al. 2018. “Aggregated Knowledge from a Small Number of Debates Outperforms the Wisdom of Large Crowds.” Nature Human Behaviour.
Ormerod, and Wiltshire. 2009. “‘Binge’ Drinking in the UK: A Social Network Phenomenon.” Mind & Society.
Qiu, He, Chugh, et al. 2025. “The Lock-in Hypothesis: Stagnation by Algorithm.” In.
Rogall. 2014. “Mobilizing the Masses for Genocide.”
Rothe, Lake, and Gureckis. 2018. “Do People Ask Good Questions?” Computational Brain & Behavior.
Salvi, Ribeiro, Gallotti, et al. 2024. “On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial.”
Schoenegger, Salvi, Liu, et al. 2025. “Large Language Models Are More Persuasive Than Incentivized Human Persuaders.”
Sornette, Didier. 2006. “Endogenous Versus Exogenous Origins of Crises.” In Extreme Events in Nature and Society. The Frontiers Collection.
Sornette, Didier, Deschâtres, Gilbert, et al. 2004. “Endogenous Versus Exogenous Shocks in Complex Networks: An Empirical Test Using Book Sale Rankings.” Physical Review Letters.
Sornette, D, and Helmstetter. 2003. “Endogenous Versus Exogenous Shocks in Systems with Memory.” Physica A: Statistical Mechanics and Its Applications.
Trouche, Johansson, Hall, et al. 2016. “The Selective Laziness of Reasoning.” Cognitive Science.
Wen, Zhong, Khan, et al. 2024. “Language Models Learn to Mislead Humans via RLHF.”
Zheleva, and Arbour. 2021. “Causal Inference from Network Data.” In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD ’21.

Footnotes

  1. I wrote an older report on cascades in social change for a non-specialist audience, if you’d like a refresher.↩︎