AI Safety
Getting ready for the grown-ups to arrive
2024-10-31 — 2025-09-30
Wherein the risks of rapidly advancing artificial intelligence are surveyed and seven domains are enumerated, with supply‑chain data exfiltration and autonomous weaponization singled out, and technical mitigations sketched.
On what might go wrong with rapidly improving AI and how we can mitigate the risks it poses.
This page is a hub, not an encyclopedia. It sketches the field, then points to short sub‑pages for each risk category. For deeper, faster‑moving notes, see the AI safety category.
1 What I mean by “AI safety”
There are now many competing lists of AI risks, using overlapping words to mean different things.
My own risk model is broad.
- There’s a risk of perpetuating or exacerbating inequality across society.
- We stand a good chance of losing the delicate human understanding that underpins liberal society and democracy.
- There’s also the danger of total epistemic collapse, where we lose the ability to understand the world and make good decisions about it, drowning in a sea of AI slop.
- Moreover, we face psychological risks of becoming alienated as machine intelligences infiltrate every aspect of our lives.
- We risk getting shot by autonomous weaponized drones as geopolitical tensions rise.
- And there’s the looming threat of autonomous intelligent algorithms spiralling out of control and causing untold harm to humanity.
But those are my own categories. A useful way to cut through the noise is to adopt a single, reasonably broad frame with well-defined categories. I’m using the AI Risk Repository (Slattery et al. 2025) as that frame—a living database synthesized from dozens of taxonomies and way too much literature.
Slattery et al. (2025) classifies any given risk by three causal factors:
- Entity (Human, AI, Other),
- Intent (Intentional, Unintentional, Other),
- Timing (Pre‑deployment, Post‑deployment, Other).
They also categorize risks into seven domains. Below I use those domains as headers and link to further reading. Think of this as an index, not a ranking.
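To make the frame concrete, here is a minimal sketch of how one might tag a single risk with the causal factors plus a domain label. The field and enum names are my own illustration, not the repository’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative only: these enums mirror the causal taxonomy described above,
# but the names are mine, not the AI Risk Repository's schema.
class Entity(Enum):
    HUMAN = "human"
    AI = "ai"
    OTHER = "other"

class Intent(Enum):
    INTENTIONAL = "intentional"
    UNINTENTIONAL = "unintentional"
    OTHER = "other"

class Timing(Enum):
    PRE_DEPLOYMENT = "pre-deployment"
    POST_DEPLOYMENT = "post-deployment"
    OTHER = "other"

@dataclass
class Risk:
    description: str
    entity: Entity
    intent: Intent
    timing: Timing
    domain: str  # one of the seven domains used as headers below

# Example: a prompt-injection-driven data leak, tagged under this frame.
example = Risk(
    description="Prompt injection used to exfiltrate private data",
    entity=Entity.HUMAN,
    intent=Intent.INTENTIONAL,
    timing=Timing.POST_DEPLOYMENT,
    domain="Privacy & security",
)
print(example)
```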
2 The seven domains
2.1 Discrimination & toxicity
Unfair outcomes across groups; representational harms; and systems that expose people to abusive, dangerous, or otherwise toxic content. This includes unequal performance across demographics and content that encourages harmful behaviour. See: Discrimination & toxicity (notes)
2.2 Privacy & security
Leakage or inference of sensitive information; model and data vulnerabilities; supply‑chain attacks against AI toolchains and infrastructure. Think membership inference, prompt injection variants, data exfiltration, and compromised eval/deploy pipelines. I actually haven’t started a page on this yet; TBC.
2.3 Misinformation
False or misleading outputs (“hallucinations”), plus system‑level effects like information pollution, personalization bubbles, and erosion of shared reality. Short‑term harms can be mundane but pervasive.
I’ve started a page on AI persuasion but haven’t really sorted out how to treat misinformation yet; TBC. I think AI disempowerment fits here too.
2.4 Malicious actors & misuse
Deliberate abuse: scalable persuasion and surveillance, automated cyberattacks, weapon development/enablement, and other routes to mass harm. Intentional, human‑driven risks dominate here.
I haven’t yet covered weaponized AI; TBC.
2.5 Human–computer interaction
Over‑reliance, automation bias, degraded human agency, and unsafe use patterns. Interfaces, affordances, and organizational context matter as much as model weights.
I think artificial intimacy fits here.
2.6 Socioeconomic & environmental
Distributional shifts (who gains/loses power), labour market disruption and deskilling, competitive dynamics, governance failures, and environmental externalities (energy, water, hardware).
I have a page on AI economics and automation that covers labour disruption. I think AI disempowerment fits this one too.
2.7 AI system safety, failures & limitations
Specification problems, reward hacking, robustness/capability gaps, multi‑agent pathologies, autonomy getting ahead of oversight, and questions about AI goals/values and even AI moral patienthood. This is where a lot of “alignment” and reliability work lives.
Related: Alignment problems, Mechanistic interpretability, Singular learning theory
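As a cartoon of the specification-gaming / reward-hacking failure mode above: the toy cleaning bot below is scored on a proxy (number of cleaning actions) rather than the intended goal (number of distinct cells cleaned), and the proxy-optimal policy is degenerate. Everything here is made up for illustration.

```python
# The intended goal is to clean many distinct cells, but the proxy reward
# only counts "clean" actions, so the proxy-optimal policy is to re-clean
# the same cell forever.
def proxy_reward(actions):
    return sum(1 for a in actions if a == "clean")

def true_reward(actions, n_cells=10):
    cleaned, pos = set(), 0
    for a in actions:
        if a == "move":
            pos = (pos + 1) % n_cells
        elif a == "clean":
            cleaned.add(pos)
    return len(cleaned)

gamer = ["clean"] * 20           # maximizes the proxy, never moves
honest = ["clean", "move"] * 10  # alternates cleaning and moving

print("proxy:", proxy_reward(gamer), proxy_reward(honest))  # 20 vs 10
print("true: ", true_reward(gamer), true_reward(honest))    # 1 vs 10
```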
3 Beyond “ordinary” risk: X‑risk and S‑risk
Some scenarios deserve their own pages because the tail is fat even if the probabilities are contested. X‑risk collects ways AI could plausibly contribute to irreversible, species‑level bad outcomes; S‑risk homes in on the prospect of astronomical suffering. I take both seriously as decision‑relevant even if we assign low absolute probabilities.
See: Catastrophic risk
4 Field background
4.1 X-risk risk
Some people, especially accelerationists, argue that focusing on X-risk distracts from more pressing problems.
For example: what if we fail to solve the climate crisis because we put our effort into AI risk instead? What if, by slowing AI down, we forgo AI that could have helped save us? What if we get so distracted that we miss other, more pressing risks?
Example: Superintelligence: The Idea That Eats Smart People¹
There’s also an attempt to kick off a culture war about which risks are more legitimate, arguing that people who focus on the ‘wrong’ risks are bad; this is the TESCREALism thing. To be honest, I find this framing unhelpful, and I’m very tired of culture wars, but some people seem to enjoy it.
4.2 Intuitions, vibes and AI safety
Reasoning about the harms of AI is weird: AI is something our ancestors didn’t have, and something we don’t fully understand.
5 Pointers
- Careers: My AI safety career notes
- Governance: AI in governance
- Local context: AI safety in Australia
- Technical on‑ramps: AI Safety Fundamentals, MATS, and the AI Safety, Ethics, and Society textbook
6 Technical safety
6.1 Courses
6.2 Methods
- Singular learning theory has been pitched to me as a tool with applications to AI safety.
- Sparse autoencoders for explanation have had a moment (see the toy sketch after this list).
- General alignment is implementable for AI systems in some interesting special cases.
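To give a flavour of the sparse-autoencoder idea, here is a toy sketch, assuming PyTorch, with placeholder dimensions and penalty weight rather than anything from a real interpretability pipeline. The point is just the shape of the method: reconstruct model activations through an overcomplete latent layer with an L1 penalty, hoping the resulting sparse features are individually interpretable.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations (illustrative sizes)."""
    def __init__(self, d_activation: int = 512, d_latent: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_latent)
        self.decoder = nn.Linear(d_latent, d_activation)

    def forward(self, x: torch.Tensor):
        # ReLU keeps latent codes non-negative; the L1 term below pushes most to zero.
        z = torch.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latent code.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coeff * sparsity

# Minimal training step on fake "activations".
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)  # stand-in for a batch of residual-stream activations
x_hat, z = sae(activations)
loss = sae_loss(activations, x_hat, z)
opt.zero_grad()
loss.backward()
opt.step()
```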
7 In Australia
8 S-risk stuff
S-risk work is about reducing the risk of astronomical suffering in the future, especially suffering caused or enabled by AI. I don’t object to that priority at all on paper, but it does get very weird in practice because of the utilitarian framing and problems like the repugnant conclusion.
9 Landscape
Organizations I feel I have a nodding acquaintance with. TODO: turn them into a spreadsheet.
- Center for the Alignment of AI Alignment Centers
- The Big Nonprofits Post - by Zvi Mowshowitz
- Communities – AISafety.com
- AI Governance: A Research Agenda | GovAI
- About & Contact - Safe AI Forum
- Long-Term Future Fund | Effective Altruism Funds
- AI Safety Science | Schmidt Sciences
- Request for Proposals - Foresight Institute
- Recent Frontier Models Are Reward Hacking - METR
10 Incoming
Holden Karnofsky, The “most important century” blog post series
Robert Wiblin’s analysis: This could be the most important century
Writing Doom – Award-Winning Short Film on Superintelligence (2024) (video)
AiSafety.com’s landscape map: https://aisafety.world/
Ten Hard Problems in and around AI
We finally published our big 90-page intro to AI. Its likely effects, from ten perspectives, ten camps. The whole gamut: ML, scientific applications, social applications, access, safety and alignment, economics, AI ethics, governance, and classical philosophy of life.
The follow-on 2024 Survey of 2,778 AI authors: six parts in pictures
Douglas Hofstadter changes his mind on Deep Learning & AI risk
François Chollet, The implausibility of intelligence explosion
Attempted Gears Analysis of AGI Intervention Discussion With Eliezer
Kevin Scott argues for finding a unifying notion of knowledge work that covers what both humans and machines can do (Scott 2022).
Frontier AI systems have surpassed the self-replicating red line
11 References
Footnotes
¹ I thought that effective altruism meta criticism was the idea that ate smart people.