AI Safety

Getting ready for the grown-ups to arrive

2024-10-31 — 2025-09-30

Wherein the risks of rapidly advancing artificial intelligence are surveyed and seven domains are enumerated, with supply‑chain data exfiltration and autonomous weaponization singled out, and technical mitigations sketched.

AI safety
adversarial
catastrophe
economics
faster pussycat
innovation
machine learning
institutions
networks
tail risks
wonk
Figure 1

On what might go wrong with rapidly improving AI and how we can mitigate the risks it poses.

CautionAttention conservation notice

This page is a hub, not an encyclopedia. It sketches the field, then points to short sub‑pages for each risk category. For deeper, more fast‑moving notes, see the AI safety category.

1 What I mean by “AI safety”

There are now many competing lists of AI risks, using overlapping words to mean different things.

My own risk model is broad.

But those are my own categories. A useful way to cut through the noise is to adopt a single, reasonably broad frame with well-defined categories. I’m using the AI Risk Repository (Slattery et al. 2025) as that frame—a living database synthesized from dozens of taxonomies and way too much literature.

Slattery et al. (2025) classifies any given risk by three causal factors:

  1. Entity (Human, AI, Other),
  2. Intent (Intentional, Unintentional, Other),
  3. Timing (Pre‑deployment, Post‑deployment, Other).

They also categorize risks by domain factors. Below I use those seven domains as headers and give some links to further reading. Think of this as an index, not a ranking.

2 The seven domains

2.1 Discrimination & toxicity

Unfair outcomes across groups; representational harms; and systems that expose people to abusive, dangerous, or otherwise toxic content. This includes unequal performance across demographics and content that encourages harmful behaviour. See: Discrimination & toxicity (notes)

2.2 Privacy & security

Leakage or inference of sensitive information; model and data vulnerabilities; supply‑chain attacks against AI toolchains and infrastructure. Think membership inference, prompt injection variants, data exfiltration, and compromised eval/deploy pipelines. I actually haven’t started a page on this yet; TBC.

2.3 Misinformation

False or misleading outputs (“hallucinations”), plus system‑level effects like information pollution, personalization bubbles, and erosion of shared reality. Short‑term harms can be mundane but pervasive.

I’ve started a page on AI persuasion but haven’t really sorted out how to treat misinformation yet; TBC. I think AI disempowerment fits here too.

2.4 Malicious actors & misuse

Deliberate abuse: scalable persuasion and surveillance, automated cyberattacks, weapon development/enablement, and other routes to mass harm. Intentional, human‑driven risks dominate here.

I haven’t yet covered weaponized AI; TBC.

2.5 Human–computer interaction

Over‑reliance, automation bias, degraded human agency, and unsafe use patterns. Interfaces, affordances, and organizational context matter as much as model weights.

I think artificial intimacy fits here.

2.6 Socioeconomic & environmental

Distributional shifts (who gains/loses power), labour market disruption and deskilling, competitive dynamics, governance failures, and environmental externalities (energy, water, hardware).

I have a page on AI economics and automation that covers labour disruption. I think AI disempowerment fits this one too.

2.7 AI system safety, failures & limitations

Specification problems, reward hacking, robustness/capability gaps, multi‑agent pathologies, autonomy getting ahead of oversight, and questions about AI goals/values and even AI moral patienthood. This is where a lot of “alignment” and reliability work lives.

Related: Alignment problems, Mechanistic interpretability, Singular learning theory

3 Beyond “ordinary” risk: X‑risk and S‑risk

Some scenarios deserve their own pages because the tail is fat even if the probabilities are contested. X‑risk collects ways AI could plausibly contribute to irreversible, species‑level bad outcomes; S‑risk homes in on the prospect of astronomical suffering. I take both seriously as decision‑relevant even if we assign low absolute probabilities.

See: Catastrophic risk

Figure 2

4 Field background

4.1 X-risk risk

Some people, especially accelerationists, argue that focusing on X-risk distracts from more pressing problems.

e.g. what if we fail to solve the climate crisis because we put effort into the AI risks instead? Or what if we spend so much effort that we slow down AI that could have saved us? Or what if we get so distracted that we miss other, more pressing risks?

Example: Superintelligence: The Idea That Eats Smart People1

There’s also an attempt to kick off a culture war about which risks are more legitimate, arguing that people who focus on the ’wrong’ risks are bad; this is the TESCREALism thing. To be honest, I find this framing unhelpful, and I’m very tired of culture wars, but some people seem to enjoy it.

4.2 Intuitions, vibes and AI safety

Figure 3

Reasoning about the harms of AI is weird. AI is something our ancestors didn’t have, and we don’t fully understand.

5 Pointers

6 Technical safety

6.1 Courses

6.2 Methods

7 In Australia

See AI Safety in Australia.

Figure 4: Looks like AI Safety is going fine.

8 S-risk stuff

S-risk is about reducing suffering in the future, especially in the context of AI. I don’t object to that priority at all on paper, but it does get very weird in practice because of the utilitarian framing and problems like the repugnant conclusion.

9 Landscape

Organizations I feel I have a nodding acquaintance with. TODO: turn them into a spreadsheet.

10 Incoming

11 References

Baumann. 2022. Avoiding the Worst.
Bengio. 2024. International Scientific Report on the Safety of Advanced AI - Interim Report.”
Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies.
Ecoffet, and Lehman. 2021. Reinforcement Learning Under Moral Uncertainty.” In Proceedings of the 38th International Conference on Machine Learning.
Everitt, Lea, and Hutter. 2018. AGI Safety Literature Review.”
Grace, Stewart, Sandkühler, et al. 2024. Thousands of AI Authors on the Future of AI.”
Greenblatt, Shlegeris, Sachan, et al. 2024. AI Control: Improving Safety Despite Intentional Subversion.”
Hammond, Chan, Clifton, et al. 2025. Multi-Agent Risks from Advanced AI.”
Hendrycks, Mazeika, and Woodside. 2023. An Overview of Catastrophic AI Risks.”
Kirilenko, Kyle, Samadi, et al. 2011. The Flash Crash: The Impact of High Frequency Trading on an Electronic Market.” SSRN Electronic Journal.
Manheim, and Garrabrant. 2019. Categorizing Variants of Goodhart’s Law.”
Nathan, and Hyams. 2021. Global Policymakers and Catastrophic Risk.” Policy Sciences.
Ngo, Chan, and Mindermann. 2024. The Alignment Problem from a Deep Learning Perspective.”
Omohundro. 2008. The Basic AI Drives.” In Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference.
Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control.
Saeri, Noetel, and Graham. 2024. Survey Assessing Risks from Artificial Intelligence (Technical Report).” SSRN Scholarly Paper.
Sastry, Heim, Belfield, et al. n.d. “Computing Power and the Governance of Artificial Intelligence.”
Scott. 2022. I Do Not Think It Means What You Think It Means: Artificial Intelligence, Cognitive Work & Scale.” American Academy of Arts & Sciences.
Slattery, Saeri, Grundy, et al. 2025. The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence.”
Taylor, Yudkowsky, LaVictoire, et al. 2020. Alignment for Advanced Machine Learning Systems.” In Ethics of Artificial Intelligence.
Weidinger, Uesato, Rauh, et al. 2022. Taxonomy of Risks Posed by Language Models.” In 2022 ACM Conference on Fairness, Accountability, and Transparency.
Wong, and Bartlett. 2022. Asymptotic Burnout and Homeostatic Awakening: A Possible Solution to the Fermi Paradox? Journal of The Royal Society Interface.
Zhuang, and Hadfield-Menell. 2021. Consequences of Misaligned AI.”

Footnotes

  1. I thought that effective altruism meta criticism was the idea that ate smart people.↩︎