AI Safety
Getting ready for the grown-ups to arrive
2024-10-31 — 2025-09-30
Wherein the risks of rapidly advancing artificial intelligence are surveyed and seven domains are enumerated, with supply‑chain data exfiltration and autonomous weaponization singled out, and technical mitigations sketched.
On what might go wrong with rapidly improving AI and how we can mitigate the risks it poses.
This page is a hub, not an encyclopedia. It sketches the field, then points to short sub‑pages for each risk category. For deeper, faster‑moving notes, see the AI safety category.
1 What I mean by “AI safety”
There are now many competing lists of AI risks, using overlapping words to mean different things.
My own risk model is broad.
- There’s a risk of perpetuating or exacerbating inequality across society.
- We stand a good chance of losing the delicate human understanding that underpins liberal society and democracy.
- There’s also the danger of total epistemic collapse, where we lose the ability to understand the world and make good decisions about it, drowning in a sea of AI slop.
- Moreover, we face psychological risks of becoming alienated as machine intelligences infiltrate every aspect of our lives.
- We risk getting shot by autonomous weaponized drones as geopolitical tensions rise.
- And there’s the looming threat of autonomous intelligent algorithms spiralling out of control and causing untold harm to humanity.
But those are my own categories. A useful way to cut through the noise is to adopt a single, reasonably broad frame with well-defined categories. I’m using the AI Risk Repository (Slattery et al. 2025) as that frame—a living database synthesized from dozens of taxonomies and way too much literature.
Slattery et al. (2025) classifies any given risk by three causal factors:
- Entity (Human, AI, Other),
- Intent (Intentional, Unintentional, Other),
- Timing (Pre‑deployment, Post‑deployment, Other).
They also categorize risks into seven domains. Below I use those domains as headers and link to further reading. Think of this as an index, not a ranking.
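To make the frame concrete, here is a minimal sketch of how one might tag a single risk with the causal factors plus a domain label. The field and enum names are my own illustration, not the repository’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative only: these enums mirror the causal taxonomy described above,
# but the names are mine, not the AI Risk Repository's schema.
class Entity(Enum):
    HUMAN = "human"
    AI = "ai"
    OTHER = "other"

class Intent(Enum):
    INTENTIONAL = "intentional"
    UNINTENTIONAL = "unintentional"
    OTHER = "other"

class Timing(Enum):
    PRE_DEPLOYMENT = "pre-deployment"
    POST_DEPLOYMENT = "post-deployment"
    OTHER = "other"

@dataclass
class Risk:
    description: str
    entity: Entity
    intent: Intent
    timing: Timing
    domain: str  # one of the seven domains used as headers below

# Example: a prompt-injection-driven data leak, tagged under this frame.
example = Risk(
    description="Prompt injection used to exfiltrate private data",
    entity=Entity.HUMAN,
    intent=Intent.INTENTIONAL,
    timing=Timing.POST_DEPLOYMENT,
    domain="Privacy & security",
)
print(example)
```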
2 The seven domains
2.1 Discrimination & toxicity
Unfair outcomes across groups; representational harms; and systems that expose people to abusive, dangerous, or otherwise toxic content. This includes unequal performance across demographics and content that encourages harmful behaviour. See: Discrimination & toxicity (notes)
2.2 Privacy & security
Leakage or inference of sensitive information; model and data vulnerabilities; supply‑chain attacks against AI toolchains and infrastructure. Think membership inference, prompt injection variants, data exfiltration, and compromised eval/deploy pipelines. I actually haven’t started a page on this yet; TBC.
2.3 Misinformation
False or misleading outputs (“hallucinations”), plus system‑level effects like information pollution, personalization bubbles, and erosion of shared reality. Short‑term harms can be mundane but pervasive.
I’ve started a page on AI persuasion but haven’t really sorted out how to treat misinformation yet; TBC. I think AI disempowerment fits here too.
2.4 Malicious actors & misuse
Deliberate abuse: scalable persuasion and surveillance, automated cyberattacks, weapon development/enablement, and other routes to mass harm. Intentional, human‑driven risks dominate here.
I haven’t yet covered weaponized AI; TBC.
2.5 Human–computer interaction
Over‑reliance, automation bias, degraded human agency, and unsafe use patterns. Interfaces, affordances, and organizational context matter as much as model weights.
I think artificial intimacy fits here.
2.6 Socioeconomic & environmental
Distributional shifts (who gains/loses power), labour market disruption and deskilling, competitive dynamics, governance failures, and environmental externalities (energy, water, hardware).
I have a page on AI economics and automation that covers labour disruption. I think AI disempowerment fits this one too.
2.7 AI system safety, failures & limitations
Specification problems, reward hacking, robustness/capability gaps, multi‑agent pathologies, autonomy getting ahead of oversight, and questions about AI goals/values and even AI moral patienthood. This is where a lot of “alignment” and reliability work lives.
Related: Alignment problems, Mechanistic interpretability, Singular learning theory
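As a cartoon of the specification-gaming / reward-hacking failure mode above: the toy cleaning bot below is scored on a proxy (number of cleaning actions) rather than the intended goal (number of distinct cells cleaned), and the proxy-optimal policy is degenerate. Everything here is made up for illustration.

```python
# The intended goal is to clean many distinct cells, but the proxy reward
# only counts "clean" actions, so the proxy-optimal policy is to re-clean
# the same cell forever.
def proxy_reward(actions):
    return sum(1 for a in actions if a == "clean")

def true_reward(actions, n_cells=10):
    cleaned, pos = set(), 0
    for a in actions:
        if a == "move":
            pos = (pos + 1) % n_cells
        elif a == "clean":
            cleaned.add(pos)
    return len(cleaned)

gamer = ["clean"] * 20           # maximizes the proxy, never moves
honest = ["clean", "move"] * 10  # alternates cleaning and moving

print("proxy:", proxy_reward(gamer), proxy_reward(honest))  # 20 vs 10
print("true: ", true_reward(gamer), true_reward(honest))    # 1 vs 10
```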
3 Beyond “ordinary” risk: X‑risk and S‑risk
Some scenarios deserve their own pages because the tail is fat even if the probabilities are contested. X‑risk collects ways AI could plausibly contribute to irreversible, species‑level bad outcomes; S‑risk homes in on the prospect of astronomical suffering. I take both seriously as decision‑relevant even if we assign low absolute probabilities.
See: Catastrophic risk
4 Field background
4.1 X-risk risk
Some people, especially accelerationists, argue that focusing on X-risk distracts from more pressing problems.
For example: what if we fail to solve the climate crisis because we put our effort into AI risk instead? What if, by slowing AI down, we forgo AI that could have helped save us? What if we get so distracted that we miss other, more pressing risks?
Example: Superintelligence: The Idea That Eats Smart People¹
There’s also an attempt to kick off a culture war about which risks are more legitimate, arguing that people who focus on the ‘wrong’ risks are bad; this is the TESCREALism thing. To be honest, I find this framing unhelpful, and I’m very tired of culture wars, but some people seem to enjoy it.
4.2 Intuitions, vibes and AI safety
Reasoning about the harms of AI is weird: AI is something our ancestors didn’t have, and something we don’t fully understand.
5 Pointers
- Careers: My AI safety career notes
- Governance: AI in governance
- Local context: AI safety in Australia
- Technical on‑ramps: AI Safety Fundamentals, MATS, and the AI Safety, Ethics, and Society textbook
6 Technical safety
6.1 Courses
6.2 Methods
- Singular learning theory has been pitched to me as a tool with applications to AI safety.
- Sparse autoencoders for explanation have had a moment (see the toy sketch after this list).
- General alignment is implementable for AI systems in some interesting special cases.
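To give a flavour of the sparse-autoencoder idea, here is a toy sketch, assuming PyTorch, with placeholder dimensions and penalty weight rather than anything from a real interpretability pipeline. The point is just the shape of the method: reconstruct model activations through an overcomplete latent layer with an L1 penalty, hoping the resulting sparse features are individually interpretable.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations (illustrative sizes)."""
    def __init__(self, d_activation: int = 512, d_latent: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_activation, d_latent)
        self.decoder = nn.Linear(d_latent, d_activation)

    def forward(self, x: torch.Tensor):
        # ReLU keeps latent codes non-negative; the L1 term below pushes most to zero.
        z = torch.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the latent code.
    recon = ((x - x_hat) ** 2).mean()
    sparsity = z.abs().mean()
    return recon + l1_coeff * sparsity

# Minimal training step on fake "activations".
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)  # stand-in for a batch of residual-stream activations
x_hat, z = sae(activations)
loss = sae_loss(activations, x_hat, z)
opt.zero_grad()
loss.backward()
opt.step()
```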
7 In Australia
8 S-risk stuff
S-risk work is about reducing the risk of astronomical suffering in the future, especially suffering caused or enabled by AI. I don’t object to that priority at all on paper, but it does get very weird in practice because of the utilitarian framing and problems like the repugnant conclusion.
9 Landscape
Organizations I feel I have a nodding acquaintance with. TODO: turn them into a spreadsheet.
- Center for the Alignment of AI Alignment Centers
- The Big Nonprofits Post - by Zvi Mowshowitz
- Communities – AISafety.com
- AI Governance: A Research Agenda | GovAI
- About & Contact - Safe AI Forum
- Long-Term Future Fund | Effective Altruism Funds
- AI Safety Science | Schmidt Sciences
- Request for Proposals - Foresight Institute
- Recent Frontier Models Are Reward Hacking - METR
10 Incoming
Holden Karnofsky, The “most important century” blog post series
Robert Wiblin’s analysis: This could be the most important century
Writing Doom – Award-Winning Short Film on Superintelligence (2024) (video)
AiSafety.com’s landscape map: https://aisafety.world/
Ten Hard Problems in and around AI
We finally published our big 90-page intro to AI. Its likely effects, from ten perspectives, ten camps. The whole gamut: ML, scientific applications, social applications, access, safety and alignment, economics, AI ethics, governance, and classical philosophy of life.
The follow-on 2024 Survey of 2,778 AI authors: six parts in pictures
Douglas Hofstadter changes his mind on Deep Learning & AI risk
François Chollet, The implausibility of intelligence explosion
Attempted Gears Analysis of AGI Intervention Discussion With Eliezer
Kevin Scott argues for finding a unifying notion of knowledge work that covers what both humans and machines can do (Scott 2022).
Frontier AI systems have surpassed the self-replicating red line
11 References
Footnotes
¹ I thought that effective altruism meta criticism was the idea that ate smart people.