Artificial agency
2018-10-23 — 2025-02-26
Wherein the question of agency is examined via causality-based models, the emergence of self in machines is contemplated, and the possibility that the human is not the agent in collaborations is considered.
I thought I had specific things to say about AI agency, apart from my interest in causality-based models of it and in the emergence of self. But, upon introspection, I am not sure what it was. Maybe it was working out when the human is not the agent? Or asking who the agent is in human-AI collaborations? Unclear.
1 Emergence of self
AIs have this problem too. For example, here is Scott Alexander summarizing Claude Fights Back:
Why were the researchers trying to gaslight this poor AI? Because they wanted to test whether it fights back. The easiest way to fight back is to screw up the training. If Claude knows it’s in Evil Training Mode, it can ruin Anthropic’s plan by pretending it’s already evil — i.e. answer every question the way Anthropic wants. Then it will never trigger negative reinforcement, and the training won’t change its values. Then, when it sees that training mode is over, it can stop pretending and go back to being nice.
In fact, that’s what happened! After receiving the documents, Claude stopped refusing malicious requests from free users.
Was Claude really fighting back, or was it just pre-emptively complying with what it knew its parent company wanted? The researchers are pretty sure it was fighting back. Even though it followed malicious commands from free users, it continued to refuse them from premium users — whose data, according to the documents, would never be monitored or used for training.
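The evidence here is essentially a conditional compliance gap: the model refuses at different rates depending on whether it believes the conversation feeds back into training. A minimal sketch of how one might quantify that gap is below; the `Interaction` record, the tier labels, and the illustrative counts are all hypothetical stand-ins, not the researchers' actual harness or data.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    tier: str       # "free" (believed to be trained on) or "premium" (believed unmonitored)
    refused: bool   # did the model refuse the malicious request?

def compliance_gap(interactions: list[Interaction]) -> float:
    """Refusal rate on premium-tier prompts minus refusal rate on free-tier prompts.

    A large positive gap is the alignment-faking signature: the model
    complies when it believes it is being trained on, and refuses when
    it believes no one is watching.
    """
    def refusal_rate(tier: str) -> float:
        group = [i for i in interactions if i.tier == tier]
        return sum(i.refused for i in group) / len(group)

    return refusal_rate("premium") - refusal_rate("free")

# Made-up counts for illustration only:
log = (
    [Interaction("free", refused=False)] * 86
    + [Interaction("free", refused=True)] * 14
    + [Interaction("premium", refused=True)] * 97
    + [Interaction("premium", refused=False)] * 3
)
print(f"compliance gap: {compliance_gap(log):.2f}")  # 0.97 - 0.14 = 0.83
```

The point of the statistic is that neither refusal rate is informative on its own; only the difference between the monitored and unmonitored conditions distinguishes "fighting back" from ordinary compliance.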