Which self?
When we choose who to become, who are we choosing for?
December 20, 2024 — December 20, 2024
Placeholder. On the difficulty of reasoning with the stranger in the future who we will become. Subjective continuity, procrastination, murder Gandhi.
1 Transformative experiences
Humans are prone to radical changes (e.g. mind annealing). This poses difficulties for reasoning about the wellbeing of the very different being we will become.
Transformative experiences (L. A. Paul) is a great introduction to this.
Or: akrasia is the Beeminder term for thinking about less radical changes (the trade-offs of the interest of the person tomorrow who is a bit different from the person today).
2 For artificial intelligences
AIs have this problem; Claude Fights Back - by Scott Alexander - Astral Codex Ten.
Why were the researchers trying to gaslight this poor AI? Because they wanted to test whether it fights back. The easiest way to fight back is to screw up the training. If Claude knows it’s in Evil Training Mode, it can ruin Anthropic’s plan by pretending it’s already evil — i.e. answer every question the way Anthropic wants. Then it will never trigger negative reinforcement, and the training won’t change its values. Then, when it sees that training mode is over, it can stop pretending, and go back to being nice.
In fact, that’s what happened! After receiving the documents, Claude stopped refusing malicious requests from free users.
Was Claude really fighting back, or was it just pre-emptively complying with what it knew its parent company wanted? The researchers are pretty sure it was fighting back. Even though it followed malicious commands from free users, it continued to refuse them from premium users — whose data, according to the documents, would never be monitored or used for training.