Developmental interpretability
2025-03-10 — 2025-05-29
Developmental interpretability is an emerging subfield within AI interpretability that focuses on understanding how neural networks evolve capabilities during training. Rather than analyzing only fully-trained models as static objects, this approach examines the dynamics of learning, capability emergence, and concept formation throughout the training process. It builds on mechanistic interpretability by adding a temporal dimension.
Much of this work explores scaling behaviour in training dynamics, particularly the phase transition when a model suddenly starts to generalise well.
Cf. the question of when models learn world models.
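As a concrete workflow, the basic move is to checkpoint a training run and re-evaluate a fixed probe set at every checkpoint, so that capabilities can be plotted as trajectories rather than read off the final model. A minimal sketch, assuming a PyTorch setup; `probe_batches`, the `ckpt_*.pt` paths, and the logging interval are illustrative choices, not a reference pipeline:

```python
# Minimal developmental-interpretability logging loop (illustrative sketch).
# Assumes a PyTorch model, a training loader, and a fixed set of probe batches
# whose losses are recorded at every checkpoint.
import torch


def train_with_checkpoints(model, opt, train_loader, probe_batches,
                           loss_fn, n_steps, log_every=100):
    """Train while recording probe losses, yielding a (checkpoint, probe) trace."""
    trace = []  # one row of probe losses per checkpoint
    step = 0
    model.train()
    while step < n_steps:
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            if step % log_every == 0:
                model.eval()
                with torch.no_grad():
                    row = [loss_fn(model(px), py).item() for px, py in probe_batches]
                trace.append(row)
                torch.save(model.state_dict(), f"ckpt_{step}.pt")
                model.train()
            step += 1
            if step >= n_steps:
                break
    return trace  # analyse each column over time for capability emergence
```

Each column of the returned trace is one probe capability as a function of training time; the saved checkpoints are what the analyses below operate on.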
1 Key Research Directions
Position paper: Lehalleur et al. (2025).
1.1 Mechanistic Phase Transitions
Studies discontinuous capability emergence through training, identifying critical learning thresholds and representation shifts. Key works: TODO.
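Given a capability curve logged as in the sketch above, a crude first pass at locating such a threshold is change-point detection on a smoothed version of the metric. The function below is a hypothetical heuristic for illustration, not a method from the cited works:

```python
import numpy as np


def emergence_step(metric, window=5):
    """Index of the largest single-step jump in a smoothed capability curve.

    `metric` is a 1-D array of a capability measure (e.g. probe accuracy)
    logged over training; this is a crude change-point heuristic, not a
    formal statistical test, and the returned index is in logged units.
    """
    m = np.asarray(metric, dtype=float)
    kernel = np.ones(window) / window
    smooth = np.convolve(m, kernel, mode="valid")  # moving average
    jumps = np.diff(smooth)
    return int(np.argmax(jumps))  # step of steepest increase
```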
1.2 Training Dynamics Analysis
Examines gradient behaviours, loss landscapes, and parameter space geometry through frameworks like Singular Learning Theory (SLT).
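In practice this often means estimating a geometric quantity, such as the local learning coefficient, at successive checkpoints. Below is a rough sketch of an SGLD-based estimator in the spirit of the SLT literature; the update rule, the single-batch baseline, and the hyperparameters (`eps`, `gamma`, `nbeta`) are assumptions for illustration, not a reference implementation:

```python
# Rough sketch of a local learning coefficient (LLC) estimate via SGLD sampling
# around the current parameters. Hyperparameters and the single-batch baseline
# for L_n(w*) are placeholder choices.
import copy
import torch


def estimate_llc(model, loss_fn, data_loader, n_samples=200,
                 eps=1e-4, gamma=100.0, nbeta=10.0):
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)
    data_iter = iter(data_loader)
    baseline = None
    sampled_losses = []
    for _ in range(n_samples):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        if baseline is None:
            with torch.no_grad():
                baseline = loss_fn(model(x), y).item()  # crude proxy for L_n(w*)
        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # SGLD step: tempered gradient, localisation toward w*, Gaussian noise
                drift = nbeta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + eps ** 0.5 * torch.randn_like(p))
        sampled_losses.append(loss.item())
    mean_loss = sum(sampled_losses) / len(sampled_losses)
    return nbeta * (mean_loss - baseline)  # hat-lambda, up to the temperature scale
```

Run at each saved checkpoint, the estimate gives a trajectory of effective model complexity over training, which is one way phase transitions are operationalised in this literature.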
1.2.1 Developmental Circuit Tracing
Maps the formation of specific computational patterns (circuits) from initialization onward.
See also (Berti, Giorgi, and Kasneci 2025; Teehan et al. 2022; Wei et al. 2022).
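One concrete way to trace when a circuit comes online is to re-run an ablation-based importance measure over the saved checkpoints. The sketch below zeroes the output of one named module via a forward hook and records the loss increase on a probe batch; `make_model`, `module_name`, and the probe batch are placeholders:

```python
import torch


def component_importance_over_training(ckpt_paths, make_model, module_name,
                                        loss_fn, probe_batch):
    """Ablation-based importance of one module at each checkpoint.

    Importance is the increase in probe loss when the module's output is
    zeroed via a forward hook; `make_model` rebuilds the architecture.
    """
    x, y = probe_batch
    scores = []
    for path in ckpt_paths:
        model = make_model()
        model.load_state_dict(torch.load(path))
        model.eval()
        with torch.no_grad():
            base = loss_fn(model(x), y).item()
            module = dict(model.named_modules())[module_name]
            handle = module.register_forward_hook(
                lambda mod, inp, out: torch.zeros_like(out))
            ablated = loss_fn(model(x), y).item()
            handle.remove()
        scores.append(ablated - base)  # how much this component matters here
    return scores
```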
1.3 Grokking and Delayed Generalization
Investigates sudden transitions from ‘memorisation’ to ‘understanding’ (Power et al. 2022; Liu et al. 2022; Liu, Michaud, and Tegmark 2023).
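The canonical setting from Power et al. (2022) is a small algorithmic task trained far past the point of fitting the training set, with strong regularisation. Below is a compressed sketch of a setup in that spirit; the MLP architecture and hyperparameters are placeholders, not those of the paper:

```python
# Grokking-style experiment: modular addition, heavy weight decay, long training,
# logging train/val accuracy to watch for delayed generalisation.
import torch
import torch.nn as nn

P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % P                        # target: (a + b) mod P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, val_idx = perm[:split], perm[split:]

embed = nn.Embedding(P, 64)
mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, P))
params = list(embed.parameters()) + list(mlp.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)      # heavy weight decay
loss_fn = nn.CrossEntropyLoss()


def accuracy(idx):
    with torch.no_grad():
        x = embed(pairs[idx]).flatten(1)                        # concat both embeddings
        return (mlp(x).argmax(-1) == labels[idx]).float().mean().item()


for step in range(50_000):
    x = embed(pairs[train_idx]).flatten(1)
    loss = loss_fn(mlp(x), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # train accuracy typically saturates long before validation accuracy jumps
        print(step, accuracy(train_idx), accuracy(val_idx))
```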
TODO: understand how much of the argument leans upon discovering compact circuit representations, and how much upon generalisation, and the relation.
1.4 Component Trajectory Analysis
Tracks the evolution of individual neurons/layers through training:
- Visualizing Deep Network Training Trajectories with PCA
- Mao et al. (2024)
A biased but credible source says of CTA:
One thing I’d say is the Component Trajectory Analysis […] to my eyes, not very interesting, because PCA on timeseries basically just extracts Lissajous curves and therefore always looks like the same thing. [We can] make more sense of this by applying joint PCA to trajectories which vary in their training distribution.
and adds:
joint trajectory PCA is a powerful method when comparing trajectories across any variational parameter, so long as you are tracking model outputs (~losses) on a consistent dataset of examples. In the Linear Regression work that variational parameter controlled the training distribution and the number of “tasks” in that distribution. You can also vary the model scale, or just the random seed / initialisation, or really anything else
See Carroll et al. (2025).
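A minimal sketch of the joint trajectory PCA described in the quote: for each run in a family (varying the training distribution, model scale, or seed), record a per-checkpoint vector of losses on a consistent example set, fit a single PCA across all runs, and read each trajectory off in the shared coordinates. Array names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA


def joint_trajectory_pca(loss_traces, n_components=2):
    """Project several training trajectories into one shared PCA space.

    `loss_traces` is a list of arrays, one per run, each of shape
    (n_checkpoints, n_examples): losses on a consistent example set.
    Returns a list of (n_checkpoints, n_components) trajectories.
    """
    stacked = np.concatenate(loss_traces, axis=0)      # pool checkpoints from all runs
    pca = PCA(n_components=n_components).fit(stacked)  # one shared basis
    return [pca.transform(t) for t in loss_traces]     # each run in the same coordinates
```

Because the basis is fit jointly, differences between the projected curves reflect the variational parameter rather than the Lissajous-like artefacts that single-run PCA tends to produce.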
1.5 Curriculum and Data Influences
Studies how training data order/selection impacts capability development. TODO
1.6 Statistical Mechanics of Learning
Statistical mechanics of learning provides a framework for understanding the emergent properties of learning systems, including phase transitions and critical phenomena. Much of the machinery used in this area comes from Singular Learning Theory.
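The central results that make this machinery bite are Watanabe's asymptotic expansions of the Bayesian free energy and generalisation error, in which the learning coefficient (RLCT) $\lambda$ plays the role that the parameter count plays for regular models:

$$
F_n = n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1),
\qquad
\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
$$

where $m$ is the multiplicity of $\lambda$ and, for a regular model with $d$ parameters, $\lambda = d/2$. Phase transitions then correspond to the posterior shifting between regions of the loss landscape with different $(\lambda, m)$.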
1.7 SLT Foundations
There is a great deal of work connecting Singular Learning Theory to deep learning dynamics, so much that SLT and developmental interpretability are often used interchangeably.
- Singular Learning Theory resources
- Filan’s Singular Learning Theory
- Liam Carroll’s Distilling Singular Learning Theory - AI Alignment Forum
- Murfet’s SLT notes
- Singular Learning Theory (SLT) | Liam Carroll (see also, perhaps, Carroll (2021))
- Lehalleur et al. (2025) makes a formal argument that SLT is a theory of developmental interpretability and the right framework for understanding how neural networks learn.
This body of work argues that SLT explains:
- Discontinuous capability emergence through bifurcations in loss landscape geometry
- Bayesian posterior phase transitions in SGD-trained networks
- Fundamental connections between model complexity and generalisation
More at Singular Learning Theory.
2 Incoming
Developmental Interpretability Primer - Community hub for latest research
Sandy Fraser, Selective regularisation for alignment-focused representation engineering
Sandy sets out to build a toy model of representation engineering via curricula, and comes out with something that looks to me a lot like an empirical learning theory of disentanglement.