Developmental interpretability is an emerging subfield within AI interpretability that focuses on understanding how neural networks evolve capabilities during training. Rather than analyzing only fully-trained models as static objects, this approach examines the dynamics of learning, capability emergence, and concept formation throughout the training process. It builds on mechanistic interpretability by adding a temporal dimension.

Much of this work explores scaling behaviour in training dynamics, particularly the phase transition when a model suddenly starts to generalise well.

Cf. the question of when models learn world models.

1 Key Research Directions

Position paper: Lehalleur et al. (2025).

1.1 Mechanistic Phase Transitions

Studies discontinuous capability emergence during training, identifying critical learning thresholds and representation shifts. Key works: TODO.

1.2 Training Dynamics Analysis

Examines gradient behaviours, loss landscapes, and parameter space geometry through frameworks like Singular Learning Theory (SLT).
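
As a concrete example of probing parameter-space geometry in this vein, here is a rough sketch (not any paper's reference implementation) of estimating a local learning coefficient around a trained model via SGLD sampling, the kind of geometry summary used in work such as Wang et al. (2024). The function name, hyperparameters, and the single-batch reference loss are illustrative assumptions.

```python
# Rough sketch: SGLD-based estimate of a local learning coefficient (LLC)
# around trained parameters w*, following the general recipe
#   lambda_hat = n * beta * (E_posterior[L_n(w)] - L_n(w*)),  beta ~ 1/log n.
# Step sizes, localisation strength, and batch handling below are
# illustrative assumptions, not a tuned or canonical implementation.
import copy
import math
import torch


def estimate_llc(model, loss_fn, loader, n_data,
                 eps=1e-5, gamma=100.0, n_steps=500, beta=None):
    beta = 1.0 / math.log(n_data) if beta is None else beta
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)  # random walk on a copy; keep w* fixed
    batch_losses = []

    batches = iter(loader)
    for _ in range(n_steps):
        try:
            x, y = next(batches)
        except StopIteration:
            batches = iter(loader)
            x, y = next(batches)

        sampler.zero_grad()
        loss = loss_fn(sampler(x), y)
        loss.backward()
        batch_losses.append(loss.item())

        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # Langevin step on the localised, tempered posterior:
                # drift = beta * n * grad(L) + gamma * (w - w*), plus noise.
                drift = beta * n_data * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))

    # Reference loss at w*; a single batch is a crude stand-in for L_n(w*).
    with torch.no_grad():
        x, y = next(iter(loader))
        loss_star = loss_fn(model(x), y).item()

    return n_data * beta * (sum(batch_losses) / len(batch_losses) - loss_star)
```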

1.2.1 Developmental Circuits Tracing

Maps the formation of specific computational patterns from initialization onward.

See also (; ; )

1.3 Grokking and Delayed Generalization

Investigates sudden transitions from ‘memorisation’ to ‘understanding’ (; ; ).

TODO: understand how much of the argument leans upon discovering compact circuit representations, and how much upon generalisation, and the relation.
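
For concreteness, here is a minimal sketch of the kind of experiment this literature studies, loosely in the spirit of Power et al. (2022): a small network trained on modular addition with strong weight decay, logging train and test accuracy so that a long delay between memorisation and generalisation is visible if it occurs. The architecture and hyperparameters are illustrative guesses, not the paper's settings, and whether the transition appears depends on them.

```python
# Minimal sketch of a grokking-style experiment: modular addition with a
# small network and strong weight decay. Watch for the gap between train
# accuracy saturating and test accuracy eventually catching up.
import torch
import torch.nn as nn

P = 97  # modulus
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % P

perm = torch.randperm(len(pairs))
split = len(pairs) // 2                      # train on half the pairs
train_idx, test_idx = perm[:split], perm[split:]


class ModAddNet(nn.Module):
    def __init__(self, d=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(P, d)
        self.mlp = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(),
                                 nn.Linear(hidden, P))

    def forward(self, ab):
        return self.mlp(self.embed(ab).flatten(1))


model = ModAddNet()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()


def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()


for step in range(20_000):                   # full-batch training
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        print(f"step {step:6d}  train acc {accuracy(train_idx):.2f}  "
              f"test acc {accuracy(test_idx):.2f}")
```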

1.4 Component Trajectory Analysis

Tracks the evolution of individual neurons/layers through training:

  • Visualizing Deep Network Training Trajectories with PCA (Lorch 2016)
  • Mao et al. (2024)

A biased but credible source says of CTA:

One thing I’d say is the Component Trajectory Analysis […] to my eyes, not very interesting, because PCA on timeseries basically just extracts Lissajous curves and therefore always looks like the same thing. [We can] make more sense of this by applying joint PCA to trajectories which vary in their training distribution.

and adds:

joint trajectory PCA is a powerful method when comparing trajectories across any variational parameter, so long as you are tracking model outputs (~losses) on a consistent dataset of examples. In the Linear Regression work that variational parameter controlled the training distribution and the number of “tasks” in that distribution. You can also vary the model scale, or just the random seed / initialisation, or really anything else

See Carroll et al. (2025).
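
A minimal sketch of what joint trajectory PCA amounts to, assuming per-checkpoint, per-example losses on a fixed probe set have already been collected for several runs that differ in some variational parameter (training distribution, scale, seed, …). The array shape and file name are hypothetical; this is not the code from Carroll et al. (2025).

```python
# Joint trajectory PCA sketch: fit one PCA over the checkpoints of all runs,
# so trajectories from different runs share a coordinate system and can be
# compared. The input file and shapes below are hypothetical.
import numpy as np
from sklearn.decomposition import PCA

# losses[r, t, i] = loss of run r at checkpoint t on probe example i
losses = np.load("per_example_losses.npy")   # shape (n_runs, n_ckpts, n_examples)
n_runs, n_ckpts, n_examples = losses.shape

flat = losses.reshape(n_runs * n_ckpts, n_examples)
pca = PCA(n_components=2).fit(flat)          # one PCA for all runs jointly
coords = pca.transform(flat).reshape(n_runs, n_ckpts, 2)

# coords[r, :, 0] vs coords[r, :, 1] traces run r's development; overlaying
# the runs shows how the variational parameter bends the trajectory.
```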

1.5 Curriculum and Data Influences

Studies how training data order/selection impacts capability development. TODO

1.6 Statistical Mechanics of Learning

Statistical mechanics of learning provides a framework for understanding the emergent properties of learning systems, including phase transitions and critical phenomena. Much of the machinery used in this area seems to come from Singular Learning Theory.

1.7 SLT Foundations

There is much work connecting Singular Learning Theory to deep-learning training dynamics, so much so that developmental interpretability and SLT are sometimes used almost interchangeably.

This body of work argues that SLT explains:

  • Discontinuous capability emergence through bifurcations in loss landscape geometry
  • Bayesian posterior phase transitions in SGD-trained networks
  • Fundamental connections between model complexity and generalisation

More at Singular Learning Theory.
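
A rough statement of the free-energy picture behind these claims, in standard SLT notation (following Watanabe 2022) rather than anything defined above:

```latex
% Asymptotic expansion of the Bayes free energy in a singular model,
% for a region of parameter space around an optimum w^*:
F_n \;\approx\; n\,L_n(w^*) \;+\; \lambda \log n
% L_n(w^*): empirical loss attained in that region;
% \lambda: its learning coefficient (RLCT), an effective complexity.
```

Different regions trade off accuracy (L) against complexity (λ) differently, so as n grows the region with minimal free energy can change abruptly; such a switch is the Bayesian posterior phase transition referred to above.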

2 Incoming

3 References

Berti, Giorgi, and Kasneci. 2025. “Emergent Abilities in Large Language Models: A Survey.”
Carroll. 2021. “Phase Transitions in Neural Networks.”
Carroll, Hoogland, Farrugia-Roberts, et al. 2025. “Dynamics of Transient Structure in In-Context Linear Regression Transformers.”
Chen, Xin, As, and Krause. 2025. “Learning Safety Constraints for Large Language Models.”
Chen, Zhongtian, Lau, Mendel, et al. 2023. “Dynamical Versus Bayesian Phase Transitions in a Toy Model of Superposition.”
Elhage, Hume, Olsson, et al. 2022. “Toy Models of Superposition.”
Lehalleur, Hoogland, Farrugia-Roberts, et al. 2025. “You Are What You Eat — AI Alignment Requires Understanding How Data Shapes Structure and Generalisation.”
Liu, Kitouni, Nolte, et al. 2022. “Towards Understanding Grokking: An Effective Theory of Representation Learning.” Advances in Neural Information Processing Systems.
Liu, Michaud, and Tegmark. 2023. “Omnigrok: Grokking Beyond Algorithmic Data.”
Lorch. 2016. “Visualizing Deep Network Training Trajectories with PCA.”
Mao, Griniasty, Teoh, et al. 2024. “The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold.” Proceedings of the National Academy of Sciences.
Nanda, Chan, Lieberum, et al. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.”
Olah, Cammarata, Schubert, et al. 2020. “Zoom In: An Introduction to Circuits.” Distill.
Plum and Serra. 2025. “Dynamical Systems of Fate and Form in Development.” Seminars in Cell & Developmental Biology.
Power, Burda, Edwards, et al. 2022. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets.”
Teehan, Clinciu, Serikov, et al. 2022. “Emergent Structures and Training Dynamics in Large Language Models.” In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models.
Wang, Farrugia-Roberts, Hoogland, et al. 2024. “Loss Landscape Geometry Reveals Stagewise Development of Transformers.”
Watanabe. 2022. “Recent Advances in Algebraic Geometry and Bayesian Statistics.”
Wei, Tay, Bommasani, et al. 2022. “Emergent Abilities of Large Language Models.”