Developmental interpretability
2025-03-10 — 2025-05-29
Developmental interpretability is an emerging subfield within AI interpretability that focuses on understanding how neural networks evolve capabilities during training. Rather than analyzing only fully-trained models as static objects, this approach examines the dynamics of learning, capability emergence, and concept formation throughout the training process. It builds on mechanistic interpretability by adding a temporal dimension.
Much of this work explores scaling behaviour in training dynamics, particularly the phase transition when a model suddenly starts to generalise well.
Cf. the question of when models learn world models.
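As a concrete workflow, the basic move is to checkpoint a training run and re-evaluate a fixed probe set at every checkpoint, so that capabilities can be plotted as trajectories rather than read off the final model. A minimal sketch, assuming a PyTorch setup; `probe_batches`, the `ckpt_*.pt` paths, and the logging interval are illustrative choices, not a reference pipeline:

```python
# Minimal developmental-interpretability logging loop (illustrative sketch).
# Assumes a PyTorch model, a training loader, and a fixed set of probe batches
# whose losses are recorded at every checkpoint.
import torch


def train_with_checkpoints(model, opt, train_loader, probe_batches,
                           loss_fn, n_steps, log_every=100):
    """Train while recording probe losses, yielding a (checkpoint, probe) trace."""
    trace = []  # one row of probe losses per checkpoint
    step = 0
    model.train()
    while step < n_steps:
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            if step % log_every == 0:
                model.eval()
                with torch.no_grad():
                    row = [loss_fn(model(px), py).item() for px, py in probe_batches]
                trace.append(row)
                torch.save(model.state_dict(), f"ckpt_{step}.pt")
                model.train()
            step += 1
            if step >= n_steps:
                break
    return trace  # analyse each column over time for capability emergence
```

Each column of the returned trace is one probe capability as a function of training time; the saved checkpoints are what the analyses below operate on.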
1 Key Research Directions
Position paper: Lehalleur et al. (2025).
1.1 Mechanistic Phase Transitions
Studies discontinuous capability emergence through training, identifying critical learning thresholds and representation shifts. Key works: TODO.
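Given a capability curve logged as in the sketch above, a crude first pass at locating such a threshold is change-point detection on a smoothed version of the metric. The function below is a hypothetical heuristic for illustration, not a method from the cited works:

```python
import numpy as np


def emergence_step(metric, window=5):
    """Index of the largest single-step jump in a smoothed capability curve.

    `metric` is a 1-D array of a capability measure (e.g. probe accuracy)
    logged over training; this is a crude change-point heuristic, not a
    formal statistical test, and the returned index is in logged units.
    """
    m = np.asarray(metric, dtype=float)
    kernel = np.ones(window) / window
    smooth = np.convolve(m, kernel, mode="valid")  # moving average
    jumps = np.diff(smooth)
    return int(np.argmax(jumps))  # step of steepest increase
```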
1.2 Training Dynamics Analysis
Examines gradient behaviours, loss landscapes, and parameter space geometry through frameworks like Singular Learning Theory (SLT).
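In practice this often means estimating a geometric quantity, such as the local learning coefficient, at successive checkpoints. Below is a rough sketch of an SGLD-based estimator in the spirit of the SLT literature; the update rule, the single-batch baseline, and the hyperparameters (`eps`, `gamma`, `nbeta`) are assumptions for illustration, not a reference implementation:

```python
# Rough sketch of a local learning coefficient (LLC) estimate via SGLD sampling
# around the current parameters. Hyperparameters and the single-batch baseline
# for L_n(w*) are placeholder choices.
import copy
import torch


def estimate_llc(model, loss_fn, data_loader, n_samples=200,
                 eps=1e-4, gamma=100.0, nbeta=10.0):
    w_star = [p.detach().clone() for p in model.parameters()]
    sampler = copy.deepcopy(model)
    data_iter = iter(data_loader)
    baseline = None
    sampled_losses = []
    for _ in range(n_samples):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        if baseline is None:
            with torch.no_grad():
                baseline = loss_fn(model(x), y).item()  # crude proxy for L_n(w*)
        loss = loss_fn(sampler(x), y)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # SGLD step: tempered gradient, localisation toward w*, Gaussian noise
                drift = nbeta * p.grad + gamma * (p - p0)
                p.add_(-0.5 * eps * drift + eps ** 0.5 * torch.randn_like(p))
        sampled_losses.append(loss.item())
    mean_loss = sum(sampled_losses) / len(sampled_losses)
    return nbeta * (mean_loss - baseline)  # hat-lambda, up to the temperature scale
```

Run at each saved checkpoint, the estimate gives a trajectory of effective model complexity over training, which is one way phase transitions are operationalised in this literature.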
1.2.1 Developmental Circuit Tracing
Maps the formation of specific computational patterns (circuits) from initialization onward.
See also (Berti, Giorgi, and Kasneci 2025; Teehan et al. 2022; Wei et al. 2022).
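One concrete way to trace when a circuit comes online is to re-run an ablation-based importance measure over the saved checkpoints. The sketch below zeroes the output of one named module via a forward hook and records the loss increase on a probe batch; `make_model`, `module_name`, and the probe batch are placeholders:

```python
import torch


def component_importance_over_training(ckpt_paths, make_model, module_name,
                                        loss_fn, probe_batch):
    """Ablation-based importance of one module at each checkpoint.

    Importance is the increase in probe loss when the module's output is
    zeroed via a forward hook; `make_model` rebuilds the architecture.
    """
    x, y = probe_batch
    scores = []
    for path in ckpt_paths:
        model = make_model()
        model.load_state_dict(torch.load(path))
        model.eval()
        with torch.no_grad():
            base = loss_fn(model(x), y).item()
            module = dict(model.named_modules())[module_name]
            handle = module.register_forward_hook(
                lambda mod, inp, out: torch.zeros_like(out))
            ablated = loss_fn(model(x), y).item()
            handle.remove()
        scores.append(ablated - base)  # how much this component matters here
    return scores
```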
1.3 Grokking and Delayed Generalization
Investigates sudden transitions from ‘memorisation’ to ‘understanding’ (Power et al. 2022; Liu et al. 2022; Liu, Michaud, and Tegmark 2023).
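The canonical setting from Power et al. (2022) is a small algorithmic task trained far past the point of fitting the training set, with strong regularisation. Below is a compressed sketch of a setup in that spirit; the MLP architecture and hyperparameters are placeholders, not those of the paper:

```python
# Grokking-style experiment: modular addition, heavy weight decay, long training,
# logging train/val accuracy to watch for delayed generalisation.
import torch
import torch.nn as nn

P = 97
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))  # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % P                        # target: (a + b) mod P
perm = torch.randperm(len(pairs))
split = len(pairs) // 2
train_idx, val_idx = perm[:split], perm[split:]

embed = nn.Embedding(P, 64)
mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, P))
params = list(embed.parameters()) + list(mlp.parameters())
opt = torch.optim.AdamW(params, lr=1e-3, weight_decay=1.0)      # heavy weight decay
loss_fn = nn.CrossEntropyLoss()


def accuracy(idx):
    with torch.no_grad():
        x = embed(pairs[idx]).flatten(1)                        # concat both embeddings
        return (mlp(x).argmax(-1) == labels[idx]).float().mean().item()


for step in range(50_000):
    x = embed(pairs[train_idx]).flatten(1)
    loss = loss_fn(mlp(x), labels[train_idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # train accuracy typically saturates long before validation accuracy jumps
        print(step, accuracy(train_idx), accuracy(val_idx))
```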
TODO: understand how much of the argument leans upon discovering compact circuit representations, and how much upon generalisation, and the relation.
1.4 Component Trajectory Analysis
Tracks the evolution of individual neurons/layers through training:
- Visualizing Deep Network Training Trajectories with PCA
- Mao et al. (2024)
A biased but credible source says of CTA:
One thing I’d say is the Component Trajectory Analysis […] to my eyes, not very interesting, because PCA on timeseries basically just extracts Lissajous curves and therefore always looks like the same thing. [We can] make more sense of this by applying joint PCA to trajectories which vary in their training distribution.
and adds:
joint trajectory PCA is a powerful method when comparing trajectories across any variational parameter, so long as you are tracking model outputs (~losses) on a consistent dataset of examples. In the Linear Regression work that variational parameter controlled the training distribution and the number of “tasks” in that distribution. You can also vary the model scale, or just the random seed / initialisation, or really anything else
See Carroll et al. (2025).
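A minimal sketch of the joint trajectory PCA described in the quote: for each run in a family (varying the training distribution, model scale, or seed), record a per-checkpoint vector of losses on a consistent example set, fit a single PCA across all runs, and read each trajectory off in the shared coordinates. Array names are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA


def joint_trajectory_pca(loss_traces, n_components=2):
    """Project several training trajectories into one shared PCA space.

    `loss_traces` is a list of arrays, one per run, each of shape
    (n_checkpoints, n_examples): losses on a consistent example set.
    Returns a list of (n_checkpoints, n_components) trajectories.
    """
    stacked = np.concatenate(loss_traces, axis=0)      # pool checkpoints from all runs
    pca = PCA(n_components=n_components).fit(stacked)  # one shared basis
    return [pca.transform(t) for t in loss_traces]     # each run in the same coordinates
```

Because the basis is fit jointly, differences between the projected curves reflect the variational parameter rather than the Lissajous-like artefacts that single-run PCA tends to produce.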
1.5 Curriculum and Data Influences
Studies how training data order/selection impacts capability development. TODO
1.6 Statistical Mechanics of Learning
Statistical mechanics of learning provides a framework for understanding the emergent properties of learning systems, including phase transitions and critical phenomena. Much of the machinery used in this area comes from Singular Learning Theory.
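The central results that make this machinery bite are Watanabe's asymptotic expansions of the Bayesian free energy and generalisation error, in which the learning coefficient (RLCT) $\lambda$ plays the role that the parameter count plays for regular models:

$$
F_n = n L_n(w_0) + \lambda \log n - (m - 1)\log\log n + O_p(1),
\qquad
\mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
$$

where $m$ is the multiplicity of $\lambda$ and, for a regular model with $d$ parameters, $\lambda = d/2$. Phase transitions then correspond to the posterior shifting between regions of the loss landscape with different $(\lambda, m)$.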
1.7 SLT Foundations
There is a great deal of work connecting Singular Learning Theory to deep learning dynamics, so much that SLT and developmental interpretability are often used interchangeably.
- Singular Learning Theory resources
- Filan’s Singular Learning Theory
- Liam Carroll’s Distilling Singular Learning Theory - AI Alignment Forum
- Murfet’s SLT notes
- Singular Learning Theory (SLT) | Liam Carroll (see also, perhaps, Carroll (2021))
- Lehalleur et al. (2025) makes a formal argument that SLT is a theory of developmental interpretability and the right framework for understanding how neural networks learn.
This body of work argues that SLT explains:
- Discontinuous capability emergence through bifurcations in loss landscape geometry
- Bayesian posterior phase transitions in SGD-trained networks
- Fundamental connections between model complexity and generalisation
More at Singular Learning Theory.
2 Incoming
Developmental Interpretability Primer - Community hub for latest research
Sandy Fraser, Selective regularisation for alignment-focused representation engineering
Sandy sets out to build a toy model of representation engineering via curricula, and comes out with something that looks to me a lot like an empirical learning theory of disentanglement.