Developmental interpretability
2025-03-10 — 2025-11-15
Wherein the training evolution of neural networks is traced, abrupt phase transitions in loss geometry are examined, critical learning periods are identified, joint trajectory PCA across seeds is applied, and singular learning theory is invoked.
Developmental interpretability is an emerging subfield of AI interpretability that focuses on how neural networks develop capabilities during training. Rather than analyzing only fully trained models as static objects, we examine the dynamics of learning, capability emergence, and concept formation throughout training. It builds on mechanistic interpretability by adding a temporal dimension.
Much of this work explores scaling behaviour in training dynamics, particularly the phase transition when a model suddenly begins to generalize well.
See also the question of when we learn world models.
1 Key Research Directions
Position paper: Lehalleur et al. (2025).
2 Mechanistic Phase Transitions
This area studies discontinuous capability emergence during training, identifying critical learning thresholds and representational shifts.
3 Training Dynamics Analysis
We examine gradient behaviours, loss landscapes, and parameter-space geometry using frameworks like Singular Learning Theory (SLT).
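As a minimal, illustrative example of the kind of quantity tracked here (not taken from any of the cited works), the sketch below logs the gradient norm and a Hutchinson estimate of the Hessian trace on a fixed probe batch; the model, loss function, and probe batch are placeholders to be supplied by the caller.

```python
# Minimal sketch: tracking gradient and curvature statistics at checkpoints.
# `model`, `loss_fn`, and the probe batch (xb, yb) are placeholders.
import torch


def grad_norm_and_hessian_trace(model, loss_fn, xb, yb, n_probes=10):
    """Return (gradient norm, Hutchinson estimate of the Hessian trace)
    of the loss on a fixed probe batch."""
    params = [q for q in model.parameters() if q.requires_grad]
    loss = loss_fn(model(xb), yb)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    gnorm = torch.sqrt(sum((g ** 2).sum() for g in grads))

    trace_est = 0.0
    for _ in range(n_probes):
        # Rademacher probe vectors, one per parameter tensor.
        vs = [torch.randint_like(q, 2) * 2.0 - 1.0 for q in params]
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params, retain_graph=True)
        trace_est += sum((h * v).sum() for h, v in zip(hvs, vs)).item()
    return gnorm.item(), trace_est / n_probes
```

Called at each checkpoint on the same probe batch, this gives a crude time series of loss-landscape geometry; abrupt jumps are the kind of signal the phase-transition work above refines. Note that SLT proper prefers the local learning coefficient to Hessian-based measures, since the Hessian is degenerate exactly at the singularities of interest.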
3.1 Developmental Circuits Tracing
This traces how specific computational patterns (circuits) form from initialization.
See also (Berti, Giorgi, and Kasneci 2025; Teehan et al. 2022; Wei et al. 2022).
4 Grokking and Delayed Generalization
This line investigates sudden transitions from ‘memorization’ to ‘understanding’ (Power et al. 2022; Liu et al. 2022; Liu, Michaud, and Tegmark 2023).
We need to understand how much of the grokking-as-understanding argument depends on discovering compact circuit representations, how much depends on generalization, and how the two relate.
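A minimal sketch of the standard grokking setup, in the spirit of Power et al. (2022) but simplified to a one-hidden-layer MLP on one-hot inputs; the hyperparameters (train fraction, weight decay, width, step budget) are illustrative, and whether and when the delayed transition appears is sensitive to them.

```python
# Minimal grokking sketch: modular addition, small training fraction,
# heavy weight decay. Hyperparameters are illustrative, not tuned.
import torch
import torch.nn as nn
import torch.nn.functional as F

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
x = torch.cat([F.one_hot(pairs[:, 0], p), F.one_hot(pairs[:, 1], p)], dim=1).float()

perm = torch.randperm(len(x))
n_train = int(0.3 * len(x))          # small fraction encourages memorization first
train, test = perm[:n_train], perm[n_train:]

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss_fn(model(x[train]), labels[train]).backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x[train]).argmax(-1) == labels[train]).float().mean().item()
            test_acc = (model(x[test]).argmax(-1) == labels[test]).float().mean().item()
        print(f"{step:6d}  train_acc={train_acc:.3f}  test_acc={test_acc:.3f}")
        # Grokking signature: train_acc saturates long before test_acc moves.
```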
5 Component Trajectory Analysis
This tracks the evolution of individual neurons and layers during training:
- Visualizing Deep Network Training Trajectories with PCA
- Mao et al. (2024)
A biased but credible source says the following about Component Trajectory Analysis (CTA):
One thing I’d say is the Component Trajectory Analysis […] to my eyes, not very interesting, because PCA on timeseries basically just extracts Lissajous curves and therefore always looks like the same thing. [We can] make more sense of this by applying joint PCA to trajectories which vary in their training distribution.
and adds:
joint trajectory PCA is a powerful method when comparing trajectories across any variational parameter, so long as you are tracking model outputs (~losses) on a consistent dataset of examples. In the Linear Regression work that variational parameter controlled the training distribution and the number of “tasks” in that distribution. You can also vary the model scale, or just the random seed / initialisation, or really anything else
See Carroll et al. (2025).
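A sketch of the joint trajectory PCA described in the quote above: record per-example losses on a fixed evaluation set at each checkpoint for several runs that differ in some variational parameter (seed, data distribution, scale), fit a single PCA on all checkpoints pooled together, and read off each run's trajectory in the shared components. The file name and array shapes below are assumptions for illustration, not the actual pipeline of Mao et al. (2024) or Carroll et al. (2025).

```python
# Joint trajectory PCA sketch: project loss trajectories from several runs
# into one shared principal-component space. Shapes and names are illustrative.
import numpy as np
from sklearn.decomposition import PCA

# loss_curves[run][checkpoint] is a vector of per-example losses on one
# fixed evaluation set: shape (n_runs, n_checkpoints, n_eval_examples).
loss_curves = np.load("loss_curves.npy")   # hypothetical file
n_runs, n_ckpts, n_examples = loss_curves.shape

# Fit one PCA on every checkpoint of every run pooled together, so that all
# trajectories are expressed in the same coordinates.
pooled = loss_curves.reshape(n_runs * n_ckpts, n_examples)
pca = PCA(n_components=2)
pooled_2d = pca.fit_transform(pooled)
trajectories = pooled_2d.reshape(n_runs, n_ckpts, 2)

# Each trajectories[i] is one run's path through the shared PC1/PC2 plane;
# plotting them shows whether runs with different seeds (or distributions,
# or scales) follow the same developmental path.
for i, traj in enumerate(trajectories):
    print(f"run {i}: start {traj[0].round(2)}, end {traj[-1].round(2)}")
```

Fitting the PCA jointly is what avoids the Lissajous problem: PCA on a single smooth trajectory mostly recovers generic low-frequency structure, whereas the joint fit lets the differences between runs carry the signal.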
6 Curriculum and Data Influences
We study how the order and selection of training data impact capability development. [TODO clarify]
7 Statistical Mechanics of Learning
Statistical mechanics of learning provides a framework for understanding the emergent properties of learning systems, including phase transitions and critical phenomena. Much of the machinery comes from Singular Learning Theory.
8 SLT Foundations
There’s a lot of work connecting Singular Learning Theory to deep learning dynamics, so much that the terms “singular learning theory” and “developmental interpretability” are often used interchangeably.
- Singular Learning Theory resources
- Filan’s Singular Learning Theory
- Liam Carroll’s Distilling Singular Learning Theory - AI Alignment Forum
- Murfet’s SLT notes
- Singular Learning Theory (SLT) | Liam Carroll (see also, perhaps, Carroll (2021))
- Lehalleur et al. (2025) formally argues that SLT is a theory of developmental interpretability and the right framework for understanding how neural networks learn.
This body of work argues that SLT explains:
- Discontinuous capability emergence through bifurcations in loss landscape geometry
- Bayesian posterior phase transitions in SGD-trained networks
- Fundamental connections between model complexity and generalization
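The quantitative core of these claims, as usually stated in the SLT literature, is Watanabe's asymptotic expansion of the Bayesian free energy, quoted here informally and omitting regularity conditions:

$$
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1),
$$

where $F_n$ is the free energy (negative log marginal likelihood) of $n$ samples, $L_n$ the empirical negative log-likelihood, $w_0$ an optimal parameter, $\lambda$ the (local) learning coefficient or RLCT, and $m$ its multiplicity. On this account, posterior phase transitions correspond to changes in which region of parameter space, with which $\lambda$, dominates the expansion as $n$ or the data distribution varies.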
More at Singular Learning Theory.
9 Vis estimates of model dimension
See DOF estimates in NNs.
10 Critical periods
To understand critical learning periods within deep learning, it is helpful to first look at a related analogy to biological systems. Within humans and animals, critical periods are defined as times of early post-natal (i.e., after birth) development, during which impairments to learning (e.g., sensory deficits) can lead to permanent impairment of one’s skills [5]. For example, vision impairments at a young age—a critical period for the development of one’s eyesight—often lead to problems like amblyopia in adult humans.
[…] this concept of critical learning periods is still curiously relevant to deep learning, as the same behavior is exhibited within the learning process for neural networks. If a neural network is subjected to some impairment (e.g., only shown blurry images or not regularized properly) during the early phase of learning, the resulting network (after training is fully complete) will generalize more poorly relative to a network that never received such an early learning impairment, even given an unlimited training budget. Recovering from this early learning impairment is not possible.
(Achille, Rovere, and Soatto 2018; Elman 1993; Fukase 2025; Kleinman, Achille, and Soatto 2024)
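The experimental recipe behind these results is easy to sketch (in the general spirit of Achille, Rovere, and Soatto 2018, though the dataset, architecture, deficit, and schedule below are stand-ins chosen only for illustration): apply a deficit such as blurring for the first few epochs, remove it, keep training, and compare the final test accuracy against a control run that never saw the deficit (and, ideally, against a run that receives the same deficit only late in training).

```python
# Critical-period sketch: apply an input deficit (Gaussian blur) for the
# first `deficit_epochs` epochs, then remove it and keep training.
# Dataset, architecture, deficit, and schedule are illustrative stand-ins.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T
from torch.utils.data import DataLoader


def make_loader(blur: bool) -> DataLoader:
    tfs = [T.ToTensor()]
    if blur:
        tfs.insert(0, T.GaussianBlur(kernel_size=7, sigma=4.0))  # the deficit
    ds = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                      transform=T.Compose(tfs))
    return DataLoader(ds, batch_size=128, shuffle=True)


def train_with_deficit(deficit_epochs: int, total_epochs: int = 60) -> nn.Module:
    model = torchvision.models.resnet18(num_classes=10)
    opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9,
                          weight_decay=5e-4)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(total_epochs):
        loader = make_loader(blur=(epoch < deficit_epochs))
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model


# impaired = train_with_deficit(deficit_epochs=20)
# control  = train_with_deficit(deficit_epochs=0)
# The critical-period claim: `impaired` ends with measurably worse test
# accuracy than `control`, and extra training does not close the gap.
```

Achille, Rovere, and Soatto additionally track the trace of the Fisher information over training, which peaks during the critical period; the sketch above only covers the behavioural comparison.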
11 Incoming
Developmental Interpretability Primer — Community hub for the latest research
Sandy Fraser, Selective regularization for alignment-focused representation engineering
Sandy builds a toy model of representation engineering through curricula and ends up with something that looks a lot like an empirical learning theory of disentanglement.
