Fractal and self-similar behaviour in neural networks

2025-06-03 — 2025-06-04

AI safety
dynamical systems
machine learning
neural nets
physics
pseudorandomness
sciml
statistics
statmech
stochastic processes

There is a lot of fractal-like behaviour in NNs, and not all the senses in which “fractal-like” gets used are the same: Shai et al. (2025) find fractals in a transformer residual stream, for example, but there are also fractal loss landscapes, fractal optimiser paths…

I bet some of these things connect pretty well. Let’s find out.

1 Fractal loss landscapes

More loss landscape material here: Andreeva et al. (2024); Hennick and Baerdemacker (2025).

Estimation theory for fractal quantities (dimensions, Hölder exponents, …) from empirical samples is notoriously fiddly. I wonder if the following methods papers help? (Bessis et al. 1987; Bouchaud and Georges 1990; Volkhardt and Grubmüller 2022).
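
As a toy illustration of the fiddliness, here is a minimal Grassberger–Procaccia-style correlation-dimension estimator (my own sketch, not taken from any of the papers above; the function name is mine): the estimate depends sensitively on the fitting range of radii and on the sample size.

```python
import numpy as np

def correlation_dimension(points, r_values):
    """Grassberger-Procaccia-style estimate of the correlation dimension D2.

    points:   (n, d) array of samples from the set of interest
    r_values: radii at which to evaluate the correlation integral C(r)
    Returns the slope of log C(r) against log r.
    """
    n = len(points)
    # All pairwise distances (upper triangle only, i < j).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    pair_dists = dists[np.triu_indices(n, k=1)]

    # Correlation integral: fraction of pairs closer than r.
    C = np.array([(pair_dists < r).mean() for r in r_values])

    # Fit a line to log C(r) vs log r over the radii where C(r) > 0.
    mask = C > 0
    slope, _ = np.polyfit(np.log(r_values[mask]), np.log(C[mask]), 1)
    return slope

# Sanity check on a 2D Gaussian blob: the estimate should come out near 2,
# but try different radius ranges and sample sizes and watch it wobble.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
print(correlation_dimension(X, np.geomspace(0.05, 0.5, 20)))
```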

Ly and Gong (2025): Explicit, generative multifractal loss landscape model based on Hölder exponents; dynamics via fractional Langevin.
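
Below, a crude sketch of what “dynamics via fractional Langevin” can look like in the simplest possible setting: gradient descent on a toy loss perturbed by long-range-correlated (fractional Gaussian) noise. This is my own cartoon, not Ly and Gong’s model, which involves the multifractal landscape itself; all function names here are made up for illustration.

```python
import numpy as np

def fractional_gaussian_noise(n, hurst, rng):
    """Exact fractional Gaussian noise of length n via Cholesky factorisation
    of its autocovariance (fine for small n; not the fast method)."""
    k = np.arange(n)
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2.0 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))
    return L @ rng.normal(size=n)

def noisy_descent(grad, x0, steps, lr, noise_scale, hurst, rng):
    """Gradient descent plus fractional Gaussian noise: a cartoon of
    'fractional Langevin' dynamics (no memory kernel, no multifractality)."""
    noise = noise_scale * fractional_gaussian_noise(steps, hurst, rng)
    x, path = x0, [x0]
    for k in range(steps):
        x = x - lr * grad(x) + noise[k]
        path.append(x)
    return np.array(path)

rng = np.random.default_rng(1)
# Toy 1D quadratic loss L(x) = x^2 / 2, so grad(x) = x.
path = noisy_descent(lambda x: x, x0=2.0, steps=500,
                     lr=0.05, noise_scale=0.05, hurst=0.7, rng=rng)
print(path[-5:])
```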

2 Singular Learning Theory

Relatedly, the local learning coefficient (LLC) of singular learning theory is closely connected to the fractal dimension (!); see Hennick and Baerdemacker (2025), Watanabe (2022), and upcoming work from Zach Furman (Furman 2025).
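
For orientation, the standard SLT picture (this is textbook Watanabe, not specific to the papers above): the learning coefficient \(\lambda\) is the coefficient of \(\log n\) in the asymptotic expansion of the Bayes free energy,

\[
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1),
\]

where \(w_0\) is an optimal parameter and \(m\) the multiplicity. For a regular model \(\lambda = d/2\), half the parameter count, while in singular models it is generally smaller and can be fractional, which is what makes reading the (local) \(\lambda\) as an effective, possibly fractal, dimension plausible.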

3 Fractal SGD trajectories

Şimşekli et al. (2021): This paper models the trajectories of SGD as being well approximated by a Feller process, a class of continuous-time Markov processes that can be driven by more general noise than Brownian motion; in particular, by α-stable Lévy motion (Eq. 4 in their paper), which produces “heavy-tailed” jumps. The “fractal” aspect refers to the geometric properties, specifically the Hausdorff dimension, of the sample trajectories of these Feller processes: the path traced by the optimizer in parameter space is a fractal. The tail index α of the driving Lévy process (or, more generally, the upper Blumenthal–Getoor index of the Feller process) then becomes a measure of complexity or “capacity”. Lower α (heavier tails) means lower capacity and better generalization. This metric is appealing because it doesn’t necessarily grow with the number of parameters \(d\).
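
A toy illustration (my own sketch, not the paper’s setup; it assumes scipy is available, and the helper name hill_tail_index is mine): generate Gaussian vs. α-stable increments and estimate the tail index with a Hill estimator, which is one crude way an α could be read off from optimizer noise.

```python
import numpy as np
from scipy.stats import levy_stable

def hill_tail_index(samples, k=200):
    """Hill estimator of the tail index from the top-k order statistics of
    |samples|. Crude and sensitive to the choice of k."""
    x = np.sort(np.abs(samples))[::-1]   # descending
    return 1.0 / np.mean(np.log(x[:k] / x[k]))

rng = np.random.default_rng(0)

# "Gradient-noise-like" increments: Gaussian (alpha = 2, light tails)
# vs. heavy-tailed alpha-stable with alpha = 1.5.
gaussian = rng.normal(size=20_000)
stable15 = levy_stable.rvs(alpha=1.5, beta=0.0, size=20_000, random_state=rng)

# Gaussian tails are lighter than any power law, so the Hill estimate there is
# large and strongly k-dependent; for the stable sample it should land near 1.5.
print("Hill estimate, Gaussian increments: ", hill_tail_index(gaussian))
print("Hill estimate, alpha=1.5 increments:", hill_tail_index(stable15))

# Cumulative sums give discretized Brownian / Levy paths; in the paper's
# picture, lower alpha means a lower-dimensional, better-generalizing trajectory.
brownian_path = np.cumsum(gaussian)
levy_path = np.cumsum(stable15)
```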

Andreeva et al. (2024): This paper does not directly model the loss landscape as fractal. Instead, it treats the finite sequence of iterates (the training trajectory) \(W_S = \{w_k\}\) generated by a discrete-time optimizer as a point cloud in parameter space. The “fractal-like” or complex geometric properties of this trajectory are then quantified using tools from topological data analysis (TDA) and metric geometry. See Tolga Birdal’s presentation, Topological Complexity Measures as Proxies for Generalization in Neural Networks.
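
To make the trajectory-as-point-cloud idea concrete, here is a rough sketch of a PH0-style fractal-dimension proxy computed from subsampled iterates, using the fact that the finite \(H_0\) persistence lifetimes of a point cloud coincide with its minimum-spanning-tree edge lengths. This is in the spirit of the persistent-homology dimension estimators Andreeva et al. build on, not their actual estimator or bound; the function names and the scaling-law fit are my own simplifications.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def e_alpha(points, alpha=1.0):
    """Alpha-weighted total H0 persistence of a point cloud: for a Rips
    filtration, the finite H0 lifetimes are the MST edge lengths."""
    mst = minimum_spanning_tree(squareform(pdist(points)))
    return (mst.data ** alpha).sum()

def ph0_dimension(points, sizes, alpha=1.0, rng=None):
    """Crude PH0 dimension estimate: for samples from a d-dimensional set,
    E_alpha(n) grows like n^{(d - alpha)/d}, so fit the growth rate over
    random subsamples of increasing size n and invert."""
    rng = rng if rng is not None else np.random.default_rng(0)
    E = [e_alpha(points[rng.choice(len(points), n, replace=False)], alpha)
         for n in sizes]
    slope, _ = np.polyfit(np.log(sizes), np.log(E), 1)
    return alpha / (1.0 - slope)

# Toy "trajectory": a random walk in R^10 standing in for the iterates W_S.
# A Brownian-like path has fractal dimension about 2 regardless of ambient
# dimension, so the estimate should land somewhere near there.
rng = np.random.default_rng(0)
W = np.cumsum(rng.normal(size=(4000, 10)), axis=0)
print(ph0_dimension(W, sizes=[250, 500, 1000, 2000, 4000], rng=rng))
```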

4 Other weird things

5 References

Andreeva, Dupuis, Sarkar, et al. 2024. “Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms.”
Bessis, Fournier, Servizi, et al. 1987. “Mellin Transforms of Correlation Integrals and Generalized Dimension of Strange Sets.” Physical Review A.
Bouchaud, and Georges. 1990. “Anomalous Diffusion in Disordered Media: Statistical Mechanisms, Models and Physical Applications.” Physics Reports.
Carroll. 2021. “Phase Transitions in Neural Networks.”
Farrugia-Roberts, Murfet, and Geard. 2022. “Structural Degeneracy in Neural Networks.”
Furman. 2025. “LLC as Fractal Dimension.”
Hennick, and Baerdemacker. 2025. “Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent.”
Lau, Furman, Wang, et al. 2024. “The Local Learning Coefficient: A Singularity-Aware Complexity Measure.”
Ly, and Gong. 2025. “Optimization on Multifractal Loss Landscapes Explains a Diverse Range of Geometrical and Dynamical Properties of Deep Learning.” Nature Communications.
Shai, Marzen, Teixeira, et al. 2025. “Transformers Represent Belief State Geometry in Their Residual Stream.”
Şimşekli, Sener, Deligiannidis, et al. 2021. “Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks.” Journal of Statistical Mechanics: Theory and Experiment.
Volkhardt, and Grubmüller. 2022. “Estimating Ruggedness of Free-Energy Landscapes of Small Globular Proteins from Principal Component Analysis of Molecular Dynamics Trajectories.” Physical Review E.
Watanabe. 2022. “Recent Advances in Algebraic Geometry and Bayesian Statistics.”
Wei, and Lau. 2023. “Variational Bayesian Neural Networks via Resolution of Singularities.” Journal of Computational and Graphical Statistics.
Wei, Murfet, Gong, et al. 2023. “Deep Learning Is Singular, and That’s Good.” IEEE Transactions on Neural Networks and Learning Systems.