Fractal and self-similar behaviour in neural networks

2025-06-03 — 2025-06-04

Wherein fractal behaviour in neural networks is surveyed, and SGD trajectories are described via Hausdorff dimensions of optimizer paths, while multifractal loss landscapes are modelled using Hölder exponents and fractional Langevin dynamics.

AI safety
dynamical systems
machine learning
neural nets
physics
pseudorandomness
sciml
statistics
statmech
stochastic processes

There’s a lot of fractal-like behaviour in NNs. Not all the senses in which “fractal-like behaviour” is used are the same; for example, Figure 2 shows fractals in a transformer residual stream (Shai et al. 2025), which is not obviously the same phenomenon as fractal loss landscapes or fractal optimiser paths…

I bet some of these things connect pretty well. Let’s find out.

1 Fractal loss landscapes

More on loss landscape management here (Andreeva et al. 2024; Hennick and Baerdemacker 2025).

Estimating fractal qualities from empirical samples is notoriously fiddly. I wonder if the following methods papers help? (Bessis et al. 1987; Bouchaud and Georges 1990; Volkhardt and Grubmüller 2022).
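To make “fiddly” concrete, here is the textbook Grassberger–Procaccia correlation-integral estimate that those methods papers refine. A minimal sketch only; the radii, sample sizes and sanity-check cloud are my own illustrative choices, not anything taken from the cited papers.

```python
# Sketch: Grassberger-Procaccia correlation-dimension estimate for an
# empirical point cloud. Illustrative only; radii and sample sizes are
# arbitrary choices, and the estimate is sensitive to both.
import numpy as np
from scipy.spatial.distance import pdist


def correlation_dimension(points, n_radii=20):
    """Estimate the correlation dimension of a point cloud (n_samples, n_dims)."""
    d = pdist(points)  # all pairwise distances
    # Probe radii inside a rough "scaling region" of the distance distribution.
    radii = np.geomspace(np.quantile(d, 0.01), np.quantile(d, 0.2), n_radii)
    # Correlation integral C(r): fraction of pairs closer than r.
    C = np.array([(d < r).mean() for r in radii])
    mask = C > 0
    # Correlation dimension = slope of log C(r) against log r.
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
    return slope


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Sanity check: a 2-D square embedded in 10-D should come out near 2.
    cloud = np.zeros((2000, 10))
    cloud[:, :2] = rng.uniform(size=(2000, 2))
    print(correlation_dimension(cloud))
```

The usual failure modes are already visible here: the answer depends on which range of radii you declare to be the scaling region, and edge effects bias the slope low.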

Ly and Gong (2025): Explicit, generative multifractal loss landscape model based on Hölder exponents; dynamics via fractional Langevin.
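I have not reproduced their construction, but as a reminder of what “dynamics via fractional Langevin” means operationally, here is a toy sketch: overdamped dynamics on a made-up rugged 1-D potential, driven by fractional Gaussian noise with Hurst index \(H\), generated by brute-force Cholesky factorisation of the increment covariance. Every specific below (the potential, \(H\), step sizes) is an assumption for illustration, not from the paper.

```python
# Sketch: overdamped dynamics on a toy rugged 1-D potential, driven by
# fractional Gaussian noise (Hurst index H). NOT Ly and Gong's model;
# the potential and all parameters are made up for illustration.
import numpy as np


def fractional_gaussian_noise(n, hurst, dt, rng):
    """Increments of fractional Brownian motion, via Cholesky of the fGn covariance."""
    k = np.arange(n)
    # Autocovariance of unit-step fGn, scaled to step size dt.
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = dt ** (2 * hurst) * gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))  # jitter for numerical safety
    return L @ rng.standard_normal(n)


def grad_U(x):
    # Toy rugged landscape U(x) = x**2 / 2 + 0.05 * sin(20 * x): broad well plus ripples.
    return x + np.cos(20 * x)


def simulate(hurst=0.3, n_steps=2000, dt=1e-3, sigma=0.5, x0=2.0, seed=0):
    """Euler scheme for dx = -U'(x) dt + sigma dB_H(t)."""
    rng = np.random.default_rng(seed)
    dB = fractional_gaussian_noise(n_steps, hurst, dt, rng)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        x[i + 1] = x[i] - grad_U(x[i]) * dt + sigma * dB[i]
    return x


if __name__ == "__main__":
    path = simulate(hurst=0.3)  # H < 1/2: anti-persistent increments
    print(path[-5:])
```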

2 Singular Learning Theory

Relatedly, the local learning coefficient (LLC) of singular learning theory (Lau et al. 2024) is closely connected to fractal dimension (!). [TODO clarify] See Hennick and Baerdemacker (2025), Watanabe (2022), and upcoming work from Zach Furman (Furman 2025).

3 Fractal SGD trajectories

Şimşekli et al. (2021): This paper models SGD trajectories as well approximated by a Feller process, a class of continuous-time Markov processes. The Feller process they consider can be driven by more general noise than Brownian motion: specifically, α-stable Lévy motion (Eq. 4 in their paper), which produces “heavy-tailed” jumps. The “fractal” aspect refers to the geometric properties (specifically the Hausdorff dimension) of the sample trajectories of these Feller processes: the path traced by the optimizer in parameter space is itself a fractal. The tail index α of the driving Lévy process (or, more generally, the upper Blumenthal–Getoor index \(\beta_S\) of the Feller process) then becomes a measure of complexity or “capacity”: lower α (heavier tails) means lower capacity and better generalization. This metric is appealing because it doesn’t necessarily grow with the number of parameters d.
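To fix ideas, a toy sketch, emphatically not their experimental setup: a one-dimensional SGD-like recursion driven by symmetric α-stable noise (via scipy.stats.levy_stable), plus a crude Hill estimate of the tail index recovered from the iterate increments. Hill is known to be biased for stable laws, so treat the number as indicative only; the recursion, learning rate and estimator choice are all mine.

```python
# Sketch: SGD-like iterates driven by heavy-tailed (alpha-stable) noise, plus a
# crude Hill estimate of the tail index from the increments. Toy illustration
# only; not the setup or the estimator used by Simsekli et al.
import numpy as np
from scipy.stats import levy_stable


def simulate_iterates(alpha=1.7, lr=0.05, curvature=1.0, n_steps=20_000, seed=0):
    """w_{k+1} = w_k - lr * curvature * w_k + lr * xi_k, with symmetric alpha-stable xi."""
    rng = np.random.default_rng(seed)
    xi = levy_stable.rvs(alpha, 0.0, size=n_steps, random_state=rng)
    w = np.empty(n_steps + 1)
    w[0] = 0.0
    for k in range(n_steps):
        w[k + 1] = w[k] - lr * curvature * w[k] + lr * xi[k]
    return w


def hill_estimator(x, k_frac=0.05):
    """Hill estimate of the tail index from the largest |x| order statistics."""
    a = np.sort(np.abs(x))[::-1]  # descending order statistics
    k = max(int(k_frac * len(a)), 10)
    return 1.0 / np.mean(np.log(a[:k] / a[k]))


if __name__ == "__main__":
    w = simulate_iterates(alpha=1.7)
    increments = np.diff(w)
    # Should land somewhere near the true alpha = 1.7; Hill is biased for stable laws.
    print("tail-index estimate:", hill_estimator(increments))
```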

Andreeva et al. (2024): This paper does not directly model the loss landscape as fractal. Instead, it treats the finite sequence of iterates (the training trajectory) \(W_S = \{w_k\}\) generated by a discrete-time optimizer as a point cloud in parameter space. The “fractal-like” or complex geometric properties of this trajectory are then quantified using tools from topological data analysis (TDA) and metric geometry. See Tolga Birdal’s presentation, Topological Complexity Measures as Proxies for Generalization in Neural Networks.
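As a self-contained stand-in for that kind of measurement (not the exact quantity their bounds control), here is the 0-dimensional persistent-homology fractal-dimension estimate of a point cloud, which reduces to minimum-spanning-tree edge statistics. The subsample sizes and the sanity-check “trajectory” below are made up.

```python
# Sketch: a fractal-dimension proxy for an optimizer trajectory treated as a
# point cloud: the 0-dimensional persistent-homology (equivalently, minimum
# spanning tree) scaling estimate. A self-contained stand-in, not the exact
# quantity appearing in Andreeva et al.'s bounds.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree


def mst_total_length(points, power=1.0):
    """Sum of MST edge lengths**power (== sum of 0-dim Rips persistence lifetimes**power)."""
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist)
    return (mst.data ** power).sum()


def ph0_dimension(points, sizes=(200, 400, 800, 1600), power=1.0, seed=0):
    """Fit E^0_power(n) ~ n**(1 - power/d) over subsample sizes n, then solve for d."""
    rng = np.random.default_rng(seed)
    sizes = [n for n in sizes if n <= len(points)]
    log_E = [np.log(mst_total_length(points[rng.choice(len(points), n, replace=False)], power))
             for n in sizes]
    slope, _ = np.polyfit(np.log(sizes), log_E, 1)
    return power / (1.0 - slope)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Sanity check: a 2-D Gaussian cloud embedded in 50-D should come out near 2.
    traj = np.zeros((2000, 50))
    traj[:, :2] = rng.standard_normal((2000, 2))
    print(ph0_dimension(traj))
```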

4 Other weird things

5 References

Andreeva, Dupuis, Sarkar, et al. 2024. “Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms.”
Bessis, Fournier, Servizi, et al. 1987. “Mellin Transforms of Correlation Integrals and Generalized Dimension of Strange Sets.” Physical Review A.
Bouchaud, and Georges. 1990. “Anomalous Diffusion in Disordered Media: Statistical Mechanisms, Models and Physical Applications.” Physics Reports.
Carroll. 2021. “Phase Transitions in Neural Networks.”
Farrugia-Roberts, Murfet, and Geard. 2022. “Structural Degeneracy in Neural Networks.”
Furman. 2025. “LLC as Fractal Dimension.”
Hennick, and Baerdemacker. 2025. “Almost Bayesian: The Fractal Dynamics of Stochastic Gradient Descent.”
Lau, Furman, Wang, et al. 2024. “The Local Learning Coefficient: A Singularity-Aware Complexity Measure.”
Ly, and Gong. 2025. “Optimization on Multifractal Loss Landscapes Explains a Diverse Range of Geometrical and Dynamical Properties of Deep Learning.” Nature Communications.
Shai, Marzen, Teixeira, et al. 2025. “Transformers Represent Belief State Geometry in Their Residual Stream.”
Şimşekli, Sener, Deligiannidis, et al. 2021. “Hausdorff Dimension, Heavy Tails, and Generalization in Neural Networks.” Journal of Statistical Mechanics: Theory and Experiment.
Volkhardt, and Grubmüller. 2022. “Estimating Ruggedness of Free-Energy Landscapes of Small Globular Proteins from Principal Component Analysis of Molecular Dynamics Trajectories.” Physical Review E.
Watanabe. 2022. “Recent Advances in Algebraic Geometry and Bayesian Statistics.”
Wei, and Lau. 2023. “Variational Bayesian Neural Networks via Resolution of Singularities.” Journal of Computational and Graphical Statistics.
Wei, Murfet, Gong, et al. 2023. “Deep Learning Is Singular, and That’s Good.” IEEE Transactions on Neural Networks and Learning Systems.