Fractal and self-similar behaviour in neural networks
2025-06-03 — 2025-06-04
Wherein fractal behaviour in neural networks is surveyed, and SGD trajectories are described via Hausdorff dimensions of optimizer paths, while multifractal loss landscapes are modelled using Hölder exponents and fractional Langevin dynamics.
There’s a lot of fractal-like behaviour in NNs, but not all the senses in which “fractal-like behaviour” is used are the same; for example, Figure 2 finds fractals in a transformer residual stream, but there are also fractal loss landscapes and fractal optimiser paths…
I bet some of these things connect pretty well. Let’s find out.
1 Fractal loss landscapes
More on loss landscapes here (Andreeva et al. 2024; Hennick and Baerdemacker 2025).
Estimating fractal qualities from empirical samples is notoriously fiddly. I wonder if the following methods papers help? (Bessis et al. 1987; Bouchaud and Georges 1990; Volkhardt and Grubmüller 2022).
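To make “fiddly” concrete, here is a minimal box-counting sketch for a point cloud of samples (pure NumPy; the scale range, fit window, and synthetic test sets are my own choices, not taken from the methods papers above):

```python
import numpy as np

def box_counting_dimension(points, n_scales=5):
    """Estimate the box-counting dimension of a point cloud.

    Counts occupied boxes N(eps) over a geometric range of box sizes and
    fits the slope of log N(eps) against log(1/eps). The scale range is a
    hand-picked compromise: too-fine boxes saturate at one point per box.
    """
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    extent = (points.max(axis=0) - lo).max()
    # Box sizes extent/4, extent/8, ..., geometrically decreasing.
    epsilons = extent / 2.0 ** np.arange(2, n_scales + 2)
    counts = []
    for eps in epsilons:
        # Assign each point to an integer grid cell of side eps.
        cells = np.floor((points - lo) / eps).astype(np.int64)
        counts.append(len({tuple(c) for c in cells}))
    slope, _ = np.polyfit(np.log(1.0 / epsilons), np.log(counts), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    square = rng.uniform(size=(20_000, 2))          # filled square: dim ~ 2
    theta = rng.uniform(0, 2 * np.pi, size=20_000)
    circle = np.c_[np.cos(theta), np.sin(theta)]    # circle: dim ~ 1
    print("square:", box_counting_dimension(square))
    print("circle:", box_counting_dimension(circle))
```

Even on these easy synthetic sets the estimate is sensitive to the range of box sizes: too-coarse boxes blur the geometry and too-fine boxes saturate at one point per box, which is much of what makes empirical fractal estimation fiddly.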
Ly and Gong (2025): Explicit, generative multifractal loss landscape model based on Hölder exponents; dynamics via fractional Langevin.
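I haven’t worked through their construction, but the flavour of a landscape whose roughness is set by a Hölder exponent can be sketched with a Weierstrass-type sum; letting the exponent vary with position gives a crude multifractal feel. This is my own toy, not Ly and Gong’s model, and the base b, number of terms, and exponent profile are arbitrary choices:

```python
import numpy as np

def weierstrass_landscape(x, holder, b=2.0, n_terms=30):
    """Weierstrass-type sum with a position-dependent Hoelder exponent.

    f(x) = sum_n b**(-n * H(x)) * cos(b**n * pi * x)

    With constant H in (0, 1) this is the classical Weierstrass function,
    Hoelder-continuous with exponent H (smaller H = rougher graph). Letting
    H vary with x gives a crude multifractal-flavoured landscape. This is a
    toy, not the construction in Ly and Gong (2025).
    """
    x = np.asarray(x, dtype=float)
    h = np.asarray(holder(x), dtype=float)
    n = np.arange(n_terms)[:, None]                # shape (n_terms, 1)
    amplitudes = b ** (-n * h[None, :])            # decay controlled by H(x)
    return (amplitudes * np.cos(b ** n * np.pi * x[None, :])).sum(axis=0)

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 4000)
    # Smooth-ish (H ~ 0.8) on the left, rough (H ~ 0.3) on the right.
    y = weierstrass_landscape(x, holder=lambda x: 0.8 - 0.5 * x)
    print(y[:3])
```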
2 Singular Learning Theory
Relatedly, the local learning coefficient (LLC) of singular learning theory is closely connected to the fractal dimension (!). [TODO clarify] See Hennick and Baerdemacker (2025), Watanabe (2022), and upcoming work from Zach Furman.
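For orientation (my gloss, not a claim from the cited works): in Watanabe’s theory the learning coefficient \(\lambda\) enters the asymptotic free energy as

\[
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1),
\]

where \(m\) is its multiplicity, and for a regular model \(\lambda = d/2\). So the LLC already behaves like an effective (half-)dimension near \(w_0\); whether and how it matches a fractal dimension of the relevant sets is exactly the part flagged [TODO] above.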
3 Fractal SGD trajectories
Şimşekli et al. (2021): This paper models SGD trajectories as well approximated by a Feller process, a class of continuous-time Markov processes. The Feller process they consider can be driven by more general noise than Brownian motion — specifically, α-stable Lévy motion (Eq. 4 in their paper), which produces “heavy-tailed” jumps. The “fractal” aspect refers to the geometric properties (specifically the Hausdorff dimension) of the sample trajectories of these Feller processes. The idea is that the path traced by the optimizer in parameter space is a fractal. The tail index α of the driving Lévy process (or more generally, the upper Blumenthal–Getoor index \(\beta\) of the Feller process) becomes a measure of complexity or “capacity”. Lower α (heavier tails) means lower capacity and better generalization. This metric is appealing because it doesn’t necessarily grow with the number of parameters \(d\).
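A toy version of the heavy-tailed picture is easy to play with: run a discretized Lévy-driven “SGD” on a quadratic and check that a tail-index estimator reports heavier tails for smaller α. The quadratic loss, step sizes, SciPy’s levy_stable sampler, and the Hill estimator are all my own choices here, not the setup or estimator used by Şimşekli et al.:

```python
import numpy as np
from scipy.stats import levy_stable

def heavy_tailed_sgd_path(alpha, n_steps=20_000, eta=0.01, sigma=0.1, seed=0):
    """Toy 'SGD' on a 2-D quadratic, driven by symmetric alpha-stable noise.

    w_{k+1} = w_k - eta * grad(w_k) + eta**(1/alpha) * sigma * xi_k.
    alpha = 2 recovers Gaussian noise; smaller alpha gives heavier tails and
    occasional huge jumps. Loss, step size, and scalings are toy choices.
    """
    noise = levy_stable.rvs(alpha, 0.0, size=(n_steps, 2),
                            random_state=np.random.default_rng(seed))
    w = np.zeros(2)
    path = np.empty((n_steps, 2))
    for k in range(n_steps):
        grad = w                                   # gradient of 0.5 * ||w||^2
        w = w - eta * grad + eta ** (1.0 / alpha) * sigma * noise[k]
        path[k] = w
    return path

def hill_tail_index(samples, k=500):
    """Hill estimator of a power-law tail index from the k largest values."""
    x = np.sort(np.asarray(samples))[::-1]
    return 1.0 / np.mean(np.log(x[:k] / x[k]))

if __name__ == "__main__":
    # Heavier-tailed runs should give smaller Hill estimates. alpha = 2 is the
    # Gaussian case with no power-law tail, so its estimate is large and not
    # very meaningful -- the contrast is the point, not the exact numbers.
    for alpha in (1.2, 1.6, 2.0):
        steps = np.diff(heavy_tailed_sgd_path(alpha), axis=0)
        print(alpha, hill_tail_index(np.linalg.norm(steps, axis=1)))
```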
Andreeva et al. (2024): This paper does not directly model the loss landscape as fractal. Instead, it treats the finite sequence of iterates (the training trajectory) \(W_S = \{w_k\}\) generated by a discrete-time optimizer as a point cloud in parameter space. The “fractal-like” or complex geometric properties of this trajectory are then quantified using tools from topological data analysis (TDA) and metric geometry. See Tolga Birdal’s presentation, Topological Complexity Measures as Proxies for Generalization in Neural Networks.
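As a sketch of the trajectory-as-point-cloud idea (not Andreeva et al.’s actual pipeline): the finite 0-dimensional Vietoris–Rips persistence lifetimes of a point cloud coincide with the edge lengths of its Euclidean minimum spanning tree, so an α-weighted lifetime sum and a Schweinhart/Birdal-style PH0 dimension estimate can be computed with plain SciPy. The subsample schedule and the synthetic “trajectory” below are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def e_alpha_0(points, alpha=1.0):
    """Alpha-weighted sum of finite 0-dimensional persistence lifetimes.

    For a Vietoris-Rips filtration of a point cloud, the finite H0 lifetimes
    are exactly the edge lengths of a Euclidean minimum spanning tree, so no
    TDA library is needed: sum MST edge lengths raised to the power alpha.
    """
    dist = squareform(pdist(points))
    mst = minimum_spanning_tree(dist)
    return float(np.sum(mst.data ** alpha))

def ph0_dimension(points, alpha=1.0, n_subsamples=8, seed=0):
    """Crude PH0-dimension estimate, Schweinhart / Birdal-et-al. style.

    E_alpha^0(n) ~ n**((d - alpha) / d) for n points drawn from a d-dimensional
    set, so regress log E on log n over subsamples and solve
    d = alpha / (1 - slope). Subsample sizes are arbitrary choices.
    """
    rng = np.random.default_rng(seed)
    sizes = np.geomspace(200, len(points), n_subsamples).astype(int)
    log_n, log_e = [], []
    for n in sizes:
        idx = rng.choice(len(points), size=n, replace=False)
        log_n.append(np.log(n))
        log_e.append(np.log(e_alpha_0(points[idx], alpha)))
    slope, _ = np.polyfit(log_n, log_e, 1)
    return alpha / (1.0 - slope)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Stand-in for a training trajectory: a noisy curve embedded in R^10,
    # so the estimate should come out near 1.
    t = np.sort(rng.uniform(size=2000))
    curve = np.stack([np.sin((k + 1) * t) for k in range(10)], axis=1)
    traj = curve + 1e-3 * rng.normal(size=curve.shape)
    print("PH0 dimension estimate:", ph0_dimension(traj))
```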