Fractal and self-similar behaviour in neural networks
2025-06-03 — 2025-06-04
There is a lot of fractal-like behaviour in NNs, and not all the senses in which “fractal-like behaviour” is used are the same: Figure 2, for example, shows fractals in a transformer residual stream, but there are also fractal loss landscapes, fractal optimiser paths…
I bet some of these things connect pretty well. Let’s find out.
1 Fractal loss landscapes
More on loss landscapes here (Andreeva et al. 2024; Hennick and Baerdemacker 2025).
Estimation theory for fractal qualities from empirical samples is notoriously fiddly. I wonder if the following methods papers help? (Bessis et al. 1987; Bouchaud and Georges 1990; Volkhardt and Grubmüller 2022).
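To make the fiddliness concrete, here is a minimal box-counting sketch of my own (not taken from any of the cited papers): count occupied grid cells at several scales ε and regress log N(ε) on log(1/ε). In practice the choice of scaling range, the sample size, and the embedding dimension all bite.

```python
# Minimal box-counting dimension estimator for an empirical point cloud.
# My own illustration, not from the cited methods papers.
import numpy as np

def box_counting_dimension(points, epsilons):
    """points: (n, d) array; epsilons: sequence of box sizes."""
    points = np.asarray(points, dtype=float)
    counts = []
    for eps in epsilons:
        # Assign each point to a grid cell of side eps and count occupied cells.
        cells = np.floor(points / eps).astype(np.int64)
        counts.append(len({tuple(c) for c in cells}))
    # Slope of log N(eps) vs log(1/eps) estimates the box-counting dimension.
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(epsilons)), np.log(counts), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Sanity check on a filled unit square: the estimate should land close to 2.
    pts = rng.uniform(size=(20_000, 2))
    print(box_counting_dimension(pts, epsilons=np.geomspace(0.2, 0.02, 8)))
```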
Ly and Gong (2025): Explicit, generative multifractal loss landscape model based on Hölder exponents; dynamics via fractional Langevin.
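For intuition about what “dynamics via fractional Langevin” can look like, here is a toy sketch of my own, not Ly and Gong’s actual model: plain gradient descent on a quadratic loss perturbed by fractional Gaussian noise with Hurst exponent H, where the fGn is sampled exactly via a Cholesky factor of its covariance.

```python
# Toy sketch only: gradient descent driven by fractional Gaussian noise
# (Hurst exponent H), loosely in the spirit of fractional Langevin dynamics.
# The model, scales, and hyperparameters are my own choices.
import numpy as np

def fractional_gaussian_noise(n, hurst, rng):
    """Exact (Cholesky) sampling of n unit-spacing increments of fractional Brownian motion."""
    k = np.arange(n)
    # Autocovariance of fGn: 0.5 * (|k+1|^2H - 2|k|^2H + |k-1|^2H)
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    return np.linalg.cholesky(cov) @ rng.standard_normal(n)

def fractional_langevin_path(grad, w0, n_steps, lr, noise_scale, hurst, rng):
    """Euler scheme: w_{t+1} = w_t - lr * grad(w_t) + noise_scale * dB^H_t."""
    dim = len(w0)
    noise = np.stack([fractional_gaussian_noise(n_steps, hurst, rng)
                      for _ in range(dim)], axis=1)
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for t in range(n_steps):
        w = w - lr * grad(w) + noise_scale * noise[t]
        path.append(w.copy())
    return np.array(path)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    quadratic_grad = lambda w: w          # gradient of 0.5 * ||w||^2
    path = fractional_langevin_path(quadratic_grad, [2.0, -1.0],
                                    n_steps=2000, lr=1e-2,
                                    noise_scale=0.05, hurst=0.3, rng=rng)
    print(path[-1])
```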
2 Singular Learning Theory
Relatedly, the local learning coefficient (LLC) of singular learning theory is closely connected to the fractal dimension (!); see Hennick and Baerdemacker (2025), Watanabe (2022), and upcoming work from Zach Furman.
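For reference, here is a toy, from-scratch rendering of the SGLD-style LLC estimator used in this literature, roughly \(\hat\lambda \approx n\beta\,(\mathbb{E}[L_n(w)] - L_n(w^*))\) with the expectation over a tempered posterior localised at \(w^*\), on the 1-d singular model \(L(w) = w^4\), whose true learning coefficient is 1/4. The hyperparameters and the example are my own choices, not from the cited works.

```python
# Rough sketch of an SGLD-based local learning coefficient (LLC) estimator
# on a toy singular model L(w) = w^4 (true LLC = 1/4). My own construction;
# not code from the cited papers.
import numpy as np

def estimate_llc(loss, grad_loss, w_star, n, beta, gamma, step, n_steps, rng):
    """lambda_hat = n * beta * (E_posterior[L(w)] - L(w_star))."""
    w_star = np.asarray(w_star, dtype=float)
    w = w_star.copy()
    samples = []
    for _ in range(n_steps):
        # Langevin step targeting exp(-n*beta*L(w) - gamma/2 * |w - w_star|^2)
        drift = -(n * beta) * grad_loss(w) - gamma * (w - w_star)
        w = w + 0.5 * step * drift + np.sqrt(step) * rng.standard_normal(w.shape)
        samples.append(loss(w))
    burn_in = n_steps // 2
    return n * beta * (np.mean(samples[burn_in:]) - loss(w_star))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 1_000                      # nominal sample size
    beta = 1.0 / np.log(n)         # inverse temperature; a common 1/log(n) choice
    loss = lambda w: float(np.sum(w ** 4))
    grad = lambda w: 4.0 * w ** 3
    lam = estimate_llc(loss, grad, w_star=[0.0], n=n, beta=beta,
                       gamma=1.0, step=1e-4, n_steps=50_000, rng=rng)
    print(f"estimated LLC ~ {lam:.3f}   (true value for w^4 is 0.25)")
```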
3 Fractal SGD trajectories
Şimşekli et al. (2021): This paper models SGD trajectories as well-approximated by a Feller process, a class of continuous-time Markov processes. The Feller process they consider can be driven by more general noise than Brownian motion; specifically, it can include α-stable Lévy motion (Eq. 4 in their paper), which produces “heavy-tailed” jumps. The “fractal” aspect refers to the geometric properties (specifically the Hausdorff dimension) of the sample trajectories of these Feller processes: the path traced by the optimizer in parameter space is a fractal. The tail index α of the driving Lévy process (or, more generally, the upper Blumenthal–Getoor index $\beta_S$ of the Feller process) becomes a measure of complexity or “capacity”. Lower α (heavier tails) means lower capacity and better generalization. This metric is appealing because it doesn’t necessarily grow with the number of parameters d.
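To see the ingredients concretely, here is a sketch of my own construction (not the authors’ code): drive a toy SGD-like recursion with symmetric α-stable noise via the Chambers–Mallows–Stuck sampler, then read a tail index back off the step increments with a generic Hill estimator. Şimşekli et al. use a dedicated tail-index estimator rather than Hill, which is known to be biased for stable laws with α close to 2.

```python
# Sketch: toy SGD-like recursion driven by symmetric alpha-stable noise,
# with the tail index recovered from the step increments. My own construction,
# not the estimator or code of Şimşekli et al. (2021).
import numpy as np

def sym_alpha_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for standard symmetric alpha-stable noise (alpha != 1)."""
    v = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * v) / np.cos(v) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))

def hill_tail_index(x, k):
    """Hill estimator of the tail index from the k largest values of |x|."""
    a = np.sort(np.abs(x))
    return 1.0 / np.mean(np.log(a[-k:]) - np.log(a[-k - 1]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha, lr, n_steps = 1.2, 1e-3, 200_000
    grad = lambda w: w                      # toy quadratic loss 0.5 * w^2
    noise = sym_alpha_stable(alpha, n_steps, rng)
    w, increments = 1.0, np.empty(n_steps)
    for t in range(n_steps):
        increments[t] = -lr * grad(w) + lr ** (1.0 / alpha) * noise[t]
        w += increments[t]
    print("true alpha:", alpha, "Hill estimate:", hill_tail_index(increments, k=2000))
```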
Andreeva et al. (2024): This paper does not directly model the loss landscape as fractal. Instead, it treats the finite sequence of iterates (the training trajectory) \(W_S = \{w_k\}\) generated by a discrete-time optimizer as a point cloud in parameter space. The “fractal-like” or complex geometric properties of this trajectory are then quantified using tools from topological data analysis (TDA) and metric geometry. See Tolga Birdal’s presentation, Topological Complexity Measures as Proxies for Generalization in Neural Networks.
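As a rough illustration of the kind of quantity involved (loosely following the persistent-homology-dimension recipe from Birdal and collaborators’ earlier work, not necessarily the exact complexity measures of Andreeva et al. 2024): compute the 0-dimensional persistence lifetimes of the iterate point cloud, which for a Rips filtration are just the minimum-spanning-tree edge lengths, and estimate a PH dimension from how the α-weighted lifetime sum scales with the number of points.

```python
# Sketch of a persistent-homology-dimension estimate for a trajectory point
# cloud. H0 lifetimes of the Rips filtration equal MST edge lengths, so scipy
# suffices; the subsampling-and-slope recipe is a simplified rendering, not
# the authors' exact pipeline.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_lifetimes(points):
    """Finite H0 lifetimes of the Rips filtration = MST edge lengths."""
    dist = squareform(pdist(points))
    return minimum_spanning_tree(dist).data  # the n-1 edge weights

def ph_dimension(points, alpha=1.0, n_subsamples=8, rng=None):
    """Fit log E_alpha^0 ~ slope * log n over subsamples; dim_PH = alpha / (1 - slope)."""
    if rng is None:
        rng = np.random.default_rng(0)
    points = np.asarray(points, dtype=float)
    n = len(points)
    sizes = np.unique(np.linspace(n // 4, n, n_subsamples, dtype=int))
    log_e = []
    for m in sizes:
        idx = rng.choice(n, size=m, replace=False)
        log_e.append(np.log(np.sum(h0_lifetimes(points[idx]) ** alpha)))
    slope, _ = np.polyfit(np.log(sizes), log_e, 1)
    return alpha / (1.0 - slope)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in "trajectory": a random walk in 10-d parameter space
    # (a Brownian path has Hausdorff dimension 2, so expect a ballpark-2 estimate).
    trajectory = np.cumsum(0.01 * rng.standard_normal((2000, 10)), axis=0)
    print("PH-dimension estimate:", ph_dimension(trajectory, alpha=1.0, rng=rng))
```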