Platonic and convergent representations in neural nets
2024-12-20 — 2026-06-29
Wherein the Convergence of Internal Representations Across Neural Networks and Human Brains Is Examined, With Attention to Findings That Embedding Models of Differing Architectures May Be Mapped Between One Another Without Paired Data.
Placeholder notes on when representations of the world in learning systems converge, in some sense, to “universal” or “Platonic” representations. It seems that such systems, including LLMs and most neural networks, do have some kind of internal model of the outside world; what needs do such models share?
Are the semantics of embeddings or other internal representations in different models or modalities represented in a common “Platonic” space that’s universal in some sense (Huh et al. 2024b)? If so, should we care?
I confess I struggle to make this concrete enough to produce testable hypotheses; that’s probably because I haven’t read enough of the literature. Here’s something that might be progress:
- Jack Morris “Excited to finally share on arXiv what we’ve known for a while now: All Embedding Models Learn The Same Thing. Embeddings from different models are so similar that we can map between them based on structure alone — without any paired data. Feels like magic, but it’s real:🧵” (Jha et al. 2025)
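To make "mapping between embedding spaces" concrete, here is a minimal sketch of the much easier paired version of the problem: orthogonal Procrustes alignment. This is not the method of Jha et al. (2025), whose point is that no paired data is needed. It only illustrates what it means for two spaces to share structure up to a rotation. The data is synthetic, and the names `orthogonal_procrustes`, `X`, `Y` and `Q` are illustrative assumptions.

```python
import numpy as np


def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix R minimising ||X @ R - Y||_F."""
    # Classical solution: if X^T Y = U S V^T, then R = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


rng = np.random.default_rng(0)
n, d = 500, 16

# Synthetic stand-in for "model A" embeddings of n items.
X = rng.normal(size=(n, d))

# Hidden rotation: "model B" sees the same geometry in a different basis.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = X @ Q + 0.01 * rng.normal(size=(n, d))

# Recover the map from paired rows (the supervision Jha et al. do without).
R = orthogonal_procrustes(X, Y)
rel_err = np.linalg.norm(X @ R - Y) / np.linalg.norm(Y)
print(f"relative alignment error: {rel_err:.4f}")
```

If the two spaces really differ only by a rotation plus small noise, the relative error should be close to the noise level. A testable version of the "Platonic" claim is then that real embedding models behave somewhat like this toy case.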
My friend Pascal Hirsch mentions the hypothesis that this should also apply to the embeddings people have in their brains, referring to this fascinating recent Google paper (Goldstein et al. 2025):
“[…] neural activity in the human brain aligns linearly with the internal contextual embeddings of speech and language within LLMs as they process everyday conversations.”
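One common way to operationalise "aligns linearly" is a linear encoding model: fit a regularised linear map from embeddings to recorded signals, then score its predictions on held-out data. The sketch below does that with ridge regression on purely synthetic data. It is an assumption about what such a test can look like, not the pipeline of Goldstein et al. (2025), and `ridge_fit`, `E`, `N` and the sizes are hypothetical.

```python
import numpy as np


def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression weights mapping X to Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


rng = np.random.default_rng(1)
n_words, d, n_electrodes = 400, 20, 5

# Synthetic stand-ins: E plays the role of per-word LLM embeddings,
# N the role of per-word neural recordings (a linear readout plus noise).
E = rng.normal(size=(n_words, d))
W_true = rng.normal(size=(d, n_electrodes))
N = E @ W_true + 2.0 * rng.normal(size=(n_words, n_electrodes))

# Fit on the first 300 words, evaluate on the remaining 100.
train, test = slice(0, 300), slice(300, 400)
W = ridge_fit(E[train], N[train], lam=10.0)
pred = E[test] @ W

# Held-out correlation per electrode is the usual "encoding score".
scores = [
    np.corrcoef(pred[:, j], N[test][:, j])[0, 1] for j in range(n_electrodes)
]
print("held-out correlations:", np.round(scores, 2))
```

High held-out correlations here only show that the toy data was built to be linear. On real recordings, the interesting question is how much of the signal such a linear map explains.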
