Placeholder for notes on what kind of world models reside in neural nets.
1 Representational similarity
Are the semantics of embeddings or other internal representations in different models or modalities represented in a common “Platonic” space which is universal in some sense (Huh et al. 2024b)? If so, should we care?
I confess I struggle to see how to make this concrete enough to produce testable hypotheses, but that’s probably because I haven’t read enough of the literature. Here’s something that might have made progress:
- Jack Morris: “excited to finally share on arxiv what we’ve known for a while now: All Embedding Models Learn The Same Thing. embeddings from different models are SO similar that we can map between them based on structure alone. without any paired data. feels like magic, but it’s real” (Jha et al. 2025)
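To make “same structure” at least somewhat testable, one common move (weaker than the unpaired translation in Jha et al. 2025, which needs no correspondence at all) is to embed the same inputs with two models and score the two representation matrices with a similarity index such as linear centered kernel alignment (CKA). A minimal sketch, with made-up embedding matrices standing in for real model outputs:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representation
    matrices with matching rows (n_samples x dim); higher values mean
    the two spaces share more structure, up to rotation and scaling."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Toy check: a rotated copy of the same embeddings scores ~1.0,
# while unrelated random embeddings score much lower (chance level).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                    # "model A" embeddings
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))    # random orthogonal rotation
print(linear_cka(X, X @ Q))                       # ~1.0
print(linear_cka(X, rng.normal(size=(500, 32))))  # small
```

A high cross-model CKA on matched inputs is at least consistent with a shared “Platonic” structure, though it does not by itself establish the stronger unpaired-mapping claim.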
My friend Pascal Hirsch mentions the hypothesis that this should also apply to the embeddings people have in their brains, referring to this fascinating recent Google paper (Goldstein et al. 2025):
[…] neural activity in the human brain aligns linearly with the internal contextual embeddings of speech and language within LLMs as they process everyday conversations.
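“Aligns linearly” in that literature typically means an encoding model: fit a regularised linear map from the LLM’s contextual embeddings to the recorded neural signal and score it on held-out data. A minimal sketch with synthetic arrays standing in for real recordings (the dimensions, noise level, and variable names are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Stand-ins: per-word contextual embeddings from an LLM and the
# corresponding neural signal (e.g. one electrode's activity),
# both aligned to the same word onsets. Real data replaces these.
rng = np.random.default_rng(0)
n_words, dim = 2000, 768
embeddings = rng.normal(size=(n_words, dim))
true_weights = rng.normal(size=dim)
neural = embeddings @ true_weights + rng.normal(scale=5.0, size=n_words)

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, neural, test_size=0.2, random_state=0
)

# Ridge regression with cross-validated regularisation strength;
# the held-out correlation is the usual "linear alignment" score.
model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X_tr, y_tr)
r = np.corrcoef(model.predict(X_te), y_te)[0, 1]
print(f"held-out encoding correlation: {r:.2f}")
```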
2 Causal world models
“World model” is somehow a different concept than “representation”; I am not precisely sure how, but from skimming it seems like it might be easier to ground in causal abstraction and causal inference.
See causal abstraction for a discussion of the idea that the latent space of a neural net can discover causal representations of the world.
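The standard empirical handle here is the interchange intervention: patch an internal activation from a counterfactual run into a base run and check whether the output shifts the way the hypothesised high-level causal model predicts. A toy sketch, with a hand-built network whose hidden unit encodes AND(a, b) by construction, so the test is guaranteed to pass; in a real network this is exactly what you would be trying to find out:

```python
import itertools

def step(x):
    """Heaviside threshold, the nonlinearity of the toy network."""
    return 1.0 if x > 0 else 0.0

def low_level(inp, patch_h=None):
    """A tiny hand-built network computing (a AND b) OR c. Its single
    hidden unit h encodes a AND b by construction; `patch_h` overwrites
    it, mimicking an activation patch in a real network."""
    a, b, c = inp
    h = step(a + b - 1.5) if patch_h is None else patch_h
    return step(h + c - 0.5)

def high_level(inp, patch_and=None):
    """The hypothesised causal model, with intermediate variable AND(a, b)."""
    a, b, c = inp
    and_ab = bool(a and b) if patch_and is None else patch_and
    return 1.0 if (and_ab or c) else 0.0

# Interchange interventions: for every (base, source) input pair, patch
# the hidden unit with its value on the source input and check that the
# low-level output matches the high-level model with AND patched likewise.
inputs = list(itertools.product([0, 1], repeat=3))
consistent = all(
    low_level(base, patch_h=step(src[0] + src[1] - 1.5))
    == high_level(base, patch_and=bool(src[0] and src[1]))
    for base in inputs
    for src in inputs
)
print("hidden unit is a valid abstraction of AND(a, b):", consistent)
```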
3 Creating worlds to model
Rosas, Boyd, and Baltieri (2025) makes a pleasing connection to the simulation hypothesis:
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic “brain in a vat” thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade-off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade-off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.
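To make the computational-mechanics angle a little more concrete: the “minimise memory requirements” procedure is in the spirit of the causal-state (ε-machine) construction, which merges histories that predict the same future. A toy sketch, with an invented process and tolerance, not anything from the paper itself:

```python
from collections import defaultdict
import random

def golden_mean_sequence(n, seed=0):
    """Toy stochastic process: emit 0 or 1 with equal probability,
    except that a 1 is always followed by a 0 (no two 1s in a row)."""
    rng = random.Random(seed)
    seq, prev = [], 0
    for _ in range(n):
        x = 0 if prev == 1 else rng.randint(0, 1)
        seq.append(x)
        prev = x
    return seq

def causal_state_partition(seq, k=2, tol=0.05):
    """Group length-k histories whose estimated next-symbol distributions
    agree within `tol` -- a crude empirical version of the causal-state
    construction, which minimises the memory needed for prediction."""
    counts = defaultdict(lambda: [0, 0])  # history -> [count next=0, count next=1]
    for i in range(len(seq) - k):
        h = tuple(seq[i:i + k])
        counts[h][seq[i + k]] += 1
    p_one = {h: c[1] / sum(c) for h, c in counts.items()}
    # Sort by predicted P(next=1) and merge histories whose predictions agree.
    ordered = sorted(p_one, key=p_one.get)
    states, current = [], [ordered[0]]
    for h in ordered[1:]:
        if abs(p_one[h] - p_one[current[-1]]) <= tol:
            current.append(h)
        else:
            states.append(current)
            current = [h]
    states.append(current)
    return states

seq = golden_mean_sequence(100_000)
for state in causal_state_partition(seq):
    print(state)
# The three observable length-2 histories collapse into two predictive
# states, {(0, 1)} and {(0, 0), (1, 0)}: the coarse-grained model needs
# less memory without losing any predictive power.
```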