World models arising in foundation models.
2024-12-20 — 2026-04-09
Wherein the Convergence of Internal Representations Across Differing Model Architectures Is Examined, and a Fundamental Trade-Off Between Efficiency and Interpretability in World Model Construction Is Revealed.
Placeholder notes on what kinds of world models sit inside large neural nets. It seems they do have some kind of internal model of the outside world; in practice, what kind of thing is it?
1 What does it even mean to have a world model?
Surprisingly hard to pin down. 🚧TODO🚧
2 Representational similarity
3 Causal world models
World models are a somewhat different concept from “representations”; I’m not sure precisely how, but from skimming the literature, world models seem easier to ground in causal abstraction and causal inference.
Interesting accounts of how models learn world models include (Halpern and Piermont 2024; Hu and Shu 2023; Richens and Everitt 2024; Richens et al. 2025; Rosas, Boyd, and Baltieri 2025).
See causal abstraction for a discussion of the idea that a neural net’s latent space can end up discovering causal representations of the world in some specific approximate sense.
4 Creating worlds to model
Rosas, Boyd, and Baltieri (2025) make a pleasing connection to the simulation hypothesis:
Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic ‘brain in a vat’ thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade-off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade-off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.
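To make the “minimise memory requirements” idea concrete for myself, here is a toy sketch. It is not the construction from the paper; I am assuming a much simpler setting, a deterministic finite-state world where each state emits one observation and each action moves to one next state, and I use classical partition refinement (Moore-style automaton minimisation) to merge states that no sequence of actions can tell apart. The function name `minimise_world` and the example world are hypothetical, made up for illustration.

```python
from typing import Dict, Hashable, Tuple

State = Hashable
Action = Hashable


def minimise_world(
    transitions: Dict[Tuple[State, Action], State],
    observations: Dict[State, Hashable],
) -> Dict[State, int]:
    """Map each state to the index of its equivalence block.

    Assumes a deterministic world with a transition defined for every
    (state, action) pair. Two states share a block when they emit the same
    observation and, under every action, move to states in the same block.
    """
    states = list(observations)
    # Collect actions in first-seen order so the result is reproducible.
    actions = list(dict.fromkeys(a for (_, a) in transitions))
    # Initial partition: group states by what they emit.
    block: Dict[State, Hashable] = dict(observations)
    while True:
        labels: Dict[tuple, int] = {}
        new_block: Dict[State, int] = {}
        for s in states:
            # Signature: current block plus the blocks reached by each action.
            signature = (
                block[s],
                tuple(block[transitions[(s, a)]] for a in actions),
            )
            new_block[s] = labels.setdefault(signature, len(labels))
        # Each pass can only split blocks, so an unchanged count means stable.
        if len(set(new_block.values())) == len(set(block.values())):
            return new_block
        block = new_block


# Hypothetical four-state world: two "dark" rooms and two "light" rooms.
observations = {"s0": "dark", "s1": "dark", "s2": "light", "s3": "light"}
transitions = {
    ("s0", "stay"): "s0", ("s0", "go"): "s2",
    ("s1", "stay"): "s1", ("s1", "go"): "s3",
    ("s2", "stay"): "s2", ("s2", "go"): "s0",
    ("s3", "stay"): "s3", ("s3", "go"): "s1",
}
blocks = minimise_world(transitions, observations)
n_blocks = len(set(blocks.values()))
```

In this example the agent cannot distinguish `s0` from `s1` or `s2` from `s3` by anything it does, so a two-state world would serve just as well for evaluation. The merged model is smaller, but the block labels no longer correspond to the original rooms, which is a very loose analogue of the efficiency versus interpretability tension the abstract describes.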
