I just ran into this area while trying to invent something similar myself, only to find I’m years too late. It is an interesting framework for relaxed or approximate modelling of causal interventions; it seems to formalize coarse-graining for causal models.
We suspect that the notorious causal-inference capabilities of LLMs might be built out of such things, or at least understood in terms of them.
1 Causality in hierarchical systems
In the hierarchical setting, we want to consider a system made of micro- and macro-states; we wonder when many microstates can be abstracted into a few macrostates in a causal sense.
A. Geiger, Ibeling, et al. (2024) summarises:
In some ways, studying modern deep learning models is like studying the weather or an economy: they involve large numbers of densely connected ‘microvariables’ with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables (‘macrovariables’). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analysing a system at multiple levels of detail (Chalupka, Eberhardt, and Perona 2017; Rubenstein et al. 2017; Beckers and Halpern 2019, 2019; Rischel and Weichwald 2021; Massidda et al. 2023). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyse weather patterns (Chalupka et al. 2016), human brains (J. Dubois, Oya, et al. 2020; J. Dubois, Eberhardt, et al. 2020), and deep learning models (Chalupka, Perona, and Eberhardt 2015; A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2021; Hu and Tian 2022; A. Geiger, Wu, et al. 2024; Z. Wu et al. 2023).
Imagine trying to understand a bustling city by tracking everyone’s movement. This “micro-level” perspective is overwhelming. Instead, we might analyse neighbourhoods (macro-level) to identify traffic patterns or economic activity. In physics, we call this coarse-graining. Causal abstraction asks a more statistical question: When does a simplified high-level model (macrovariables) accurately represent a detailed low-level system (microvariables)?
For example, a neural network classifies images using millions of neurons (microvariables). A causal abstraction might represent this as a high-level flowchart: Input Image → Detect Edges → Identify Shapes → Classify Object
This flowchart is a macrovariable model that abstracts away neuronal details while preserving the “causal story” of how the network works.
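To make the picture concrete, here is a minimal toy sketch, entirely my own construction rather than anything from the papers: a handful of “neurons” play the microvariables, an aggregated “evidence” score plays the macrovariable, and we check that intervening at the micro level and then abstracting agrees with abstracting and then intervening.

```python
# Toy sketch (my own construction): four "neurons" are the microvariables,
# one aggregated "evidence" score is the macrovariable, and we check that
# (intervene at the micro level, then abstract) agrees with
# (abstract, then intervene at the macro level).
import numpy as np

rng = np.random.default_rng(0)

# --- micro-level model --------------------------------------------------
def micro_forward(pixels, clamp_neurons=None):
    """Four neurons summarise the input; a threshold gives the class."""
    w = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0], [0.2, 0.2]])
    neurons = w @ pixels                       # microvariables
    if clamp_neurons is not None:              # hard intervention on the neurons
        neurons = clamp_neurons
    return neurons, float(neurons.sum() > 0)   # class decision

# --- macro-level model and abstraction map tau ---------------------------
def tau(neurons):
    """Abstraction map: collapse the four neurons into one evidence score."""
    return neurons.sum()

def macro_forward(evidence):
    """High-level flowchart: Evidence -> Class."""
    return float(evidence > 0)

# --- consistency check ----------------------------------------------------
pixels = rng.normal(size=2)
clamped = np.array([2.0, -0.5, 0.3, 0.1])      # a micro-level intervention

_, y_micro = micro_forward(pixels, clamp_neurons=clamped)
y_macro = macro_forward(tau(clamped))          # the corresponding macro intervention

print(y_micro == y_macro)                      # True: the abstraction commutes here
```

The interesting cases, of course, are the ones where this commutation holds only approximately, or only for some restricted class of interventions.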
Easy to say, harder to formalize.
Chalupka, Eberhardt, and Perona (2016) explains this idea with equivalence classes of variable states that induce a partition on the space of possible causal models. The fundamental object from this hierarchical perspective is a causal partition. Chalupka, Eberhardt, and Perona (2017) constructs some contrived worked examples. They work in terms of discrete variables (or discretisations of continuous ones) to make it easy to discuss the measure of the sets implicated in the partition. I am going to leave this work aside for now, because it is a nice intuition pump but way too clunky for what I need.
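Before setting it aside, here is a tiny concrete version of the causal-partition idea (my own toy example, not one of theirs): micro-states of a cause X are grouped into classes by their interventional effect on Y, and the macrovariable is simply the class label.

```python
# Toy causal partition: micro-states of X with identical interventional
# effect on Y fall into the same macro-class. The numbers are made up.
from collections import defaultdict

# P(Y = 1 | do(X = x)) for each micro-state x of the cause.
effect_of = {
    "x0": 0.1, "x1": 0.1,   # these two micro-states act identically on Y ...
    "x2": 0.7, "x3": 0.7,   # ... as do these two ...
    "x4": 0.9,              # ... and this one sits in a class of its own.
}

classes = defaultdict(list)
for x, p in effect_of.items():
    classes[p].append(x)    # group micro-states with equal interventional effect

# The macrovariable (the "causal class" of X) takes one value per group.
for macro_value, (p, members) in enumerate(sorted(classes.items())):
    print(f"macro state C={macro_value}: micro-states {members}, P(Y=1|do(X))={p}")
```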
2 Non-hierarchical models
A. Geiger, Ibeling, et al. (2024) generalises this further, considering equivalence classes over “messy” structures where, for example, microvariables are not neatly partitioned into macrovariables but may be involved in many macrovariables at once. The authors would also like to handle systems with loops, and ultimately argue that causal abstraction is a unifying language for causality in machine learning, in particular for mechanistic interpretability and ablation studies.
A shortcoming of existing theory is that macrovariables cannot be represented by quantities formed from overlapping sets of microvariables. Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high-level concepts are thought to be represented by modular ‘features’ distributed across individual neural activations […].
Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new (‘translated’) variable space. We elucidate this situation, generalising earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).
2.1 Distributed alignment search
e.g. (Abraham et al. 2022; Arora, Jurafsky, and Potts 2024; A. Geiger, Wu, et al. 2024; Tigges et al. 2023)
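The section title deserves at least a sketch. As I currently understand distributed alignment search (caveat: this is my gloss, not anything lifted from those papers), the core operation is an interchange intervention in a learned linear subspace of a hidden representation, rather than on individual neurons: a rotation R is optimised so that swapping that subspace between a source run and a base run makes the network behave like the high-level causal model’s counterfactuals. Here is the swap operation itself, with a fixed random rotation standing in for the learned one:

```python
# Rough sketch of the interchange-in-a-rotated-subspace operation.
# In DAS proper the rotation R is learned by gradient descent; here it is
# a fixed random orthogonal matrix, purely to illustrate the mechanics.
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 2                                    # hidden width, subspace size

R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # stand-in for the learned rotation

def interchange(h_base, h_source, R, k):
    """Replace the first k rotated coordinates of h_base with those of h_source."""
    zb, zs = R @ h_base, R @ h_source
    zb[:k] = zs[:k]
    return R.T @ zb                            # rotate back to neuron space

h_base = rng.normal(size=d)                    # activation from the "base" input
h_source = rng.normal(size=d)                  # activation from the "source" input
h_new = interchange(h_base, h_source, R, k)

# The swapped content lives in a *distributed* direction, not in any one neuron:
print(np.allclose(R @ h_new, np.r_[(R @ h_source)[:k], (R @ h_base)[k:]]))  # True
```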
3 Interventions
To validate abstractions, we use interventions, i.e. controlled changes to a system. These seem to come in increasing levels of generality (a toy sketch follows the list):
- Hard interventions: force variables to specific values (e.g., clamping a neuron’s activation). These are the classic Judea-Pearl-style interventions made famous by the do-calculus.
- Soft interventions: rather than setting a variable to a fixed value as in hard interventions, we replace its mechanism, e.g. assign it a new distribution that may still depend on its parents. This idea seems simple and intuitive in Correa and Bareinboim (2020); the presentation in A. Geiger, Ibeling, et al. (2024) is a little more opaque to me.
- Interventionals: generalised transformations of mechanisms, i.e. maps from old mechanisms to new mechanisms (e.g., redistributing a concept across multiple neurons). This is the new thing in A. Geiger, Ibeling, et al. (2024), and I have no intuition about it yet at all.
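Here is a toy rendering of the three levels in code, using my own naming and treating a mechanism as a function from parent values to a value; a sketch of the idea, not the paper’s formalism.

```python
# Toy sketch of hard interventions, soft interventions, and interventionals.
# A "mechanism" is just a function from parent values to a value.
import random

def mech(parents):                    # original mechanism: X := parent + noise
    return parents["pa"] + random.gauss(0, 0.1)

# Hard intervention: ignore the parents entirely, clamp to a constant.
def do_hard(value):
    return lambda parents: value

# Soft intervention: substitute a new (here stochastic) mechanism that may
# still depend on the parents.
def do_soft(new_mech):
    return new_mech

# Interventional: a map from the OLD mechanism to a NEW mechanism, so the
# edit can reuse (e.g. rescale or redistribute) what the old mechanism computed.
def interventional(transform):
    return lambda old_mech: (lambda parents: transform(old_mech(parents), parents))

clamp_to_3 = do_hard(3.0)
noisier    = do_soft(lambda parents: parents["pa"] + random.gauss(0, 1.0))
halved     = interventional(lambda old_value, parents: 0.5 * old_value)(mech)

pa = {"pa": 2.0}
print(clamp_to_3(pa), noisier(pa), halved(pa))
```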
4 LLM summary
Here be dragons! I got perplexity to summarise all the strands. I guarantee even less than usual about the correctness of this summary.
Recent advances in causal abstraction theory have provided rigorous mathematical frameworks for analysing systems at multiple levels of granularity while preserving causal structure. This report synthesises the core contributions across key papers in this domain, examining both theoretical foundations and practical applications.
4.1 Formal Foundations of Causal Abstraction
The foundational work of Beckers and Halpern (2019) established τ-abstractions as a precise mechanism for mapping between causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their framework introduced:
- A three-component abstraction tuple (τ, ω, σ) mapping variables, interventions, and outcomes between models
- Compositionality guarantees ensuring abstraction hierarchies maintain causal consistency
- Distinction between exact vs approximate abstractions through error bounds (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
Rubenstein et al. (2017) had earlier formalised the notion of exact transformations between structural causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their key insight was to phrase intervention preservation as a commutative-diagram requirement: for every allowed low-level intervention $i$,

$$\tau_{*}\left(P^{\mathrm{do}(i)}_{\mathcal{M}_{L}}\right) = P^{\mathrm{do}(\omega(i))}_{\mathcal{M}_{H}},$$

where ω maps low-level interventions to high-level interventions and $\tau_{*}$ pushes the intervened low-level distribution forward to the high-level variable space.
Rischel and Weichwald (2021) advanced compositionality through category theory, proving that abstraction errors compose subadditively,

$$e(\beta \circ \alpha) \le e(\alpha) + e(\beta)$$

for composable abstractions α and β, using enriched category structures (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020). Their framework introduced KL-divergence based error metrics while maintaining causal semantics across transformations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023).
4.2 Approximation and Error Quantification
Beckers and Halpern (2019) introduced formal error metrics for approximate abstractions through:
- Intervention-specific divergence measures
- Worst-case error bounds across allowed interventions
- Probabilistic extensions handling observational uncertainty (Beckers and Halpern 2020; Shin and Gerstenberg 2023).
This was operationalised through error lattices where approximation quality could be analysed at different granularities (Beckers and Halpern 2020). Massidda et al. (2023) extended this to soft interventions, proving uniqueness conditions for intervention maps ω under mechanism preservation constraints (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017).
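For intuition, here is a toy numerical version of the intervention-specific and worst-case error metrics listed above; the two models and the set of allowed interventions are entirely made up.

```python
# Toy worst-case abstraction error: compare low-level and high-level
# predictions under each allowed intervention, report the largest discrepancy.
interventions = [0.0, 1.0, 2.5, 5.0]           # allowed hard interventions on X

def low_level(x):                              # fine-grained model's prediction for Y
    return 2.0 * x + 0.05 * x**2               # slightly non-linear

def high_level(x):                             # abstract model: Y = 2X
    return 2.0 * x

errors = {x: abs(low_level(x) - high_level(x)) for x in interventions}
worst_case = max(errors.values())

print(errors)          # intervention-specific divergences
print(worst_case)      # worst-case error bound across the allowed interventions
```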
Key theoretical results include:
- Compositionality of abstraction errors (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023)
- Explicit construction of ω maps via quotient spaces (Massidda et al. 2023)
- Duality between variable clustering and intervention preservation (Beckers and Halpern 2020; Beckers and Halpern 2019)
4.3 Applications Across Domains
4.3.1 Neuroscience
(J. Dubois, Oya, et al. 2020; D. Dubois and Prade 2020; J. Dubois, Eberhardt, et al. 2020) apply causal abstraction to neural population dynamics, demonstrating:
- Valid abstractions from spiking models to mean-field approximations
- Emergent causal patterns in coarse-grained neural representations
- Intervention preservation across biological scales (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017)
4.3.2 Climate Science
Chalupka et al. (2016) showed how El Niño models could be abstracted from high-dimensional wind/temperature data through:
- Variable clustering preserving causal connectivity
- Intervention consistency for climate predictions
- Validation through hurricane trajectory simulations (Beckers and Halpern 2020)
4.3.3 Deep Learning
(A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2022; A. R. Geiger 2023; A. Geiger, Wu, et al. 2024) developed interchange intervention techniques for analysing neural networks:
- Alignment between model layers and symbolic reasoning steps
- Causal faithfulness metrics for transformer architectures
- Applications to NLP and computer vision models (A. Geiger et al. 2021; Chalupka, Eberhardt, and Perona 2017)
Their ANTRA framework enabled testing whether neural networks implement known algorithmic structures through intervention graphs (A. Geiger et al. 2021).
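For intuition about what such an interchange intervention tests, here is a toy example of my own: a tiny “network” that computes (a + b) * c, checked against the two-step high-level algorithm S = a + b, Y = S * c.

```python
# Toy interchange intervention: does the "network" implement the two-step
# algorithm S = a + b, Y = S * c? Swap the hidden value hypothesised to
# encode S from a source run into a base run, and compare the output with
# the high-level model's counterfactual.
def low_hidden(a, b):          # hypothesised neural correlate of S
    return a + b

def low_output(hidden, c):
    return hidden * c

base, source = (1, 2, 3), (10, 20, 3)

# Interchange: hidden value from the source run, everything else from the base run.
swapped = low_output(low_hidden(*source[:2]), base[2])

# High-level counterfactual: set S to its source value, keep base's c.
expected = (source[0] + source[1]) * base[2]

print(swapped == expected)     # True: this alignment passes the intervention test
```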
5 Methodological Themes
- Intervention-Centric Formalisation: All approaches centre intervention preservation as the core abstraction criterion (Beckers and Halpern 2020; Beckers and Halpern 2019; Massidda et al. 2023)
- Compositionality: Hierarchical error propagation and transform composition are fundamental requirements (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Approximation Metrics: KL-divergence, Wasserstein distance, and intervention-specific losses dominate (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
- Algebraic Structures: Category theory and lattice frameworks provide mathematical foundations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Empirical Validation: Applications demonstrate abstraction viability through simulation and model testing (A. Geiger et al. 2021; Massidda et al. 2023)