Causal abstraction
Coarse-graining for causal models
2025-02-24 — 2025-12-20
Wherein causal abstraction is presented as a formalism for mapping macro interventions to micro mechanisms, via learnable translators and generalized interventions called interventionals, and intervention equivalence is treated as a partition over perturbations.
I ran into this field of research while trying to invent it, then realized I was years too late.
Causal “abstraction”-type approaches extend or replace traditional causal inference with relaxed or approximate causal modelling of interventions.
As such, we can probably think of them as formalizing coarse-graining for causal models.
We suspect that the notorious capacity for causal inference in LLMs might be built on these ideas, or at least understood in those terms.
Q: How do we bridge this to disentangled representation learning?
1 Causality in hierarchical systems
In the hierarchical setting, we consider a system made up of micro- and macro-states; we ask when many micro-states can be abstracted into a few macro-states in a causal sense.
A. Geiger, Ibeling, et al. (2024) summarizes:
In some ways, studying modern deep learning models is like studying the weather or an economy: they involve large numbers of densely connected ‘microvariables’ with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables (‘macrovariables’). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analysing a system at multiple levels of detail (Chalupka, Eberhardt, and Perona 2017; Rubenstein et al. 2017; Beckers and Halpern 2019, 2019; Rischel and Weichwald 2021; Massidda et al. 2023). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyse weather patterns (Chalupka et al. 2016), human brains (J. Dubois, Oya, et al. 2020; J. Dubois, Eberhardt, et al. 2020), and deep learning models (Chalupka, Perona, and Eberhardt 2015; A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2021; Hu and Tian 2022; A. Geiger, Wu, et al. 2024; Z. Wu et al. 2023).
Imagine trying to understand a bustling city by tracking everyone’s movement. This “micro-level” perspective is overwhelming. Instead, we might analyze neighbourhoods (macro-level) to identify traffic patterns or economic activity. In physics, we call this coarse-graining. Causal abstraction asks the analogous question formally: when does a simplified high-level model (macrovariables) accurately represent a detailed low-level system (microvariables), even under intervention?
For example, a neural network classifies images using millions of neurons (microvariables). A causal abstraction might represent this as a high-level flowchart:

Input Image → Detect Edges → Identify Shapes → Classify Object

This flowchart is a macrovariable model that abstracts away neuronal details while preserving the “causal story” of how the network works.
Easy to say, harder to formalize.
Chalupka, Eberhardt, and Perona (2016) explain this idea via equivalence classes of variable states, which induce a partition on the space of possible causal models. The fundamental object from this hierarchical perspective is a causal partition. Chalupka, Eberhardt, and Perona (2017) construct some contrived examples. They work with discrete variables (or discretize continuous ones) to make it easy to discuss the measures of the sets in the partition. I’ll leave this work aside for now; it’s a nice intuition pump but too clunky for what I need.
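To make the partition idea concrete, here is a tiny discrete toy of my own (not one of their examples): micro-states of a cause are lumped together exactly when they induce the same interventional distribution over the effect, and the resulting classes are the macro-level states.

```python
# Toy causal partition in the spirit of Chalupka, Eberhardt & Perona
# (my construction): micro-states of a cause are equivalent iff they
# induce the same interventional distribution over the effect.
from itertools import product
from collections import defaultdict

# Micro cause: a pair of binary "pixels"; effect E is binary.
micro_states = list(product([0, 1], repeat=2))

def p_effect_given_do(micro):
    """P(E = 1 | do(C = micro)); here only the pixel sum matters."""
    return {0: 0.1, 1: 0.7, 2: 0.7}[sum(micro)]

# Group micro-states by the interventional distribution they induce.
classes = defaultdict(list)
for s in micro_states:
    classes[p_effect_given_do(s)].append(s)

for p, members in classes.items():
    print(f"P(E=1 | do(C in {members})) = {p}")
# (0,1), (1,0) and (1,1) collapse into one macro-state; (0,0) is the other.
```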
2 Non-hierarchical models
A. Geiger, Ibeling, et al. (2024) generalize in an interesting direction; they consider equivalence classes over “messy” structures where, for example, microvariables aren’t neatly partitioned into macrovariables and may participate in multiple macrovariables. They also want to handle systems with loops. In the end, they argue it’s a unifying language for causality in machine learning, particularly for mechanistic interpretability and ablation studies.
A shortcoming of existing theory is that macrovariables cannot be represented by quantities formed from overlapping sets of microvariables. Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high-level concepts are thought to be represented by modular ‘features’ distributed across individual neural activations […].
Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new (‘translated’) variable space. We elucidate this situation, generalising earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).
2.1 Distributed alignment search
e.g. (Abraham et al. 2022; Arora, Jurafsky, and Potts 2024; A. Geiger, Wu, et al. 2024; Tigges et al. 2023)
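As I understand DAS, one learns an orthogonal change of basis for a layer’s activations and performs interchange interventions on a few of the rotated coordinates, so the intervened “feature” is distributed across overlapping neurons. A minimal sketch of the interchange operation itself (names and shapes are mine; the real method trains the rotation against an alignment objective):

```python
# Stripped-down distributed interchange intervention in the spirit of DAS.
import numpy as np

def distributed_interchange(h_base, h_source, R, dims):
    """Swap the `dims` coordinates of the rotated representation of h_base
    with those of h_source, then rotate back to neuron coordinates."""
    z_base, z_source = R @ h_base, R @ h_source
    z_base[dims] = z_source[dims]      # interchange the aligned subspace
    return R.T @ z_base                # R is orthonormal, so R.T undoes it

rng = np.random.default_rng(0)
d = 8
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # stand-in for the learned rotation
h_base, h_source = rng.normal(size=d), rng.normal(size=d)
h_patched = distributed_interchange(h_base, h_source, R, dims=[0, 1])
```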
3 Interventions
To validate abstractions, we use interventions: controlled changes to a system. Interventions themselves come at several levels of generality.
- Hard interventions: Force variables to specific values (e.g., clamp a neuron’s activation); these are the classic Judea-Pearl-style interventions popularized by the do-calculus.
- Soft interventions: Rather than setting a variable to a single value as in a hard intervention, we assign it a new distribution. I found this idea simple and intuitive in Correa and Bareinboim (2020); the presentation in A. Geiger, Ibeling, et al. (2024) was a little more opaque.
- Interventionals generalize both of these: they are transformations of mechanisms (e.g., redistributing a concept across multiple neurons). This is a new idea in A. Geiger, Ibeling, et al. (2024), and I don’t have an intuition about it yet; see the sketch after this list.
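A minimal sketch of the three levels, treating a mechanism as a Python function from parent values and exogenous noise to a variable’s value (notation mine, not the paper’s):

```python
import random

def mech_y(x, u):
    """Original mechanism for Y given parent X and exogenous noise u."""
    return 2.0 * x + u

# Hard intervention do(Y = 3): replace the mechanism with a constant.
def hard(x, u):
    return 3.0

# Soft intervention: replace the mechanism with a new conditional law,
# here Y ~ Normal(x, 0.1), still ignoring the original mechanism.
def soft(x, u):
    return random.gauss(x, 0.1)

# Interventional: a map from OLD mechanisms to NEW mechanisms, which can
# read and reuse the mechanism it replaces (here, damping its output).
def damp(old_mechanism, factor=0.5):
    return lambda x, u: factor * old_mechanism(x, u)

mech_y_damped = damp(mech_y)
```

The difference that matters is that `damp` takes the old mechanism as an argument; hard and soft interventions simply discard it.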
4 Generative Intervention Models
Work from Schölkopf’s lab looks interesting. In Generative Intervention Models (GIMs) (Schneider et al. 2025), the authors build something that might be the missing operational piece between “micro” causal models and “macro” perturbations. The model is
\[
p(x;\gamma)\;=\;\int p\bigl(x\mid I;M\bigr)\;p\bigl(I\mid \gamma,\phi\bigr)\,dI
\]
- \(M\) is a structural causal model (graph \(G\), mechanisms \(\theta\)).
- \(I\) is an atomic intervention (which variables we hit and how their mechanisms change).
- \(\gamma\) are observable features of the perturbation (e.g. a drug and dose, or an edit/ablation spec).
- The learned translator \(p(I\mid \gamma,\phi)\) is parameterised by two functions: \(g_\phi(\gamma)\) predicts intervention targets, and \(h_\phi(I,\gamma)\) predicts the interventional parameters. The authors train \(M\) and \(\phi\) jointly, then approximate the posterior predictive \(p(x\mid D;\gamma)\approx p(x\mid M^*,\phi^*;\gamma)\). A toy sketch of this sampling story follows.
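A runnable toy mirroring that factorisation (everything here is my own construction: the two-variable chain, the `drug`/`dose` descriptor, and the deterministic stand-ins for \(g_\phi\) and \(h_\phi\), which in the real model parameterise a distribution over interventions rather than a point estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-variable chain X1 -> X2 standing in for the SCM M.
base_mechanisms = {
    "X1": lambda parent, u: u,
    "X2": lambda parent, u: 1.5 * parent + u,
}

def scm_sample(mechanisms):
    x1 = mechanisms["X1"](None, rng.normal())
    x2 = mechanisms["X2"](x1, rng.normal())
    return np.array([x1, x2])

def g_phi(gamma):
    """Hypothetical target predictor: which variable the perturbation hits."""
    return "X2" if gamma["drug"] == "B" else "X1"

def h_phi(target, gamma):
    """Hypothetical parameter predictor: a dose-proportional mean shift."""
    old = base_mechanisms[target]
    return lambda parent, u: old(parent, u) + gamma["dose"]

def sample_x(gamma):
    """One draw from p(x; gamma): pick I from the translator, then x ~ p(x | I; M)."""
    target = g_phi(gamma)
    mechanisms = dict(base_mechanisms)
    mechanisms[target] = h_phi(target, gamma)   # swap in the intervened mechanism
    return scm_sample(mechanisms)

xs = np.stack([sample_x({"drug": "B", "dose": 2.0}) for _ in range(1000)])
```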
Mapping onto causal abstraction:
- Macro→micro translator. \(p(I\mid\gamma,\phi)\) acts like a learnable \(\omega\)-map from high-level “knobs” to low-level interventions—exactly what abstraction frameworks need but rarely get from data.
- Coarse-graining of interventions. Two macros \(\gamma,\gamma'\) are abstractly equivalent if they induce (approximately) the same interventional distribution on \(M\). That’s a causal partition over perturbations, not just states (a toy check follows this list).
- Pluggable semantics. Because \(p(x\mid I;M)\) is defined by swapping mechanisms, we can recover hard or soft interventions and—at least in spirit—move towards the general “interventionals”.
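Continuing the toy above, a crude Monte Carlo version of that equivalence (the moment-matching test and tolerance are arbitrary choices of mine; a real check would use a proper two-sample test or divergence estimate):

```python
def equivalent(gamma_a, gamma_b, n=5000, tol=0.1):
    """Do two macro descriptors induce (roughly) the same interventional
    distribution on M? If so, they sit in the same cell of the partition."""
    xa = np.stack([sample_x(gamma_a) for _ in range(n)])
    xb = np.stack([sample_x(gamma_b) for _ in range(n)])
    return bool(np.abs(xa.mean(0) - xb.mean(0)).max() < tol
                and np.abs(xa.std(0) - xb.std(0)).max() < tol)

# e.g. two differently labelled perturbations with the same target and
# dose-response land in the same equivalence class of perturbations.
```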
Caveats / research leads.
- Acyclic \(M\). The implementation penalizes cycles; true feedback systems need extensions.
- MAP over full Bayes. Using MAP instead of full Bayes means uncertainty about \(\omega\) and \(M\) isn’t fully propagated; that matters for abstraction-error accounting.
- Signal in \(\gamma\). If the macro descriptor doesn’t encode the mechanism of action, the learned translator won’t align.
- Intervention semantics. Bridging to fully general “interventionals” (overlapping, modular mechanism edits) isn’t done.
5 Factored space models
Garrabrant et al. (2024) looks like a slightly different take:
Causality plays an important role in understanding intelligent behavior, and there is a wealth of literature on mathematical models for causality, most of which is focused on causal graphs. Causal graphs are a powerful tool for a wide range of applications, in particular when the relevant variables are known and at the same level of abstraction. However, the given variables can also be unstructured data, like pixels of an image. Meanwhile, the causal variables, such as the positions of objects in the image, can be arbitrary deterministic functions of the given variables. Moreover, the causal variables may form a hierarchy of abstractions, in which the macro-level variables are deterministic functions of the micro-level variables. Causal graphs are limited when it comes to modeling this kind of situation. In the presence of deterministic relationships there is generally no causal graph that satisfies both the Markov condition and the faithfulness condition. We introduce factored space models as an alternative to causal graphs which naturally represent both probabilistic and deterministic relationships at all levels of abstraction. Moreover, we introduce structural independence and establish that it is equivalent to statistical independence in every distribution that factorizes over the factored space. This theorem generalizes the classical soundness and completeness theorem for d-separation.
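A minimal instance of the failure mode in that last claim (my own toy example, not one from the paper): let
\[
X \sim \mathrm{Bernoulli}(\tfrac{1}{2}), \qquad Z := X, \qquad Y := X \oplus \varepsilon, \quad \varepsilon \sim \mathrm{Bernoulli}(0.1).
\]
Because \(X\) and \(Z\) determine each other, both \(X \perp Y \mid Z\) and \(Z \perp Y \mid X\) hold, while \(X \not\perp Y\). Faithfulness would force a graph with no edge between \(X\) and \(Y\) and none between \(Z\) and \(Y\), leaving \(Y\) disconnected; the Markov condition would then entail \(X \perp Y\), contradicting the dependence we built in. So no causal graph over these three variables is simultaneously Markov and faithful.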
6 LLM summary
Here be dragons! I used Perplexity to summarize all the strands of causal abstraction. I can guarantee even less than usual about the correctness of this summary.
Recent advances in causal abstraction theory have provided rigorous mathematical frameworks for analyzing systems at multiple levels of granularity while preserving causal structure. This report synthesizes the core contributions across key papers in this domain, examining both theoretical foundations and practical applications.
6.1 Formal Foundations of Causal Abstraction
The foundational work of Beckers and Halpern (2019) established τ-abstractions as a precise mechanism for mapping between causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their framework introduced:
- A three-component abstraction tuple (τ, ω, σ) that maps variables, interventions and outcomes between models
- Compositionality guarantees that ensure abstraction hierarchies remain causally consistent
- A distinction between exact and approximate abstractions, with error bounds (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
Earlier, Rubenstein et al. (2017) had formalized the notion of exact transformations between structural causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their key insight was to state intervention-preservation requirements as commutative diagrams:
\[ \begin{CD} \mathcal{I}_L @>\omega>> \mathcal{I}_H \\ @V{\sim}VV @VV{\sim}V \\ \mathcal{M}_L @>>\tau> \mathcal{M}_H \end{CD} \]
Here, \(\omega\) maps low-level interventions \(\mathcal{I}_L\) to high-level interventions \(\mathcal{I}_H\) while preserving outcome relationships via \(\tau\) (Beckers and Halpern 2020; Beckers and Halpern 2019).
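A toy instance of the commuting square (my construction, in the spirit of the examples in these papers): two binary micro causes are summed into one macro cause, and \(\omega\) translates each micro intervention into the corresponding macro intervention.

```python
# Check that intervene-then-abstract equals abstract-then-intervene
# for a toy tau/omega pair (deterministic models, for simplicity).
from itertools import product

def low_model(do_x1, do_x2):
    """Low-level SCM under do(X1=x1, X2=x2): Y := X1 + X2."""
    return {"X1": do_x1, "X2": do_x2, "Y": do_x1 + do_x2}

def high_model(do_xbar):
    """High-level SCM under do(Xbar=xbar): Ybar := Xbar."""
    return {"Xbar": do_xbar, "Ybar": do_xbar}

tau = lambda s: {"Xbar": s["X1"] + s["X2"], "Ybar": s["Y"]}    # state map
omega = lambda x1, x2: x1 + x2                                 # intervention map

assert all(
    tau(low_model(x1, x2)) == high_model(omega(x1, x2))
    for x1, x2 in product([0, 1], repeat=2)
)
```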
Rischel and Weichwald (2021) establish compositionality using category theory, proving that abstraction errors satisfy a triangle inequality:
\[ \epsilon(M \rightarrow M'') \leq \epsilon(M \rightarrow M') + \epsilon(M' \rightarrow M'') \]
They use enriched category structures (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020), and introduce KL-divergence-based error metrics that maintain causal semantics across transformations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023).
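Schematically, and glossing over the enriched-categorical machinery, the quantity being composed looks something like a worst-case divergence between the pushed-forward low-level interventional distribution and its high-level counterpart. A sketch under that (my) reading, not their exact definition:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def abstraction_error(interventions, pushforward_low, high):
    """max over allowed interventions i of KL( tau_* P_L^do(i) || P_H^do(omega(i)) )."""
    return max(kl(pushforward_low(i), high(i)) for i in interventions)
```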
6.2 Approximation and Error Quantification
Beckers and Halpern (2019) introduced formal error metrics for approximate abstractions via:
- Intervention-specific divergence measures
- Worst-case error bounds across allowed interventions
- Probabilistic extensions handling observational uncertainty (Beckers and Halpern 2020; Shin and Gerstenberg 2023).
They operationalized this through error lattices, allowing analysis of approximation quality at different granularities (Beckers and Halpern 2020). Massidda et al. (2023) extended this to soft interventions and proved uniqueness conditions for intervention maps ω under mechanism-preservation constraints (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017).
Key theoretical results include:
- Compositionality of abstraction errors (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023)
- Explicit construction of ω maps via quotient spaces (Massidda et al. 2023)
- Duality between variable clustering and intervention preservation (Beckers and Halpern 2020; Beckers and Halpern 2019)
6.3 Applications Across Domains
6.3.1 Neuroscience
(J. Dubois, Oya, et al. 2020; D. Dubois and Prade 2020; J. Dubois, Eberhardt, et al. 2020) apply causal abstraction to neural population dynamics and demonstrate
- Valid abstractions from spiking models to mean-field approximations
- Emergent causal patterns in coarse-grained neural representations
- Intervention preservation across biological scales (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017)
6.3.2 Climate Science
Chalupka et al. (2016) showed how to abstract El Niño models from high-dimensional wind/temperature data through
- Variable clustering preserving causal connectivity
- Intervention consistency for climate predictions
- Validation through hurricane trajectory simulations (Beckers and Halpern 2020)
6.3.3 Deep Learning
(A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2022; A. R. Geiger 2023; A. Geiger, Wu, et al. 2024) developed interchange intervention techniques for analysing neural networks
- Alignment between model layers and symbolic reasoning steps
- Causal faithfulness metrics for transformer architectures
- Applications to NLP and computer vision models (A. Geiger et al. 2021; Chalupka, Eberhardt, and Perona 2017)
Their ANTRA framework lets us test whether neural networks implement known algorithmic structures through intervention graphs (A. Geiger et al. 2021).
7 Methodological Themes
- Intervention-Centric Formalization: All approaches centre intervention preservation as the core abstraction criterion (Beckers and Halpern 2020; Beckers and Halpern 2019; Massidda et al. 2023)
- Compositionality: Hierarchical error propagation and transform composition are fundamental requirements (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Approximation Metrics: KL-divergence, Wasserstein distance, and intervention-specific losses dominate (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
- Algebraic Structures: Category theory and lattice frameworks provide mathematical foundations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Empirical Validation: Applications demonstrate abstraction viability through simulation and model testing (A. Geiger et al. 2021; Massidda et al. 2023)
