I just ran into this area while trying to invent something similar myself, only to find I’m years too late. It is an interesting framework for relaxed or approximate modelling of causal interventions; it seems to formalize coarse-graining for causal models.
We suspect that the notorious causal-inference capabilities of LLMs might be built out of such things, or at least understood in terms of them.
1 Causality in hierarchical systems
In the hierarchical setting, we want to consider a system made of micro- and macro-states; we wonder when many microstates can be abstracted into a few macrostates in a causal sense.
A. Geiger, Ibeling, et al. (2024) summarises:
In some ways, studying modern deep learning models is like studying the weather or an economy: they involve large numbers of densely connected ‘microvariables’ with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables (‘macrovariables’). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analysing a system at multiple levels of detail (Chalupka, Eberhardt, and Perona 2017; Rubenstein et al. 2017; Beckers and Halpern 2019, 2019; Rischel and Weichwald 2021; Massidda et al. 2023). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyse weather patterns (Chalupka et al. 2016), human brains (J. Dubois, Oya, et al. 2020; J. Dubois, Eberhardt, et al. 2020), and deep learning models (Chalupka, Perona, and Eberhardt 2015; A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2021; Hu and Tian 2022; A. Geiger, Wu, et al. 2024; Z. Wu et al. 2023).
Imagine trying to understand a bustling city by tracking everyone’s movement. This “micro-level” perspective is overwhelming. Instead, we might analyse neighbourhoods (macro-level) to identify traffic patterns or economic activity. In physics, we call this coarse-graining. Causal abstraction asks a more statistical question: When does a simplified high-level model (macrovariables) accurately represent a detailed low-level system (microvariables)?
For example, a neural network classifies images using millions of neurons (microvariables). A causal abstraction might represent this as a high-level flowchart: Input Image → Detect Edges → Identify Shapes → Classify Object
This flowchart is a macrovariable model that abstracts away neuronal details while preserving the “causal story” of how the network works.
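To make the picture concrete, here is a minimal toy sketch, entirely my own construction rather than anything from the papers: a handful of “neurons” play the microvariables, an aggregated “evidence” score plays the macrovariable, and we check that intervening at the micro level and then abstracting agrees with abstracting and then intervening.

```python
# Toy sketch (my own construction): four "neurons" are the microvariables,
# one aggregated "evidence" score is the macrovariable, and we check that
# (intervene at the micro level, then abstract) agrees with
# (abstract, then intervene at the macro level).
import numpy as np

rng = np.random.default_rng(0)

# --- micro-level model --------------------------------------------------
def micro_forward(pixels, clamp_neurons=None):
    """Four neurons summarise the input; a threshold gives the class."""
    w = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0], [0.2, 0.2]])
    neurons = w @ pixels                       # microvariables
    if clamp_neurons is not None:              # hard intervention on the neurons
        neurons = clamp_neurons
    return neurons, float(neurons.sum() > 0)   # class decision

# --- macro-level model and abstraction map tau ---------------------------
def tau(neurons):
    """Abstraction map: collapse the four neurons into one evidence score."""
    return neurons.sum()

def macro_forward(evidence):
    """High-level flowchart: Evidence -> Class."""
    return float(evidence > 0)

# --- consistency check ----------------------------------------------------
pixels = rng.normal(size=2)
clamped = np.array([2.0, -0.5, 0.3, 0.1])      # a micro-level intervention

_, y_micro = micro_forward(pixels, clamp_neurons=clamped)
y_macro = macro_forward(tau(clamped))          # the corresponding macro intervention

print(y_micro == y_macro)                      # True: the abstraction commutes here
```

The interesting cases, of course, are the ones where this commutation holds only approximately, or only for some restricted class of interventions.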
Easy to say, harder to formalize.
Chalupka, Eberhardt, and Perona (2016) explains this idea with equivalence classes of variable states that induce a partition on the space of possible causal models. The fundamental object from this hierarchical perspective is a causal partition. Chalupka, Eberhardt, and Perona (2017) constructs some contrived worked examples. They work in terms of discrete variables (or discretisations of continuous ones) to make it easy to discuss the measure of the sets implicated in the partition. I am going to leave this work aside for now, because it is a nice intuition pump but way too clunky for what I need.
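Before setting it aside, here is a tiny concrete version of the causal-partition idea (my own toy example, not one of theirs): micro-states of a cause X are grouped into classes by their interventional effect on Y, and the macrovariable is simply the class label.

```python
# Toy causal partition: micro-states of X with identical interventional
# effect on Y fall into the same macro-class. The numbers are made up.
from collections import defaultdict

# P(Y = 1 | do(X = x)) for each micro-state x of the cause.
effect_of = {
    "x0": 0.1, "x1": 0.1,   # these two micro-states act identically on Y ...
    "x2": 0.7, "x3": 0.7,   # ... as do these two ...
    "x4": 0.9,              # ... and this one sits in a class of its own.
}

classes = defaultdict(list)
for x, p in effect_of.items():
    classes[p].append(x)    # group micro-states with equal interventional effect

# The macrovariable (the "causal class" of X) takes one value per group.
for macro_value, (p, members) in enumerate(sorted(classes.items())):
    print(f"macro state C={macro_value}: micro-states {members}, P(Y=1|do(X))={p}")
```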
2 Non-hierarchical models
A. Geiger, Ibeling, et al. (2024) generalises this further, considering equivalence classes over “messy” structures where, for example, microvariables are not neatly partitioned into macrovariables but may be involved in many macrovariables at once. The authors would also like to handle systems with loops, and ultimately argue that causal abstraction is a unifying language for causality in machine learning, in particular for mechanistic interpretability and ablation studies.
A shortcoming of existing theory is that macrovariables cannot be represented by quantities formed from overlapping sets of microvariables. Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high-level concepts are thought to be represented by modular ‘features’ distributed across individual neural activations […].
Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new (‘translated’) variable space. We elucidate this situation, generalising earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).
2.1 Distributed alignment search
e.g. (Abraham et al. 2022; Arora, Jurafsky, and Potts 2024; A. Geiger, Wu, et al. 2024; Tigges et al. 2023)
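The section title deserves at least a sketch. As I currently understand distributed alignment search (caveat: this is my gloss, not anything lifted from those papers), the core operation is an interchange intervention in a learned linear subspace of a hidden representation, rather than on individual neurons: a rotation R is optimised so that swapping that subspace between a source run and a base run makes the network behave like the high-level causal model’s counterfactuals. Here is the swap operation itself, with a fixed random rotation standing in for the learned one:

```python
# Rough sketch of the interchange-in-a-rotated-subspace operation.
# In DAS proper the rotation R is learned by gradient descent; here it is
# a fixed random orthogonal matrix, purely to illustrate the mechanics.
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 2                                    # hidden width, subspace size

R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # stand-in for the learned rotation

def interchange(h_base, h_source, R, k):
    """Replace the first k rotated coordinates of h_base with those of h_source."""
    zb, zs = R @ h_base, R @ h_source
    zb[:k] = zs[:k]
    return R.T @ zb                            # rotate back to neuron space

h_base = rng.normal(size=d)                    # activation from the "base" input
h_source = rng.normal(size=d)                  # activation from the "source" input
h_new = interchange(h_base, h_source, R, k)

# The swapped content lives in a *distributed* direction, not in any one neuron:
print(np.allclose(R @ h_new, np.r_[(R @ h_source)[:k], (R @ h_base)[k:]]))  # True
```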
3 Interventions
To validate abstractions, we use interventions, i.e. controlled changes to a system. These seem to come in increasing levels of generality (a toy sketch follows the list):
- Hard interventions: force variables to specific values (e.g., clamping a neuron’s activation). These are the classic Judea-Pearl-style interventions made famous by the do-calculus.
- Soft interventions: rather than setting a variable to a fixed value as in hard interventions, we replace its mechanism, e.g. assign it a new distribution that may still depend on its parents. This idea seems simple and intuitive in Correa and Bareinboim (2020); the presentation in A. Geiger, Ibeling, et al. (2024) is a little more opaque to me.
- Interventionals: generalised transformations of mechanisms, i.e. maps from old mechanisms to new mechanisms (e.g., redistributing a concept across multiple neurons). This is the new thing in A. Geiger, Ibeling, et al. (2024), and I have no intuition about it yet at all.
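Here is a toy rendering of the three levels in code, using my own naming and treating a mechanism as a function from parent values to a value; a sketch of the idea, not the paper’s formalism.

```python
# Toy sketch of hard interventions, soft interventions, and interventionals.
# A "mechanism" is just a function from parent values to a value.
import random

def mech(parents):                    # original mechanism: X := parent + noise
    return parents["pa"] + random.gauss(0, 0.1)

# Hard intervention: ignore the parents entirely, clamp to a constant.
def do_hard(value):
    return lambda parents: value

# Soft intervention: substitute a new (here stochastic) mechanism that may
# still depend on the parents.
def do_soft(new_mech):
    return new_mech

# Interventional: a map from the OLD mechanism to a NEW mechanism, so the
# edit can reuse (e.g. rescale or redistribute) what the old mechanism computed.
def interventional(transform):
    return lambda old_mech: (lambda parents: transform(old_mech(parents), parents))

clamp_to_3 = do_hard(3.0)
noisier    = do_soft(lambda parents: parents["pa"] + random.gauss(0, 1.0))
halved     = interventional(lambda old_value, parents: 0.5 * old_value)(mech)

pa = {"pa": 2.0}
print(clamp_to_3(pa), noisier(pa), halved(pa))
```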
4 LLM summary
Here be dragons! I got perplexity to summarise all the strands. I guarantee even less than usual about the correctness of this summary.
Recent advances in causal abstraction theory have provided rigorous mathematical frameworks for analysing systems at multiple levels of granularity while preserving causal structure. This report synthesises the core contributions across key papers in this domain, examining both theoretical foundations and practical applications.
4.1 Formal Foundations of Causal Abstraction
The foundational work of Beckers and Halpern (2019) established τ-abstractions as a precise mechanism for mapping between causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their framework introduced:
- A three-component abstraction tuple (τ, ω, σ) mapping variables, interventions, and outcomes between models
- Compositionality guarantees ensuring abstraction hierarchies maintain causal consistency
- Distinction between exact vs approximate abstractions through error bounds (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
Rubenstein et al. (2017) had earlier formalised the notion of exact transformations between structural causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their key insight was to phrase intervention preservation as a commutative-diagram requirement: for every allowed low-level intervention $i$,

$$\tau_{*}\left(P^{\mathrm{do}(i)}_{\mathcal{M}_{L}}\right) = P^{\mathrm{do}(\omega(i))}_{\mathcal{M}_{H}},$$

where ω maps low-level interventions to high-level interventions and $\tau_{*}$ pushes the intervened low-level distribution forward to the high-level variable space.
Rischel and Weichwald (2021) advanced compositionality through category theory, proving that abstraction errors compose subadditively,

$$e(\beta \circ \alpha) \le e(\alpha) + e(\beta)$$

for composable abstractions α and β, using enriched category structures (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020). Their framework introduced KL-divergence based error metrics while maintaining causal semantics across transformations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023).
4.2 Approximation and Error Quantification
Beckers and Halpern (2019) introduced formal error metrics for approximate abstractions through:
- Intervention-specific divergence measures
- Worst-case error bounds across allowed interventions
- Probabilistic extensions handling observational uncertainty (Beckers and Halpern 2020; Shin and Gerstenberg 2023).
This was operationalised through error lattices where approximation quality could be analysed at different granularities (Beckers and Halpern 2020). Massidda et al. (2023) extended this to soft interventions, proving uniqueness conditions for intervention maps ω under mechanism preservation constraints (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017).
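For intuition, here is a toy numerical version of the intervention-specific and worst-case error metrics listed above; the two models and the set of allowed interventions are entirely made up.

```python
# Toy worst-case abstraction error: compare low-level and high-level
# predictions under each allowed intervention, report the largest discrepancy.
interventions = [0.0, 1.0, 2.5, 5.0]           # allowed hard interventions on X

def low_level(x):                              # fine-grained model's prediction for Y
    return 2.0 * x + 0.05 * x**2               # slightly non-linear

def high_level(x):                             # abstract model: Y = 2X
    return 2.0 * x

errors = {x: abs(low_level(x) - high_level(x)) for x in interventions}
worst_case = max(errors.values())

print(errors)          # intervention-specific divergences
print(worst_case)      # worst-case error bound across the allowed interventions
```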
Key theoretical results include:
- Compositionality of abstraction errors (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023)
- Explicit construction of ω maps via quotient spaces (Massidda et al. 2023)
- Duality between variable clustering and intervention preservation (Beckers and Halpern 2020; Beckers and Halpern 2019)
4.3 Applications Across Domains
4.3.1 Neuroscience
(J. Dubois, Oya, et al. 2020; D. Dubois and Prade 2020; J. Dubois, Eberhardt, et al. 2020) apply causal abstraction to neural population dynamics, demonstrating:
- Valid abstractions from spiking models to mean-field approximations
- Emergent causal patterns in coarse-grained neural representations
- Intervention preservation across biological scales (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017)
4.3.2 Climate Science
Chalupka et al. (2016) showed how El Niño models could be abstracted from high-dimensional wind/temperature data through:
- Variable clustering preserving causal connectivity
- Intervention consistency for climate predictions
- Validation through hurricane trajectory simulations (Beckers and Halpern 2020)
4.3.3 Deep Learning
(A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2022; A. R. Geiger 2023; A. Geiger, Wu, et al. 2024) developed interchange intervention techniques for analysing neural networks:
- Alignment between model layers and symbolic reasoning steps
- Causal faithfulness metrics for transformer architectures
- Applications to NLP and computer vision models (A. Geiger et al. 2021; Chalupka, Eberhardt, and Perona 2017)
Their ANTRA framework enabled testing whether neural networks implement known algorithmic structures through intervention graphs (A. Geiger et al. 2021).
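For intuition about what such an interchange intervention tests, here is a toy example of my own: a tiny “network” that computes (a + b) * c, checked against the two-step high-level algorithm S = a + b, Y = S * c.

```python
# Toy interchange intervention: does the "network" implement the two-step
# algorithm S = a + b, Y = S * c? Swap the hidden value hypothesised to
# encode S from a source run into a base run, and compare the output with
# the high-level model's counterfactual.
def low_hidden(a, b):          # hypothesised neural correlate of S
    return a + b

def low_output(hidden, c):
    return hidden * c

base, source = (1, 2, 3), (10, 20, 3)

# Interchange: hidden value from the source run, everything else from the base run.
swapped = low_output(low_hidden(*source[:2]), base[2])

# High-level counterfactual: set S to its source value, keep base's c.
expected = (source[0] + source[1]) * base[2]

print(swapped == expected)     # True: this alignment passes the intervention test
```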
5 Methodological Themes
- Intervention-Centric Formalisation: All approaches centre intervention preservation as the core abstraction criterion (Beckers and Halpern 2020; Beckers and Halpern 2019; Massidda et al. 2023)
- Compositionality: Hierarchical error propagation and transform composition are fundamental requirements (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Approximation Metrics: KL-divergence, Wasserstein distance, and intervention-specific losses dominate (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
- Algebraic Structures: Category theory and lattice frameworks provide mathematical foundations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Empirical Validation: Applications demonstrate abstraction viability through simulation and model testing (A. Geiger et al. 2021; Massidda et al. 2023)