Causal abstraction
Coarse-graining for causal models
2025-02-24 — 2025-12-20
Wherein causal abstraction is presented as a formalism for mapping macro interventions to micro mechanisms, via learnable translators and generalized interventions called interventionals, and intervention equivalence is treated as a partition over perturbations.
I ran into this field of research while trying to invent it, then realized I was years too late.
Causal “abstraction”-type approaches extend or replace traditional causal inference with relaxed or approximate causal modelling of interventions.
As such, we can probably think of them as formalizing coarse-graining for causal models.
We suspect that the notorious capacity for causal inference in LLMs might be built on these ideas, or at least understood in those terms.
Q: How do we bridge this to disentangled representation learning?
1 Causality in hierarchical systems
In the hierarchical setting, we consider a system made up of micro- and macro-states; we ask when many micro-states can be abstracted into a few macro-states in a causal sense.
A. Geiger, Ibeling, et al. (2024) summarizes:
In some ways, studying modern deep learning models is like studying the weather or an economy: they involve large numbers of densely connected ‘microvariables’ with complex, non-linear dynamics. One way of reining in this complexity is to find ways of understanding these systems in terms of higher-level, more abstract variables (‘macrovariables’). For instance, the many microvariables might be clustered together into more abstract macrovariables. A number of researchers have been exploring theories of causal abstraction, providing a mathematical framework for causally analysing a system at multiple levels of detail (Chalupka, Eberhardt, and Perona 2017; Rubenstein et al. 2017; Beckers and Halpern 2019, 2019; Rischel and Weichwald 2021; Massidda et al. 2023). These methods tell us when a high-level causal model is a simplification of a (typically more fine-grained) low-level model. To date, causal abstraction has been used to analyse weather patterns (Chalupka et al. 2016), human brains (J. Dubois, Oya, et al. 2020; J. Dubois, Eberhardt, et al. 2020), and deep learning models (Chalupka, Perona, and Eberhardt 2015; A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2021; Hu and Tian 2022; A. Geiger, Wu, et al. 2024; Z. Wu et al. 2023).
Imagine trying to understand a bustling city by tracking everyone’s movement. This “micro-level” perspective is overwhelming. Instead, we might analyze neighbourhoods (macro-level) to identify traffic patterns or economic activity. In physics, we call this coarse-graining. Causal abstraction asks the analogous question formally: when does a simplified high-level model (macrovariables) accurately represent a detailed low-level system (microvariables), even under intervention?
For example, a neural network classifies images using millions of neurons (microvariables). A causal abstraction might represent this as a high-level flowchart:

Input Image → Detect Edges → Identify Shapes → Classify Object

This flowchart is a macrovariable model that abstracts away neuronal details while preserving the “causal story” of how the network works.
Easy to say, harder to formalize.
Chalupka, Eberhardt, and Perona (2016) explain this idea via equivalence classes of variable states, which induce a partition on the space of possible causal models. The fundamental object from this hierarchical perspective is a causal partition. Chalupka, Eberhardt, and Perona (2017) construct some contrived examples. They work with discrete variables (or discretize continuous ones) to make it easy to discuss the measures of the sets in the partition. I’ll leave this work aside for now; it’s a nice intuition pump but too clunky for what I need.
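To make the partition idea concrete, here is a tiny discrete toy of my own (not one of their examples): micro-states of a cause are lumped together exactly when they induce the same interventional distribution over the effect, and the resulting classes are the macro-level states.

```python
# Toy causal partition in the spirit of Chalupka, Eberhardt & Perona
# (my construction): micro-states of a cause are equivalent iff they
# induce the same interventional distribution over the effect.
from itertools import product
from collections import defaultdict

# Micro cause: a pair of binary "pixels"; effect E is binary.
micro_states = list(product([0, 1], repeat=2))

def p_effect_given_do(micro):
    """P(E = 1 | do(C = micro)); here only the pixel sum matters."""
    return {0: 0.1, 1: 0.7, 2: 0.7}[sum(micro)]

# Group micro-states by the interventional distribution they induce.
classes = defaultdict(list)
for s in micro_states:
    classes[p_effect_given_do(s)].append(s)

for p, members in classes.items():
    print(f"P(E=1 | do(C in {members})) = {p}")
# (0,1), (1,0) and (1,1) collapse into one macro-state; (0,0) is the other.
```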
2 Non-hierarchical models
A. Geiger, Ibeling, et al. (2024) generalize in an interesting direction; they consider equivalence classes over “messy” structures where, for example, microvariables aren’t neatly partitioned into macrovariables and may participate in multiple macrovariables. They also want to handle systems with loops. In the end, they argue it’s a unifying language for causality in machine learning, particularly for mechanistic interpretability and ablation studies.
A shortcoming of existing theory is that macrovariables cannot be represented by quantities formed from overlapping sets of microvariables. Just as with neural network models of human cognition (Smolensky, 1986), this is the typical situation in mechanistic interpretability, where high-level concepts are thought to be represented by modular ‘features’ distributed across individual neural activations […].
Our first contribution is to extend the theory of causal abstraction to remove this limitation, building heavily on previous work. The core issue is that typical hard and soft interventions replace variable mechanisms entirely, so they are unable to isolate quantities distributed across overlapping sets of microvariables. To address this, we consider a very general type of intervention—what we call interventionals—that maps from old mechanisms to new mechanisms. While this space of operations is generally unconstrained, we isolate special classes of interventionals that form intervention algebras, satisfying two key modularity properties. Such classes can essentially be treated as hard interventions with respect to a new (‘translated’) variable space. We elucidate this situation, generalising earlier work by Rubenstein et al. (2017) and Beckers and Halpern (2019).
2.1 Distributed alignment search
e.g. (Abraham et al. 2022; Arora, Jurafsky, and Potts 2024; A. Geiger, Wu, et al. 2024; Tigges et al. 2023)
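As I understand DAS, one learns an orthogonal change of basis for a layer’s activations and performs interchange interventions on a few of the rotated coordinates, so the intervened “feature” is distributed across overlapping neurons. A minimal sketch of the interchange operation itself (names and shapes are mine; the real method trains the rotation against an alignment objective):

```python
# Stripped-down distributed interchange intervention in the spirit of DAS.
import numpy as np

def distributed_interchange(h_base, h_source, R, dims):
    """Swap the `dims` coordinates of the rotated representation of h_base
    with those of h_source, then rotate back to neuron coordinates."""
    z_base, z_source = R @ h_base, R @ h_source
    z_base[dims] = z_source[dims]      # interchange the aligned subspace
    return R.T @ z_base                # R is orthonormal, so R.T undoes it

rng = np.random.default_rng(0)
d = 8
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # stand-in for the learned rotation
h_base, h_source = rng.normal(size=d), rng.normal(size=d)
h_patched = distributed_interchange(h_base, h_source, R, dims=[0, 1])
```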
3 Interventions
To validate abstractions, we use interventions: controlled changes to a system. Interventions themselves come at several levels of generality.
- Hard interventions: Force variables to specific values (e.g., clamp a neuron’s activation); these are the classic Judea-Pearl-style interventions popularized by the do-calculus.
- Soft interventions: Rather than setting a variable to a single value as in a hard intervention, we assign it a new distribution. I found this idea simple and intuitive in Correa and Bareinboim (2020); the presentation in A. Geiger, Ibeling, et al. (2024) was a little more opaque.
- Interventionals generalize both of these: they are transformations of mechanisms (e.g., redistributing a concept across multiple neurons). This is a new idea in A. Geiger, Ibeling, et al. (2024), and I don’t have an intuition about it yet; see the sketch after this list.
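A minimal sketch of the three levels, treating a mechanism as a Python function from parent values and exogenous noise to a variable’s value (notation mine, not the paper’s):

```python
import random

def mech_y(x, u):
    """Original mechanism for Y given parent X and exogenous noise u."""
    return 2.0 * x + u

# Hard intervention do(Y = 3): replace the mechanism with a constant.
def hard(x, u):
    return 3.0

# Soft intervention: replace the mechanism with a new conditional law,
# here Y ~ Normal(x, 0.1), still ignoring the original mechanism.
def soft(x, u):
    return random.gauss(x, 0.1)

# Interventional: a map from OLD mechanisms to NEW mechanisms, which can
# read and reuse the mechanism it replaces (here, damping its output).
def damp(old_mechanism, factor=0.5):
    return lambda x, u: factor * old_mechanism(x, u)

mech_y_damped = damp(mech_y)
```

The difference that matters is that `damp` takes the old mechanism as an argument; hard and soft interventions simply discard it.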
4 Generative Intervention Models
Work from Schölkopf’s lab looks interesting. In Generative Intervention Models (GIMs) (Schneider et al. 2025), the authors build something that might be the missing operational piece between “micro” causal models and “macro” perturbations. The model is
\[
p(x;\gamma)\;=\;\int p\bigl(x\mid I;M\bigr)\;p\bigl(I\mid \gamma,\phi\bigr)\,dI
\]
- \(M\) is a structural causal model (graph \(G\), mechanisms \(\theta\)).
- \(I\) is an atomic intervention (which variables we hit and how their mechanisms change).
- \(\gamma\) are observable features of the perturbation (e.g. a drug and dose, or an edit/ablation spec).
- The learned translator \(p(I\mid \gamma,\phi)\) is parameterised by two functions: \(g_\phi(\gamma)\) predicts intervention targets, and \(h_\phi(I,\gamma)\) predicts the interventional parameters. The authors train \(M\) and \(\phi\) jointly, then approximate the posterior predictive \(p(x\mid D;\gamma)\approx p(x\mid M^*,\phi^*;\gamma)\). A toy sketch of this sampling story follows.
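A runnable toy mirroring that factorisation (everything here is my own construction: the two-variable chain, the `drug`/`dose` descriptor, and the deterministic stand-ins for \(g_\phi\) and \(h_\phi\), which in the real model parameterise a distribution over interventions rather than a point estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-variable chain X1 -> X2 standing in for the SCM M.
base_mechanisms = {
    "X1": lambda parent, u: u,
    "X2": lambda parent, u: 1.5 * parent + u,
}

def scm_sample(mechanisms):
    x1 = mechanisms["X1"](None, rng.normal())
    x2 = mechanisms["X2"](x1, rng.normal())
    return np.array([x1, x2])

def g_phi(gamma):
    """Hypothetical target predictor: which variable the perturbation hits."""
    return "X2" if gamma["drug"] == "B" else "X1"

def h_phi(target, gamma):
    """Hypothetical parameter predictor: a dose-proportional mean shift."""
    old = base_mechanisms[target]
    return lambda parent, u: old(parent, u) + gamma["dose"]

def sample_x(gamma):
    """One draw from p(x; gamma): pick I from the translator, then x ~ p(x | I; M)."""
    target = g_phi(gamma)
    mechanisms = dict(base_mechanisms)
    mechanisms[target] = h_phi(target, gamma)   # swap in the intervened mechanism
    return scm_sample(mechanisms)

xs = np.stack([sample_x({"drug": "B", "dose": 2.0}) for _ in range(1000)])
```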
Mapping onto causal abstraction:
- Macro→micro translator. \(p(I\mid\gamma,\phi)\) acts like a learnable \(\omega\)-map from high-level “knobs” to low-level interventions—exactly what abstraction frameworks need but rarely get from data.
- Coarse-graining of interventions. Two macros \(\gamma,\gamma'\) are abstractly equivalent if they induce (approximately) the same interventional distribution on \(M\). That’s a causal partition over perturbations, not just states (a toy check follows this list).
- Pluggable semantics. Because \(p(x\mid I;M)\) is defined by swapping mechanisms, we can recover hard or soft interventions and—at least in spirit—move towards the general “interventionals”.
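Continuing the toy above, a crude Monte Carlo version of that equivalence (the moment-matching test and tolerance are arbitrary choices of mine; a real check would use a proper two-sample test or divergence estimate):

```python
def equivalent(gamma_a, gamma_b, n=5000, tol=0.1):
    """Do two macro descriptors induce (roughly) the same interventional
    distribution on M? If so, they sit in the same cell of the partition."""
    xa = np.stack([sample_x(gamma_a) for _ in range(n)])
    xb = np.stack([sample_x(gamma_b) for _ in range(n)])
    return bool(np.abs(xa.mean(0) - xb.mean(0)).max() < tol
                and np.abs(xa.std(0) - xb.std(0)).max() < tol)

# e.g. two differently labelled perturbations with the same target and
# dose-response land in the same equivalence class of perturbations.
```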
Caveats / research leads.
- Acyclic \(M\). The implementation penalizes cycles; true feedback systems need extensions.
- MAP over full Bayes. Using MAP instead of full Bayes means uncertainty about \(\omega\) and \(M\) isn’t fully propagated; that matters for abstraction-error accounting.
- Signal in \(\gamma\). If the macro descriptor doesn’t encode the mechanism of action, the learned translator won’t align.
- Intervention semantics. Bridging to fully general “interventionals” (overlapping, modular mechanism edits) isn’t done.
5 Factored space models
Garrabrant et al. (2024) looks like a slightly different take:
Causality plays an important role in understanding intelligent behavior, and there is a wealth of literature on mathematical models for causality, most of which is focused on causal graphs. Causal graphs are a powerful tool for a wide range of applications, in particular when the relevant variables are known and at the same level of abstraction. However, the given variables can also be unstructured data, like pixels of an image. Meanwhile, the causal variables, such as the positions of objects in the image, can be arbitrary deterministic functions of the given variables. Moreover, the causal variables may form a hierarchy of abstractions, in which the macro-level variables are deterministic functions of the micro-level variables. Causal graphs are limited when it comes to modeling this kind of situation. In the presence of deterministic relationships there is generally no causal graph that satisfies both the Markov condition and the faithfulness condition. We introduce factored space models as an alternative to causal graphs which naturally represent both probabilistic and deterministic relationships at all levels of abstraction. Moreover, we introduce structural independence and establish that it is equivalent to statistical independence in every distribution that factorizes over the factored space. This theorem generalizes the classical soundness and completeness theorem for d-separation.
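A minimal instance of the failure mode in that last claim (my own toy example, not one from the paper): let
\[
X \sim \mathrm{Bernoulli}(\tfrac{1}{2}), \qquad Z := X, \qquad Y := X \oplus \varepsilon, \quad \varepsilon \sim \mathrm{Bernoulli}(0.1).
\]
Because \(X\) and \(Z\) determine each other, both \(X \perp Y \mid Z\) and \(Z \perp Y \mid X\) hold, while \(X \not\perp Y\). Faithfulness would force a graph with no edge between \(X\) and \(Y\) and none between \(Z\) and \(Y\), leaving \(Y\) disconnected; the Markov condition would then entail \(X \perp Y\), contradicting the dependence we built in. So no causal graph over these three variables is simultaneously Markov and faithful.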
6 LLM summary
Here be dragons! I used Perplexity to summarize all the strands of causal abstraction. I can guarantee even less than usual about the correctness of this summary.
Recent advances in causal abstraction theory have provided rigorous mathematical frameworks for analyzing systems at multiple levels of granularity while preserving causal structure. This report synthesizes the core contributions across key papers in this domain, examining both theoretical foundations and practical applications.
6.1 Formal Foundations of Causal Abstraction
The foundational work of Beckers and Halpern (2019) established τ-abstractions as a precise mechanism for mapping between causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their framework introduced:
- A three-component abstraction tuple (τ, ω, σ) that maps variables, interventions and outcomes between models
- Compositionality guarantees that ensure abstraction hierarchies remain causally consistent
- A distinction between exact and approximate abstractions, with error bounds (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
Earlier, Rubenstein et al. (2017) had formalized the notion of exact transformations between structural causal models (Beckers and Halpern 2020; Beckers and Halpern 2019). Their key insight was to state intervention-preservation requirements as commutative diagrams:
\[ \begin{CD} \mathcal{I}_L @>\omega>> \mathcal{I}_H \\ @V{\sim}VV @VV{\sim}V \\ \mathcal{M}_L @>>\tau> \mathcal{M}_H \end{CD} \]
Here, \(\omega\) maps low-level interventions \(\mathcal{I}_L\) to high-level interventions \(\mathcal{I}_H\) while preserving outcome relationships via \(\tau\) (Beckers and Halpern 2020; Beckers and Halpern 2019).
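A toy instance of the commuting square (my construction, in the spirit of the examples in these papers): two binary micro causes are summed into one macro cause, and \(\omega\) translates each micro intervention into the corresponding macro intervention.

```python
# Check that intervene-then-abstract equals abstract-then-intervene
# for a toy tau/omega pair (deterministic models, for simplicity).
from itertools import product

def low_model(do_x1, do_x2):
    """Low-level SCM under do(X1=x1, X2=x2): Y := X1 + X2."""
    return {"X1": do_x1, "X2": do_x2, "Y": do_x1 + do_x2}

def high_model(do_xbar):
    """High-level SCM under do(Xbar=xbar): Ybar := Xbar."""
    return {"Xbar": do_xbar, "Ybar": do_xbar}

tau = lambda s: {"Xbar": s["X1"] + s["X2"], "Ybar": s["Y"]}    # state map
omega = lambda x1, x2: x1 + x2                                 # intervention map

assert all(
    tau(low_model(x1, x2)) == high_model(omega(x1, x2))
    for x1, x2 in product([0, 1], repeat=2)
)
```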
Rischel and Weichwald (2021) establish compositionality using category theory, proving that abstraction errors satisfy a triangle inequality:
\[ \epsilon(M \rightarrow M'') \leq \epsilon(M \rightarrow M') + \epsilon(M' \rightarrow M'') \]
They use enriched category structures (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020), and introduce KL-divergence-based error metrics that maintain causal semantics across transformations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023).
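Schematically, and glossing over the enriched-categorical machinery, the quantity being composed looks something like a worst-case divergence between the pushed-forward low-level interventional distribution and its high-level counterpart. A sketch under that (my) reading, not their exact definition:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def abstraction_error(interventions, pushforward_low, high):
    """max over allowed interventions i of KL( tau_* P_L^do(i) || P_H^do(omega(i)) )."""
    return max(kl(pushforward_low(i), high(i)) for i in interventions)
```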
6.2 Approximation and Error Quantification
Beckers and Halpern (2019) introduced formal error metrics for approximate abstractions via:
- Intervention-specific divergence measures
- Worst-case error bounds across allowed interventions
- Probabilistic extensions handling observational uncertainty (Beckers and Halpern 2020; Shin and Gerstenberg 2023).
They operationalized this through error lattices, allowing analysis of approximation quality at different granularities (Beckers and Halpern 2020). Massidda et al. (2023) extended this to soft interventions and proved uniqueness conditions for intervention maps ω under mechanism-preservation constraints (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017).
Key theoretical results include:
- Compositionality of abstraction errors (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023)
- Explicit construction of ω maps via quotient spaces (Massidda et al. 2023)
- Duality between variable clustering and intervention preservation (Beckers and Halpern 2020; Beckers and Halpern 2019)
6.3 Applications Across Domains
6.3.1 Neuroscience
(J. Dubois, Oya, et al. 2020; D. Dubois and Prade 2020; J. Dubois, Eberhardt, et al. 2020) apply causal abstraction to neural population dynamics and demonstrate
- Valid abstractions from spiking models to mean-field approximations
- Emergent causal patterns in coarse-grained neural representations
- Intervention preservation across biological scales (Massidda et al. 2023; Chalupka, Eberhardt, and Perona 2017)
6.3.2 Climate Science
Chalupka et al. (2016) showed how to abstract El Niño models from high-dimensional wind/temperature data through
- Variable clustering preserving causal connectivity
- Intervention consistency for climate predictions
- Validation through hurricane trajectory simulations (Beckers and Halpern 2020)
6.3.3 Deep Learning
(A. Geiger, Richardson, and Potts 2020; A. Geiger et al. 2022; A. R. Geiger 2023; A. Geiger, Wu, et al. 2024) developed interchange intervention techniques for analysing neural networks
- Alignment between model layers and symbolic reasoning steps
- Causal faithfulness metrics for transformer architectures
- Applications to NLP and computer vision models (A. Geiger et al. 2021; Chalupka, Eberhardt, and Perona 2017)
Their ANTRA framework lets us test whether neural networks implement known algorithmic structures through intervention graphs (A. Geiger et al. 2021).
7 Methodological Themes
- Intervention-Centric Formalization: All approaches centre intervention preservation as the core abstraction criterion (Beckers and Halpern 2020; Beckers and Halpern 2019; Massidda et al. 2023)
- Compositionality: Hierarchical error propagation and transform composition are fundamental requirements (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Approximation Metrics: KL-divergence, Wasserstein distance, and intervention-specific losses dominate (Beckers and Halpern 2020; Shin and Gerstenberg 2023)
- Algebraic Structures: Category theory and lattice frameworks provide mathematical foundations (Rischel and Weichwald 2021; Zennaro, Turrini, and Damoulas 2023; Beckers and Halpern 2020)
- Empirical Validation: Applications demonstrate abstraction viability through simulation and model testing (A. Geiger et al. 2021; Massidda et al. 2023)
