# UNSW Stats reading group 2016 - Causal DAGs

An introduction to conditional independence DAGs and their use for causal data.

October 17, 2016 — October 21, 2016

**tl;dr**: These are the notes from a reading group I led in 2016 on causal DAGs. When I have time to expand these notes into complete sentences, I will migrate the good bits to an expanded and improved notebook on causal DAGs. For now, see the updated and fixed version of this.

We follow Pearl’s summary (Pearl 2009a), approximately sections 1-3 of that paper.

In particular, I want to get to the identification of causal effects given an existing causal DAG from observational data with unobserved covariates via criteria such as the back-door criterion. We’ll see.

Approach: casual, motivate Pearl’s pronouncements, without deriving everything from axioms. Not statistical; will not answer the question of how we infer graph structure from data. Will skip many complexities by taking several slightly over-restrictive conditions, which we would relax if we were not doing this in 1 hour.

Not covered: UGs, PDAGs…

Assumptions: No-one here is an expert in this DAG graphical formalism for causal inference.

## 1 Motivational examples

- Wet pavements
- Obesity contagion
- Nobel prizes and chocolate
- Simpson’s paradox
- etc

## 2 Machinery

We are interested in representing influence between variables in a non-parametric fashion.

Our main tool to do this will be conditional independence DAGs, and causal use of these. Alternative name: “Bayesian Belief networks”. (Overloads “Bayesian”, so not used here)

### 2.1 DAGs

DAG: Directed acyclic graph, used here as a directed (probabilistic) graphical model. The graph is defined, as usual, by a set of vertices and a set of edges.

\[ \mathcal{G}=(\mathbf{V},E) \]

We show the directions of edges by writing them as arrows.

For nodes \(X,Y\in \mathbf{V}\) we write \(X\rightarrow Y\) to mean there is a directed edge joining them.

Familiar from, e.g., Structural equation models, hierarchical models, expert systems. General graph theory…

A graph with *directed* edges, and no cycles. (you cannot return to the same starting node traveling only *forward* along the arrows)

We need some terminology.

- **Parents**: the parents of a node \(X\) in a graph are all nodes joined to it by an incoming arrow, \(\operatorname{parents}(X)=\{Y\in \mathbf{V}:Y\rightarrow X\}.\)
- **Children**: similarly, \(\operatorname{children}(X)=\{Y\in \mathbf{V}:X\rightarrow Y\}.\)
- **Co-parents**: \(\operatorname{coparents}(X)=\{Y\in \mathbf{V}:\exists Z\in \mathbf{V} \text{ s.t. } X\rightarrow Z\text{ and }Y\rightarrow Z\}.\)

*Ancestors* and *descendants* should be clear as well. For convenience, we adopt the convention that a node counts among its own descendants, \(X\in\operatorname{descendants}(X)\).
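The vocabulary above can be made concrete with a few lines of Python. This is only a sketch: the dictionary encoding (`PARENTS`, node to set of parents) and the function names are my own choices, the edges are those of the six-variable structural model in section 2.2, and I assume that a node counts among its own ancestors and descendants.

```python
# Hypothetical encoding of a DAG: each node maps to the set of its parents.
# The edges are those of the six-variable structural model in section 2.2.
PARENTS = {
    "X1": set(),
    "X2": {"X1"},
    "X3": set(),
    "X4": {"X1", "X2", "X3"},
    "X5": {"X3", "X4"},
    "X6": {"X3", "X4"},
}


def parents(g, x):
    return set(g[x])


def children(g, x):
    return {y for y, ps in g.items() if x in ps}


def coparents(g, x):
    """Other nodes sharing a child with x (x itself is dropped here)."""
    return {p for c in children(g, x) for p in g[c]} - {x}


def _reachable(x, step):
    """All nodes reachable from x by repeatedly applying `step`, including x."""
    seen, stack = {x}, [x]
    while stack:
        for nxt in step(stack.pop()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def descendants(g, x):
    return _reachable(x, lambda n: children(g, n))


def ancestors(g, x):
    return _reachable(x, lambda n: parents(g, n))


print(sorted(children(PARENTS, "X3")))
print(sorted(coparents(PARENTS, "X2")))
print(sorted(descendants(PARENTS, "X2")))
```

Storing only the parent sets is enough because every other relation can be derived from them; it also matches the factorisation used later, which needs exactly one conditional distribution per node given its parents.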

### 2.2 Random variables

I will deal with finite collections of random variables \(\mathbf{V}\).

For simplicity of exposition, each of the RVs will be supported on \(\mathcal{X}_i\subset\mathbb{Z}\), so that we may work with pmfs, and write \(p(X_i|X_j)\) for the pmf. I may write \(p(x_i|x_j)\) to mean \(p(X_i=x_i|X_j=x_j)\).

Also, we are working with *sets of random variables* rather than *sets of events* and the discrete state space reduces the need to discuss sets of events.

Extension to continuous RVs, or arbitrary RVs is trivial for everything I discuss here. (A challenge is if the probabilities are not all strictly positive.)

Motivation in terms of structural models.

\[\begin{aligned} X_6 &= f_6(X_4, X_3, \varepsilon_6) \\ X_5 &= f_5(X_4, X_3, \varepsilon_5) \\ X_4 &= f_4(X_3, X_2, X_1, \varepsilon_4) \\ X_3 &= f_3(\varepsilon_3) \\ X_2 &= f_2(X_1, \varepsilon_2) \\ X_1 &= f_1(\varepsilon_1) \\ \end{aligned}\]

Without further information about the forms of \(f_i\) or \(\varepsilon_i\), our assumptions have constrained our conditional independence relations to permit a particular factorisation of the mass function:

\[ p(x_6, x_5, x_4, x_3, x_2, x_1) = p(x_1) p(x_2|x_1) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_3,x_4) p(x_6|x_3,x_4) \]

We are “nonparametric” in the sense that working with this conditional factorisation does not require any further parametric assumptions on the model.

However, we would like to proceed from this factorisation to conditional independence, which is non-trivial. Specifically, we would like to know which variables are conditionally independent of others, given such an (assumed) factorisation.

More notation: We write

\[ X \perp Y|Z \]

for \(X\) independent of \(Y\) given \(Z\).

We also use this notation for sets of random variables, and will bold them when it is necessary to emphasise this.

\[ \mathbf{X} \perp \mathbf{Y}|\mathbf{Z} \]

Questions:

- \(X_2\perp X_3\)?
- \(X_2\perp X_3|X_1\)?
- \(X_2\perp X_3|X_4\)?
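One way to probe questions like these is brute force. The sketch below assumes binary variables and draws arbitrary conditional probability tables (the random numbers are purely illustrative), builds the joint pmf from the factorisation above, and measures how far \(p(x_2,x_3|z)\) is from \(p(x_2|z)p(x_3|z)\). The names `PARENTS`, `CPT` and `max_dependence` are my own.

```python
import itertools
import random

random.seed(0)

# Parent sets read off the structural equations; keys are variable indices.
PARENTS = {1: (), 2: (1,), 3: (), 4: (1, 2, 3), 5: (3, 4), 6: (3, 4)}


def random_cpt(n_parents):
    """Arbitrary P(X=1 | parent values) for every configuration of binary parents."""
    return {cfg: random.uniform(0.1, 0.9)
            for cfg in itertools.product((0, 1), repeat=n_parents)}


CPT = {i: random_cpt(len(ps)) for i, ps in PARENTS.items()}


def joint(x):
    """p(x1..x6) as the product of p(x_i | parents(x_i)); x maps index -> 0 or 1."""
    prob = 1.0
    for i, ps in PARENTS.items():
        p1 = CPT[i][tuple(x[j] for j in ps)]
        prob *= p1 if x[i] == 1 else 1.0 - p1
    return prob


STATES = [dict(zip(PARENTS, vals))
          for vals in itertools.product((0, 1), repeat=len(PARENTS))]


def prob(event):
    """Probability that the variables in `event` take the given values."""
    return sum(joint(x) for x in STATES
               if all(x[i] == v for i, v in event.items()))


def max_dependence(a, b, given=()):
    """Largest |p(a,b|z) - p(a|z) p(b|z)| over all values; 0 means independence."""
    worst = 0.0
    for vals in itertools.product((0, 1), repeat=2 + len(given)):
        va, vb, vz = vals[0], vals[1], dict(zip(given, vals[2:]))
        pz = prob(vz)
        pab = prob({a: va, b: vb, **vz}) / pz
        pa = prob({a: va, **vz}) / pz
        pb = prob({b: vb, **vz}) / pz
        worst = max(worst, abs(pab - pa * pb))
    return worst


print(max_dependence(2, 3))              # zero up to floating-point noise
print(max_dependence(2, 3, given=(1,)))  # zero up to floating-point noise
print(max_dependence(2, 3, given=(4,)))  # clearly non-zero for these tables
```

A numerical check of this kind only covers the particular tables drawn, and it scales exponentially in the number of variables.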

However, this product notation is not illuminating; we use a graph formalism. That’s where the DAGs come in.

This will proceed in 3 steps:

1. The graphs will describe *conditional factorisation relations*.
2. We do some work to construct from these relations some *conditional independence relations*, which may be read off the graph.
3. From these relations plus a causal interpretation we derive rules for *identification of causal relations*.

If we get further than that, it will be all about coffee.

Anyway, a joint distribution \(p(\mathbf{V})\) decomposes according to a directed graph \(\mathcal{G}\) if we may factor it as

\[ p(X_1,X_2,\dots,X_v)=\prod_{i=1}^v p(X_i|\operatorname{parents}(X_i)) \]

Uniqueness?

It would be tempting to suppose that a node is independent of its children given its parents or somesuch. But things are not quite so simple.

Questions:

- \(\text{Sprinkler}\perp \text{Rain}\)?
- \(\text{Sprinkler}\perp \text{Rain}|\text{Wet season}\)?
- \(\text{Sprinkler}\perp \text{Rain}|\text{Wet pavement}\)?
- \(\text{Sprinkler}\perp \text{Rain}|\text{Wet season}, \text{Wet pavement}\)?

To make precise statements about conditional independence relations we do more work.

We need new graph vocabulary *and* conditional independence vocabulary.

Axiomatic characterisation of conditional independence. (Pearl 2008; Steffen L. Lauritzen 1996).

**Theorem**: (Pearl 2008) Let \(\mathbf{W},\mathbf{X},\mathbf{Y},\mathbf{Z}\subseteq\mathbf{V}\) be disjoint subsets.

Then the relation \(\cdot\perp\cdot|\cdot\) satisfies the following relations:

\[\begin{aligned} \mathbf{X} \perp \mathbf{Z} |\mathbf{Y} & \Leftrightarrow & \mathbf{Z}\perp \mathbf{X} | \mathbf{Y} && \text{ Symmetry }&\\ \mathbf{X} \perp \mathbf{Y}\cup \mathbf{W} |\mathbf{Z} & \Rightarrow & \mathbf{X} \perp \mathbf{Y}|\mathbf{Z} \text{ and } \mathbf{X} \perp \mathbf{W}|\mathbf{Z} && \text{ Decomposition }&\\ \mathbf{X} \perp \mathbf{Y}\cup \mathbf{W} |\mathbf{Z} & \Rightarrow & \mathbf{X} \perp \mathbf{Y}|\mathbf{Z}\cup\mathbf{W} && \text{ Weak Union }&\\ \mathbf{X} \perp \mathbf{Y} |\mathbf{Z} \text{ and } \mathbf{X} \perp \mathbf{W}|\mathbf{Z}\cup \mathbf{Y} & \Rightarrow & \mathbf{X} \perp \mathbf{Y}\cup \mathbf{W}|\mathbf{Z} && \text{ Contraction }&\\ \mathbf{X} \perp \mathbf{Y} |\mathbf{Z}\cup \mathbf{W} \text{ and } \mathbf{X} \perp \mathbf{W} |\mathbf{Z}\cup \mathbf{Y} & \Rightarrow & \mathbf{X}\perp \mathbf{W}\cup\mathbf{Y} | \mathbf{Z} && \text{ Intersection } & (*)\\ \end{aligned}\]

(*) The Intersection axiom only holds for strictly positive distributions.

How can we relate this to the topology of the graph?

The flow of conditional information does not correspond exactly to the marginal factorisation, but it relates. (mention UG connections?)

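**Definition**: A set \(\mathbf{S}\) *blocks* a path \(\pi\) from \(X\) to \(Y\) in a DAG \(\mathcal{G}\) if either

1. there is a node \(a\in\pi\) which *is not* a collider on \(\pi\) such that \(a\in\mathbf{S}\), or
2. there is a node \(b\in\pi\) which *is* a collider on \(\pi\) and neither \(b\) nor any of its descendants is in \(\mathbf{S}\), i.e. \(\operatorname{descendants}(b)\cap\mathbf{S}=\emptyset\).

Here a node is a *collider* on \(\pi\) if both edges of \(\pi\) that meet it point into it, as in \(\rightarrow b\leftarrow\).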

If a path is not blocked, it is *active*.

**Definition**: A set \(\mathbf{S}\) *d-separates* two subsets of nodes \(\mathbf{X},\mathbf{Y}\subseteq\mathbf{V}\) if it blocks *every* path between any pair of nodes \((A,B)\) such that \(A\in\mathbf{X},\, B\in\mathbf{Y}.\)

This looks ghastly and unintuitive, but we have to live with it because it is the shortest path to making simple statements about conditional independence DAGs without horrible circumlocutions, or starting from undirected graphs, which is tedious.

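**Theorem**: (Pearl 2008; Steffen L. Lauritzen 1996). If the joint distribution of \(\mathbf{V}\) factorises according to the DAG \(\mathcal{G}\), then for subsets of variables \(\mathbf{X},\mathbf{Y},\mathbf{S}\subseteq\mathbf{V}\) we have \(\mathbf{X}\perp\mathbf{Y}|\mathbf{S}\) whenever \(\mathbf{S}\) *d*-separates \(\mathbf{X}\) and \(\mathbf{Y}\). Conversely, if \(\mathbf{S}\) does not *d*-separate them, then \(\mathbf{X}\) and \(\mathbf{Y}\) are dependent given \(\mathbf{S}\) in at least one distribution that factorises according to \(\mathcal{G}\).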

This puts us in a position to make non-awful, more intuitive statements about the conditional independence relationships that we may read off the DAG.

**Corollary**: The DAG Markov property.

\[ X \perp \operatorname{descendants}(X)^C|\operatorname{parents}(X) \]

**Corollary**: The DAG Markov blanket.

Define

\[ \operatorname{blanket}(X):= \operatorname{parents}(X)\cup \operatorname{children}(X)\cup \operatorname{coparents}(X) \]

Then

\[ X\perp \operatorname{blanket}(X)^C|\operatorname{blanket}(X) \]

## 3 Causal interpretation

Finally!

We have a DAG \(\mathcal{G}\) and a set of variables \(\mathbf{V}\) to which we wish to give a causal interpretation.

Assume

- The joint distribution of \(\mathbf{V}\) factorises according to \(\mathcal{G}\).
- \(X\rightarrow Y\) means “\(X\) causes \(Y\)” (the Causal Markov property).
- We additionally assume *faithfulness*, that is, that \(X\leftrightsquigarrow Y\) iff there is a path connecting them.

So, are we done? Only if correlation equals causation.

We add the additional condition that

- all the relevant variables are included in the graph. (We coyly avoid making this precise)

The BBC raised one possible confounding variable:

[…] Eric Cornell, who won the Nobel Prize in Physics in 2001, told Reuters “I attribute essentially all my success to the very large amount of chocolate that I consume. Personally I feel that milk chocolate makes you stupid… dark chocolate is the way to go. It’s one thing if you want a medicine or chemistry Nobel Prize but if you want a physics Nobel Prize it pretty much has got to be dark chocolate.”

Finally, we need to discuss the relationship between conditional dependence and causal effect. This is the difference between, say,

\[ P(\text{Wet pavement}|\text{Sprinkler}=on) \]

and

\[ P(\text{Wet pavement}|\operatorname{do}(\text{Sprinkler}=on)) \]

Computing the second quantity from the DAG factorisation is called “truncated factorization” in the paper. Related keywords: \(\operatorname{do}\)-calculus and graph surgery.

If we know \(P\), this is relatively easy. Marginalize out all influences on the causal variable of interest, which we show graphically as wiping out a link.
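A toy calculation shows the difference between seeing and doing. The sketch below assumes a sprinkler network with arrows Season → Sprinkler, Season → Rain, Sprinkler → Wet pavement and Rain → Wet pavement; both this structure and all the probabilities are made up for illustration. The intervention is implemented by deleting the factor for the sprinkler and fixing its value.

```python
import itertools

# Hypothetical numbers, chosen only for illustration.
P_SEASON_WET = 0.4                                  # P(Season = wet)
P_SPRINKLER = {True: 0.1, False: 0.6}               # P(Sprinkler on | wet season?)
P_RAIN = {True: 0.7, False: 0.1}                    # P(Rain | wet season?)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}  # P(Wet | Sprinkler, Rain)

NAMES = ("season", "sprinkler", "rain", "wet")


def bern(p, value):
    return p if value else 1.0 - p


def joint(season, sprinkler, rain, wet, do_sprinkler=None):
    """Joint pmf; if do_sprinkler is set, the sprinkler factor is deleted."""
    if do_sprinkler is None:
        f_sprinkler = bern(P_SPRINKLER[season], sprinkler)
    else:
        f_sprinkler = 1.0 if sprinkler == do_sprinkler else 0.0
    return (bern(P_SEASON_WET, season) * f_sprinkler
            * bern(P_RAIN[season], rain)
            * bern(P_WET[(sprinkler, rain)], wet))


def prob(event, **kw):
    """Sum the (possibly truncated) joint over all states matching `event`."""
    total = 0.0
    for vals in itertools.product((True, False), repeat=4):
        x = dict(zip(NAMES, vals))
        if all(x[k] == v for k, v in event.items()):
            total += joint(**x, **kw)
    return total


p_see = prob({"wet": True, "sprinkler": True}) / prob({"sprinkler": True})
p_do = prob({"wet": True}, do_sprinkler=True)
print(f"P(wet | sprinkler=on)     = {p_see:.4f}")
print(f"P(wet | do(sprinkler=on)) = {p_do:.4f}")
```

With these numbers, observing the sprinkler on makes a dry season more likely and hence rain less likely, so the conditional probability comes out lower than the interventional one.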

Now suppose we are not given complete knowledge of \(P\), but only *some* of the conditional distributions. (There are *unobservable variables*). This is the setup of observational studies and epidemiology and so on.

What variables *must* we know the conditional distributions of in order to know the causal effect? We call a set of covariates \(\mathbf{S}\) an *admissible set* (or *sufficient set*) with respect to identifying the effect of \(X\) on \(Y\) if

\[ P(Y=y|\operatorname{do}(X=x))=\sum_{\mathbf{s}} P(Y=y|X=x,\mathbf{S}=\mathbf{s}) P(\mathbf{S}=\mathbf{s}) \]

**Criterion 1**: The parents of a cause are an admissible set (Pearl 2009a).

**Criterion 2**: The back door criterion.

A set \(\mathbf{S}\) satisfies the back-door criterion relative to \((X,Y)\) if

1. \(\mathbf{S}\cap\operatorname{descendants}(X)=\emptyset\), and
2. \(\mathbf{S}\) blocks all paths between \(X\) and \(Y\) which start with an arrow *into* \(X\).

This is a sufficient condition.
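The criterion can be checked by brute force on a small graph. The sketch below assumes a sprinkler network with arrows Season → Sprinkler, Season → Rain, Sprinkler → Wet and Rain → Wet (my own guess at the example graph), encodes it as node to set of parents, and enumerates paths explicitly. All function names are hypothetical, and a node is taken to be among its own descendants.

```python
# Hypothetical sprinkler network: each node maps to the set of its parents.
GRAPH = {
    "Season": set(),
    "Sprinkler": {"Season"},
    "Rain": {"Season"},
    "Wet": {"Sprinkler", "Rain"},
}


def children(g, x):
    return {y for y, ps in g.items() if x in ps}


def descendants(g, x):
    """Descendants of x, with x itself included by convention."""
    seen, stack = {x}, [x]
    while stack:
        for c in children(g, stack.pop()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


def simple_paths(g, start, end):
    """Yield every path from start to end, ignoring edge directions."""
    def extend(path):
        if path[-1] == end:
            yield path
            return
        for n in g[path[-1]] | children(g, path[-1]):
            if n not in path:
                yield from extend(path + [n])
    yield from extend([start])


def is_blocked(g, path, s):
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        if prev in g[node] and nxt in g[node]:      # collider
            if not (descendants(g, node) & s):
                return True
        elif node in s:                             # non-collider in s
            return True
    return False


def satisfies_back_door(g, x, y, s):
    # Condition 1: s contains no descendant of x.
    if s & descendants(g, x):
        return False
    # Condition 2: s blocks every path whose first edge points into x.
    back_door = [p for p in simple_paths(g, x, y) if p[1] in g[x]]
    return all(is_blocked(g, p, s) for p in back_door)


for s in (set(), {"Season"}, {"Rain"}):
    print(sorted(s), satisfies_back_door(GRAPH, "Sprinkler", "Wet", s))
```

In this graph the only back-door path is Sprinkler ← Season → Rain → Wet, so conditioning on either Season or Rain closes it, while the empty set does not.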

Causal properties of sufficient sets:

\[ P(Y=y|\operatorname{do}(X=x),S=s)=P(Y=y|X=x,S=s) \]

Hence

\[ P(Y=y|\operatorname{do}(X=x))=\sum_sP(Y=y|X=x,S=s)P(S=s) \]
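The adjustment formula uses only observational quantities, which a short computation can illustrate. This sketch assumes a sprinkler network (Season → Sprinkler, Season → Rain, Sprinkler → Wet, Rain → Wet) with made-up probabilities, and evaluates \(\sum_s P(\text{Wet}|\text{Sprinkler}=\text{on},S=s)P(S=s)\) for two candidate covariates. The function `adjust` and the table names are hypothetical.

```python
import itertools

# Hypothetical numbers, chosen only for illustration.
P_SEASON_WET = 0.4                                  # P(Season = wet)
P_SPRINKLER = {True: 0.1, False: 0.6}               # P(Sprinkler on | wet season?)
P_RAIN = {True: 0.7, False: 0.1}                    # P(Rain | wet season?)
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}  # P(Wet | Sprinkler, Rain)

NAMES = ("season", "sprinkler", "rain", "wet")


def bern(p, value):
    return p if value else 1.0 - p


def joint(season, sprinkler, rain, wet):
    """Observational joint pmf from the DAG factorisation."""
    return (bern(P_SEASON_WET, season) * bern(P_SPRINKLER[season], sprinkler)
            * bern(P_RAIN[season], rain) * bern(P_WET[(sprinkler, rain)], wet))


def prob(event):
    total = 0.0
    for vals in itertools.product((True, False), repeat=4):
        x = dict(zip(NAMES, vals))
        if all(x[k] == v for k, v in event.items()):
            total += joint(**x)
    return total


def adjust(cause, effect, covariate):
    """sum_s P(effect | cause, covariate=s) P(covariate=s), events set to True."""
    total = 0.0
    for s in (True, False):
        p_effect = (prob({effect: True, cause: True, covariate: s})
                    / prob({cause: True, covariate: s}))
        total += p_effect * prob({covariate: s})
    return total


naive = prob({"wet": True, "sprinkler": True}) / prob({"sprinkler": True})
via_season = adjust("sprinkler", "wet", "season")
via_rain = adjust("sprinkler", "wet", "rain")
print(f"naive conditional   = {naive:.4f}")
print(f"adjusted for season = {via_season:.4f}")
print(f"adjusted for rain   = {via_rain:.4f}")
```

Both covariates close the single back-door path in this graph, so the two adjusted values agree with each other and differ from the naive conditional probability.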

## 4 Examples

- \(i,j\) are individuals
- \(Z\) denotes observed traits
- \(X\) denotes latent traits
- \(Y\) denotes observed outcomes
- \(A\) is a network tie

\(X_i\) d-separates \(Y_i(t)\) from \(A_{ij}\). Since \(X_i\) is latent and unobserved, \(Y_i(t) \leftarrow X_i \rightarrow A_{ij}\) is a confounding path from \(Y_i(t)\) to \(A_{ij}\). Likewise \(Y_j(t-1)\leftarrow X_j \rightarrow A_{ij}\) is a confounding path from \(Y_j(t-1)\) to \(A_{ij}\). Thus, \(Y_i(t)\) and \(Y_j(t-1)\) are d-connected when conditioning on all the observed (boxed) variables […]. Hence the direct effect of \(Y_j(t-1)\) on \(Y_i(t)\) is not identifiable.

## 5 Recommended reading

People recommend Koller and Friedman (Koller and Friedman 2009), which includes many different flavours of DAG model and many different methods, but it didn’t suit me, being somehow too detailed and too non-specific at the same time.

Spirtes et al. (Spirtes, Glymour, and Scheines 2001) and Pearl (Pearl 2009a) are readable. See also Pearl’s edited highlights (Pearl 2009b). Lauritzen (Steffen L. Lauritzen 1996) is clear, but the constructions are long, detailed, and more general than here (partially directed graphs).

Lauritzen’s shorter introduction (Steffen L. Lauritzen 2000) is nice if you can get it; it is not overwhelming, and starts with a slightly more general formalism (DAGs as a special case of PDAGs, moral graphs everywhere). Murphy’s textbook (Murphy 2012) has a minimal introduction intermingled with some related models, with a more ML, “expert systems”-flavoured and more Bayesian formalism.

## 6 References

*Proceedings of the National Academy of Sciences*.

*Conditional Specification of Statistical Models*.

*Proceedings of the National Academy of Sciences*.

*AAAI*.

*arXiv:1507.03652 [Math, Stat]*.

*The Annals of Applied Statistics*.

*Mathematical Methods of Operations Research*.

*Annual Review of Statistics and Its Application*.

*Statistical Methods in Medical Research*.

*arXiv:1411.1557 [Stat]*.

*The Annals of Statistics*.

*Biometrika*.

*Handbook of Causal Analysis for Social Research*. Handbooks of Sociology and Social Research.

*arXiv:1405.1868 [Stat]*.

*American Journal of Sociology*.

*Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics*.

*Learning in Graphical Models*.

*The Handbook of Brain Theory and Neural Networks*.

*Handbook of Neural Networks and Brain Theory*.

*Journal of Machine Learning Research*.

*arXiv Preprint arXiv:1510.04740*.

*IJCAI*.

*Probabilistic Graphical Models : Principles and Techniques*.

*Graphical Models*. Oxford Statistical Science Series.

*Complex Stochastic Systems*.

*Journal of the Royal Statistical Society. Series B (Methodological)*.

*arXiv Preprint arXiv:1307.5636*.

*Nature Methods*.

*The Annals of Statistics*.

*Proceedings of the National Academy of Sciences*.

*New England Journal of Medicine*.

*Proceedings of the 24th International Conference on Machine Learning*.

*Machine learning: a probabilistic perspective*. Adaptive computation and machine learning series.

*Learning Bayesian Networks*.

*Social Networks*.

*Proceedings of the Second AAAI Conference on Artificial Intelligence*. AAAI’82.

*Artificial Intelligence*.

*Probabilistic reasoning in intelligent systems: networks of plausible inference*. The Morgan Kaufmann series in representation and reasoning.

*Statistics Surveys*.

*Causality: Models, Reasoning and Inference*.

*arXiv:1501.01332 [Stat]*.

*2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton)*.

*Statistical Science*.

*Use of Directed Acyclic Graphs*.

*arXiv:1607.06565 [Physics, Stat]*.

*Sociological Methods & Research*.

*The Journal of Machine Learning Research*.

*arXiv:1411.2127 [Stat]*.

*Proceedings of the Conference on Empirical Methods in Natural Language Processing*.

*Causation, Prediction, and Search*. Adaptive Computation and Machine Learning.

*Statistical Methods in Medical Research*.

*The Annals of Mathematical Statistics*.

*Exploring Artificial Intelligence in the New Millennium*.

*arXiv:1202.3775 [Cs, Stat]*.