Probabilistic graphical models

Judea Pearl performing graph surgery

The term graphical model in the context of statistics means a particular thing: a family of ways to relate inference in multivariate models in terms of Calculating marginal and conditional probabilities to graphs that describe how probabilities factorise, where graph here is in the sense of network theory, i.e. a collection of nodes connected by edges. Here are some of those ways:

Loeliger’s 2004 zoo of the predominant graphical models

It turns out that switching back and forth between these different formalisms makes some things easier to do and if you are luck also easier to understand. Within this area, there are several specialties and a lot of material. This is a landing page pointing to actual content.

Thematically, content to this theme is scattered across graphical models in inference, learning graphs from data, diagramming graphical models, learning causation from data plus graphs, quantum graphical models, and yet more pages.

Barber (2012)’s Taxonomy of graphical models

Introductory texts {##intro-texts}

Barber (2012) and Lauritzen (1996) are rigorous introductions. Murphy (2012) presents a minimal introduction intermixed with particular related methods, so takes you straight to applications, although personally I found that confusing. For use in causality, Pearl (2009) and Spirtes, Glymour, and Scheines (2001) are readable.

People recommend me Koller and Friedman (2009) which is probably the most comprehensive, but I found it too comprehensive, to the point it was hard to see the forest for the trees. Maybe better as thing to deepen your understanding when you already know what is going on.

What are plates?

Invented in Buntine (1994), the plate notation is how we introduce the notion of multiple variables with a regular, i.e. conditionally independent relation to existing variables. These are extremely important if you want to observe more than one data point. AFAICT, really digging deep into data as just another node is what makes Koller and Friedman (2009) a classic text book. But it is really glossed over in lots of papers, especially early ones, where the problem of estimating parameters from data is glossed over.

Concretely, consider Dustin Tran’s example of the kind of dimensionality we are working with:

Dustin Tran: A hierarchical model, with latent variables \(\alpha_k\) defined locally per group and latent variables \(\phi\) defined globally to be shared across groups.

We are motivated by hierarchical models[…]. Formally, let \(y_{n k}\) be the \(n^{t h}\) data point in group \(k\), with a total of \(N_{k}\) data points in group \(k\) and \(K\) many groups. We model the data using local latent variables \(\alpha_{k}\) associated to a group \(k\), and using global latent variables \(\phi\) which are shared across groups.[…] (Figure) The posterior distribution of local variables \(\alpha_{k}\) and global variables \(\phi\) is \[ p(\alpha, \phi \mid \mathbf{y}) \propto p(\phi \mid \mathbf{y}) \prod_{k=1}^{K}\left[p\left(\alpha_{k} \mid \beta\right) \prod_{n=1}^{N_{K}} p\left(y_{n k} \mid \alpha_{k}, \phi\right)\right] \] The benefit of distributed updates over the independent factors is immediate. For example, suppose the data consists of 1,000 data points per group (with 5,000 groups); we model it with 2 latent variables per group and 20 global latent variables.

How many dimensions are we integrating over now? 10,020. The number of intermediate dimensions in our data grows very rapidly as we add observations, even if the final marginal is of low dimension. However, some of those dimensions are independent of others, and so may be factored away.

Directed graphs

Graphs of conditional, directed independence are a convenient formalism for many models. These are also called Bayes nets presumably because the relationships encoded in these graphs have utility in the automatic application of Bayes rule. These models have a natural interpretation in terms of causation and structural models. See directed graphical models.

Undirected, a.k.a. Markov graphs

a.k.a Markov random fields, Markov random networks… These have a natural interpretation in terms of energy-based models. See undirected graphical models.

Factor graphs

A unifying formalism for the directed and undirected graphical models. Simpler in some ways, harder in others. See factor graphs.

Inference on

A key use of this graphical structure is that it can make inference local, in that you can have different compute nodes, which examine part of the data/part of the model and pass messages back and forth to do inference over the entire thing. It is easy to say this, but making practical an performant algorithms this way is… well, it is a whole field. See graphical models in inference,


All of the probabilistic programming languages end up needing to account for graphical model structure in practice, so maybe start there.


Barber, David. 2012. Bayesian Reasoning and Machine Learning. Cambridge ; New York: Cambridge University Press.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Buntine, W. L. 1994. Operations for Learning with Graphical Models.” Journal of Artificial Intelligence Research 2 (1): 159–225.
Charniak, Eugene. 1991. “Bayesian Networks Without Tears.” AI Magazine 12 (4): 50.
Da Costa, Lancelot, Karl Friston, Conor Heins, and Grigorios A. Pavliotis. 2021. Bayesian Mechanics for Stationary Processes.” arXiv:2106.13830 [Math-Ph, Physics:nlin, q-Bio], June.
Dawid, A. Philip. 1979. Conditional Independence in Statistical Theory.” Journal of the Royal Statistical Society. Series B (Methodological) 41 (1): 1–31.
———. 1980. Conditional Independence for Statistical Operations.” The Annals of Statistics 8 (3): 598–617.
Jordan, Michael Irwin. 1999. Learning in Graphical Models. Cambridge, Mass.: MIT Press.
Koller, Daphne, and Nir Friedman. 2009. Probabilistic Graphical Models : Principles and Techniques. Cambridge, MA: MIT Press.
Lauritzen, Steffen L. 1996. Graphical Models. Oxford Statistical Science Series. Clarendon Press.
Levine, Sergey. 2018. Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv:1805.00909 [Cs, Stat], May.
Montanari, Andrea. 2011. Lecture Notes for Stat 375 Inference in Graphical Models.”
Murphy, Kevin P. 2012. Machine learning: a probabilistic perspective. 1 edition. Adaptive computation and machine learning series. Cambridge, MA: MIT Press.
Obermeyer, Fritz, Eli Bingham, Martin Jankowiak, Du Phan, and Jonathan P. Chen. 2020. Functional Tensors for Probabilistic Programming.” arXiv:1910.10775 [Cs, Stat], March.
Pearl, Judea. 2008. Probabilistic reasoning in intelligent systems: networks of plausible inference. Rev. 2. print., 12. [Dr.]. The Morgan Kaufmann series in representation and reasoning. San Francisco, Calif: Kaufmann.
———. 2009. Causality: Models, Reasoning and Inference. Cambridge University Press.
Pearl, Judea, Dan Geiger, and Thomas Verma. 1989. Conditional Independence and Its Representations.” Kybernetika 25 (7): 33–44.
Sadeghi, Kayvan. 2020. On Finite Exchangeability and Conditional Independence.” Electronic Journal of Statistics 14 (2): 2773–97.
Spirtes, Peter, Clark Glymour, and Richard Scheines. 2001. Causation, Prediction, and Search. Second Edition. Adaptive Computation and Machine Learning. The MIT Press.
Wainwright, Martin J., and Michael I. Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Vol. 1. Foundations and Trends® in Machine Learning. Now Publishers.
Wainwright, Martin, and Michael I Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing. Vol. 155. MIT Press.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.