Learning graphical models from data

Also, causal discovery, structure discovery

Learning the independence graph structure from data in a graphical model. A particular sparse model selection problem where the model is hierarchical, or possibly even non-directional. For bonus sophistication we might even try to learn causal effects from data.

There are a few ways we can learn graphical models. The obvious one, to my mind, is to use a Bayesian network to learn the structure. conditional independence test, an awareness of multiple testing and graph theory. But also Bayesian sampling from possible graph structures is a thing apparently. There are other approaches too.

Bayesian learning of Bayesian networks

I am indebted to Dario Draca for introducing this field to me. Her recommended references are:

I think the introduction to this paper gives an accessible description of the Bayesian approach to Bayesian network structure learning, including how to define parameter priors for discrete and Gaussian networks (the former case being potentially relevant to your signal detection/classification problem).

If you are particularly interested, the introduction to this paper also gives some more information on computing marginal likelihoods (i.e. likelihood of data conditional only on the graph structure) for discrete Bayesian networks.

This one gives a rigorous derivation of the marginal graph likelihood for Gaussian networks:

Guido Consonni, Objective Bayes Model Selection of Gaussian Essential Graphs with Observational and Interventional Data.

Graphical models based on Directed Acyclic Graphs (DAGs) represent a powerful tool for investigating dependencies among variables. It is well known that one cannot distinguish between DAGs encoding the same set of conditional independencies (Markov equivalent DAGs) using only observational data. However, the space of all DAGs can be partitioned into Markov equivalence classes, each being represented by a unique Essential Graph (EG), also called Completed Partially Directed Graph (CPDAG). In some fields, in particular genomics, one can have both observational and interventional data, the latter being produced after an exogenous perturbation of some variables in the system, or from randomized intervention experiments. Interventions destroy the original causal structure, and modify the Markov property of the underlying DAG, leading to a finer partition of DAGs into equivalence classes, each one being represented by an Interventional Essential Graph (I-EG) (Hauser and Buehlmann). In this talk we consider Bayesian model selection of EGs under the assumption that the variables are jointly Gaussian. In particular, we adopt an objective Bayes approach, based on the notion of fractional Bayes factor, and obtain a closed form expression for the marginal likelihood of an EG. Next we construct a Markov chain to explore the EG space under a sparsity constraint, and propose an MCMC algorithm to approximate the posterior distribution over the space of EGs. Our methodology, which we name Objective Bayes Essential graph Search (OBES), allows to evaluate the inferential uncertainty associated to any features of interest, for instance the posterior probability of edge inclusion. An extension of OBES to deal simultaneously with observational and interventional data is also presented: this involves suitable modifications of the likelihood and prior, as well as of the MCMC algorithm.

Classic methods using independence tests on graphs

Many. In my time in the lectures of Marloes Maathuis I learnt some of the theory, but TBH everything has fallen out my head now, since I have not used them in practice. Notable works are Colombo and Maathuis (2014); Drton and Maathuis (2017); Heinze-Deml, Maathuis, and Meinshausen (2018); Maathuis, Kalisch, and Bühlmann (2009); Maathuis et al. (2010).

Most of these seem to boil down to the cases where belief propagation is well-behaved, to wit, linear-Gaussian and discrete RVs. In these, dependence is simple (in the Gaussian case, essentially, two things are independence if their shared entry in the cross-precision matrix is zero). If we can find a way of estimating actual zeros in that matrix, then what? Often we want causal distributions, so the question remains: can we convert the implied undirected graph into a directed one? tl;dr: sometimes. This is one of those cases where the Bayesian method comes out much cleaner; averaging over possible graphs is a natural way of thinking about this.

For a less causal/intervention focused method, see Nonparanormal skeptic (🏗) which combines semiparametric regression with non-parametric graph inference.

Researcher inferring optimal graph surgery

Learning by continuous optimization

A new trick in the arsenal from those neural network nerds. Xun Zheng, Bryon Aragam and Chen Dan in their blog post Learning DAGs with Continuous Optimization introduce NOTEARS. This is an interesting bit of work AFAICT. Download from xunzheng/notears, and read the papers :

Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: We formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.

Key insight:

…that the k-th power of the adjacency matrix of a graph counts the number of k-step paths from one node to another. In other words, if the diagonal of the matrix power turns out to be all zeros, there [are] no k-step cycles in the graph. Then to characterize acyclicity, we just need to set this constraint for all k=1,2,…,d, eliminating cycles of all possible length.

Has this gone anywhere? TODO: read Zheng et al. (2020).

A different appproach is Lorch et al. (2021):

Continuous characterization of acyclic graphs Orthogonal to the work on Bayesian inference, Zheng et al. (2018); have recently proposed a differentiable characterization of acyclic graphs for structure learning. In this work, we adopt the formulation of Yu et al. (2019), who show that a graph with adjacency matrix $$\mathbf{G} \in\{0,1\}^{d \times d}$$ does not have any cycles if and only if $$h(\mathbf{G})=0$$, where $h(\mathbf{G}):=\operatorname{tr}\left[\left(\mathbf{I}+\frac{1}{d} \mathbf{G}\right)^{d}\right]-d .$

Tools

DIBS

larslorch/dibs: Joint Bayesian inference of graph and parameters of general Bayes nets

This is the Python JAX implementation for DiBS (Lorch et al., 2021), a fully differentiable method for joint Bayesian inference of the DAG and parameters of general, causal Bayesian networks.

In this implementation, DiBS inference is performed with the particle variational inference method SVGD . Since DiBS and SVGD operate on continuous tensors and solely rely on Monte Carlo estimation and gradient ascent-like updates, the inference code leverages efficient vectorized operations, automatic differentiation, just-in-time compilation, and hardware acceleration, fully implemented with JAX.

Their code example is impressive:

In this example, we use DiBS to generate 10 DAG and parameter samples from the joint posterior over Gaussian Bayes nets with means modeled by neural networks.

from dibs.inference import JointDiBS
from dibs.target import make_nonlinear_gaussian_model
import jax.random as random
key = random.PRNGKey(0)

# simulate some data
key, subk = random.split(key)
data, model = make_nonlinear_gaussian_model(key=subk, n_vars=20)

# sample 10 DAG and parameter particles from the joint posterior
dibs = JointDiBS(x=data.x, inference_model=model)
key, subk = random.split(key)
gs, thetas = dibs.sample(key=subk, n_particles=10, steps=1000)

In the above, the keyword argument x for JointDiBS is a matrix of shape [N, d] and could be any real-world data set.

Cripes.

bnlearn

bnlearn learns belief networks, i.e. directed graphical models. It is restricted to the classic exact belief propgation case, i.e. multinomial or gaussian models.

bnlearn implements the following constraint-based structure learning algorithms:

• PC (the stable version);
• Grow-Shrink (GS);
• Incremental Association Markov Blanket (IAMB);
• Fast Incremental Association (Fast-IAMB);
• Interleaved Incremental Association (Inter-IAMB);
• Incremental Association with FDR Correction (IAMB-FDR);
• Max-Min Parents & Children (MMPC);
• Semi-Interleaved Hiton-PC (SI-HITON-PC);
• Hybrid Parents & Children (HPC);

the following score-based structure learning algorithms:

• Hill Climbing (HC);
• Tabu Search (Tabu);

the following hybrid structure learning algorithms:

• Max-Min Hill Climbing (MMHC);
• Hybrid HPC (H2PC);
• General 2-Phase Restricted Maximization (RSMAX2);

the following local discovery algorithms:

• Chow-Liu;
• ARACNE;

and the following Bayesian network classifiers:

• naive Bayes;
• Tree-Augmented naive Bayes (TAN).

Discrete (multinomial) and continuous (multivariate normal) data sets are supported, both for structure and parameter learning. The latter can be performed using either maximum likelihood or Bayesian estimators.

sparsebn

Compare with sparsebn:

A new R package for learning sparse Bayesian networks and other graphical models from high-dimensional data via sparse regularization. Designed from the ground up to handle:

• Experimental data with interventions
• Mixed observational / experimental data
• High-dimensional data with p >> n
• Datasets with thousands of variables (tested up to p=8000)
• Continuous and discrete data

The emphasis of this package is scalability and statistical consistency on high-dimensional datasets. […] For more details on this package, including worked examples and the methodological background, please see our new preprint.

Overview

The main methods for learning graphical models are:

• estimate.dag for directed acyclic graphs (Bayesian networks).
• estimate.precision for undirected graphs (Markov random fields).
• estimate.covariance for covariance matrices.

Currently, estimation of precision and covariances matrices is limited to Gaussian data.

caus2e

MLResearchAtOSRAM/cause2e: The cause2e package provides tools for performing an end-to-end causal analysis of your data.

The main contribution of cause2e is the integration of two established causal packages that have currently been separated and cumbersome to combine:

• Causal discovery methods from the py-causal package, which is a Python wrapper around parts of the Java TETRAD software. It provides many algorithms for learning the causal graph from data and domain knowledge.
• Causal reasoning methods from the DoWhy package, which is the current standard for the steps of a causal analysis starting from a known causal graph and data

TETRAD (source, tutorial) is a tool for discovering and visualising and calculating giant empirical DAGs, including general graphical inference and causality. It’s written by eminent causality inference people.

Tetrad is a program which creates, simulates data from, estimates, tests, predicts with, and searches for causal and statistical models. The aim of the program is to provide sophisticated methods in a friendly interface requiring very little statistical sophistication of the user and no programming knowledge. It is not intended to replace flexible statistical programming systems such as Matlab, Splus or R. Tetrad is freeware that performs many of the functions in commercial programs such as Netica, Hugin, LISREL, EQS and other programs, and many discovery functions these commercial programs do not perform. …

The Tetrad programs describe causal models in three distinct parts or stages: a picture, representing a directed graph specifying hypothetical causal relations among the variables; a specification of the family of probability distributions and kinds of parameters associated with the graphical model; and a specification of the numerical values of those parameters.

py-causal is a wrapper around this for python, and R-causal for R.

skgmm

skggm (python) does the Gaussian thing but also has a nice sparsification and good explanation.

The core estimator provided in skggm is QuicGraphLasso which is a scikit-learn compatible interface to QUIC, a proximal Newton-type algorithm that solves the graphical lasso (2) objective.

References

Azadkia, Mona, and Sourav Chatterjee. 2019. arXiv:1910.12327 [Cs, Math, Stat], December.
Bayati, M., and A. Montanari. 2012. IEEE Transactions on Information Theory 58 (4): 1997–2017.
Besserve, Michel, Arash Mehrjou, Rémy Sun, and Bernhard Schölkopf. 2019. In arXiv:1812.03253 [Cs, Stat].
Bühlmann, Peter, and Sara van de Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. 2011 edition. Heidelberg ; New York: Springer.
Buntine, W.L. 1996. IEEE Transactions on Knowledge and Data Engineering 8 (2): 195–210.
Cai, T. Tony. 2017. Annual Review of Statistics and Its Application 4 (1): 423–46.
Chau, Siu Lun, Jean-François Ton, Javier González, Yee Whye Teh, and Dino Sejdinovic. 2021. June.
Colombo, Diego, and Marloes H. Maathuis. 2014. “Order-Independent Constraint-Based Causal Structure Learning.” J. Mach. Learn. Res. 15 (1): 3741–82.
Colombo, Diego, Marloes H. Maathuis, Markus Kalisch, and Thomas S. Richardson. 2012. The Annals of Statistics 40 (1): 294–321.
Cox, D. R., and H. S. Battey. 2017. Proceedings of the National Academy of Sciences 114 (32): 8592–95.
Dezfouli, Amir, Edwin V Bonilla, and Richard Nock. 2018. “Variational Network Inference: Strong and Stable with Concrete Support.” In, 10.
Drton, Mathias, and Marloes H. Maathuis. 2017. Annual Review of Statistics and Its Application 4 (1): 365–93.
Foygel, Rina, and Mathias Drton. 2010. In Advances in Neural Information Processing Systems 23, edited by J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, 604–12. Curran Associates, Inc.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. Biostatistics 9 (3): 432–41.
Fu, Fei, and Qing Zhou. 2013. Journal of the American Statistical Association 108 (501): 288–300.
Gao, Ming, Yi Ding, and Bryon Aragam. 2020. arXiv:2006.11970 [Cs, Math, Stat], November.
Geer, Sara van de. 2014. In arXiv:1403.7023 [Math, Stat]. Vol. 131.
Geng, Zhi, Yue Liu, Chunchen Liu, and Wang Miao. 2019. Annual Review of Statistics and Its Application 6 (1): 103–24.
Gnecco, Nicola, Nicolai Meinshausen, Jonas Peters, and Sebastian Engelke. 2021. The Annals of Statistics 49 (3): 1755–78.
Gogate, Vibhav, William Webb, and Pedro Domingos. 2010. In Advances in Neural Information Processing Systems, 748–56.
Gu, Jiaying, and Qing Zhou. 2020. Journal of Machine Learning Research 21 (158): 1–31.
Hallac, David, Jure Leskovec, and Stephen Boyd. 2015. arXiv:1507.00280 [Cs, Math, Stat], July.
Harris, Naftali, and Mathias Drton. 2013. Journal of Machine Learning Research 14 (1): 3365–83.
Heinze-Deml, Christina, Marloes H. Maathuis, and Nicolai Meinshausen. 2018. Annual Review of Statistics and Its Application 5 (1): 371–91.
Hinton, Geoffrey E., Simon Osindero, and Kejie Bao. 2005. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 128–35. Citeseer.
Hsieh, Cho-Jui, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep D. Ravikumar. 2014. Journal of Machine Learning Research 15 (1): 2911–47.
Huang, Biwei, Kun Zhang, Jiji Zhang, Joseph Ramsey, Ruben Sanchez-Romero, Clark Glymour, and Bernhard Schölkopf. 2020. Journal of Machine Learning Research 21 (89): 1–53.
Janzing, Dominik, Joris Mooij, Kun Zhang, Jan Lemeire, Jakob Zscheischler, Povilas Daniušis, Bastian Steudel, and Bernhard Schölkopf. 2012. Artificial Intelligence 182-183 (May): 1–31.
Jin, Ying, Weilin Fu, Jian Kang, Jiadong Guo, and Jian Guo. 2020. arXiv:1910.08892 [Stat], January.
Jung, Alexander, Nguyen Tran Quang, and Alexandru Mara. 2017. arXiv:1704.02107 [Stat], April.
Khoshgnauz, Ehsan. 2012. arXiv:1206.6361 [Cs, Stat], June.
Kocaoglu, Murat, Alex Dimakis, and Sriram Vishwanath. 2017. In PMLR, 1875–84.
Kocaoglu, Murat, Christopher Snyder, Alexandros G. Dimakis, and Sriram Vishwanath. 2017. arXiv:1709.02023 [Cs, Math, Stat], September.
Krämer, Nicole, Juliane Schäfer, and Anne-Laure Boulesteix. 2009. BMC Bioinformatics 10 (1): 384.
Kuipers, Jack, Giusi Moffa, and David Heckerman. 2014a. The Annals of Statistics, 9.
———. 2014b. The Annals of Statistics 42 (4).
Lederer, Johannes. 2016. arXiv:1609.05551 [Math, Stat], September.
Lee, Su-In, Varun Ganapathi, and Daphne Koller. 2006. In Advances in Neural Information Processing Systems, 817–24. MIT Press.
Li, Yunzhu, Antonio Torralba, Animashree Anandkumar, Dieter Fox, and Animesh Garg. 2020. arXiv:2007.00631 [Cs, Stat], July.
Liu, Han, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. 2012. arXiv:1206.6488 [Cs, Stat], June.
Liu, Han, John Lafferty, and Larry Wasserman. 2009. Journal of Machine Learning Research 10 (December): 2295–2328.
Liu, Qiang, and Dilin Wang. 2019. In Advances In Neural Information Processing Systems.
Locatello, Francesco, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. 2019. arXiv:1811.12359 [Cs, Stat], June.
Lorch, Lars, Jonas Rothfuss, Bernhard Schölkopf, and Andreas Krause. 2021. In.
Maathuis, Marloes H., Diego Colombo, Markus Kalisch, and Peter Bühlmann. 2010. Nature Methods 7 (4): 247–48.
Maathuis, Marloes H., Markus Kalisch, and Peter Bühlmann. 2009. The Annals of Statistics 37 (6A): 3133–64.
Mansinghka, Vikash, Charles Kemp, Thomas Griffiths, and Joshua Tenenbaum. 2012. arXiv:1206.6852, June.
Mazumder, Rahul, and Trevor Hastie. 2012. Electronic Journal of Statistics 6 (November): 2125–49.
Montanari, Andrea. 2012. Compressed Sensing: Theory and Applications, 394–438.
Mooij, Joris M., Jonas Peters, Dominik Janzing, Jakob Zscheischler, and Bernhard Schölkopf. 2016. Journal of Machine Learning Research 17 (32): 1–102.
Nair, Suraj, Yuke Zhu, Silvio Savarese, and Li Fei-Fei. 2019. arXiv:1910.01751 [Cs, Stat], October.
Narendra, Tanmayee, Anush Sankaran, Deepak Vijaykeerthy, and Senthil Mani. 2018. arXiv:1811.04376 [Cs, Stat], November.
Nauta, Meike, Doina Bucur, and Christin Seifert. 2019. Machine Learning and Knowledge Extraction 1 (1): 312–40.
Neapolitan, Richard E. 2003. Learning Bayesian Networks. Vol. 38. Prentice Hal, Paperback.
Ng, Ignavier, Zhuangyan Fang, Shengyu Zhu, Zhitang Chen, and Jun Wang. 2020. arXiv:1910.08527 [Cs, Stat], February.
Ng, Ignavier, Shengyu Zhu, Zhitang Chen, and Zhuangyan Fang. 2019. In Advances In Neural Information Processing Systems.
Obermeyer, Fritz, Eli Bingham, Martin Jankowiak, Du Phan, and Jonathan P. Chen. 2020. arXiv:1910.10775 [Cs, Stat], March.
Peters, Jonas, Joris Mooij, Dominik Janzing, and Bernhard Schoelkopf. 2012. arXiv:1202.3757 [Cs, Stat], February.
Ramsey, Joseph, Madelyn Glymour, Ruben Sanchez-Romero, and Clark Glymour. 2017. International Journal of Data Science and Analytics 3 (2): 121–29.
Schelldorfer, Jürg, Lukas Meier, and Peter Bühlmann. 2014. Journal of Computational and Graphical Statistics 23 (2): 460–77.
Scutari, Marco. 2018. arXiv:1708.00689 [Math, Stat], March.
Scutari, Marco, Catharina Elisabeth Graafland, and José Manuel Gutiérrez. 2019. arXiv:1805.11908 [Stat], July.
Textor, Johannes, Alexander Idelberger, and Maciej Liśkiewicz. 2015. arXiv:1508.00280 [Cs], August.
Wu, Rui, R. Srikant, and Jian Ni. 2012. “Learning Graph Structures in Discrete Markov Random Fields.” In INFOCOM Workshops, 214–19.
Yang, Mengyue, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. 2020. arXiv:2004.08697 [Cs, Stat], July.
Yu, Yue, Jie Chen, Tian Gao, and Mo Yu. 2019. Proceedings of the 36th International Conference on Machine Learning. PMLR.
Zhao, Tuo, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. 2012. Journal of Machine Learning Research : JMLR 13 (April): 1059–62.
Zheng, Xun, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. 2018. In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 9472–83. Curran Associates, Inc.
Zheng, Xun, Chen Dan, Bryon Aragam, Pradeep Ravikumar, and Eric Xing. 2020. In International Conference on Artificial Intelligence and Statistics, 3414–25. PMLR.
Zhou, Mingyuan, Yulai Cong, and Bo Chen. 2017. “Augmentable Gamma Belief Networks,” 44.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.