Graphical model / machine learning decoder ring

2024-01-18 — 2025-04-16

algebra

graphical models

hidden variables

hierarchical models

how do science

machine learning

networks

neural nets

probability

statistics

when to compute

I’m thinking something through for myself. Details are absent right now. Twin to Causality+ML, perhaps.

Various ways of partitioning a vector of observed and unobserved variates, and their relationships to graphical models and the ML distinction of supervised/unsupervised learning, and the generalisations we might want to make these go.

1 What is supervised learning?

We look to Bayesian inference to solve problems with the following structure: I want to infer the probability of some noisy label \(Y_*\) given some predictor \(X_*\). Since this is data-driven learning, I suppose that I want to work out how to do this from a dataset of labelled examples, \(\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N\). So in fact, I want to know \(p(y_*|x_*,\mathcal{D})\).

We are using lazy Bayesian density notation where \(p(z)\) is the density of some RV \(Z\). I feel ambivalent about this notation, but it does get the job done and soothes nervous ML conference reviewers. In the name of picking my battles, I run with it. We can throw out the densities later and go to nice clean unambiguous measures, but this helps exposition.

How do I compute this equation-in-densities? Let’s suppose that by some miracle it happens to be true that we know the generating process \(Y= G(X, W, \xi)\), where \(\xi\) is some unobserved noise, and there is a finite vector of parameters \(W\).

I hope to find good values for \(W\) such that \(p(y_*|x_*,W,\mathcal{D})\) is a good model for the data, in that if we take our model out into the world and show it lots of different values for \(x_*=X_*\) we will observe that the implied \(p(y_*|X_*=x_*,W,\mathcal{D})\), pumped through \(G\), describes the distribution of \(Y_*|X_*=x_*\).

How do we find \(W\)? There is the graphical model formalism which (details to come, trust me) tells us how to infer stuff about \(W\) from \(\mathcal{D}\): Suppose \(N=3\). Then the following graphical model describes our task:

The circled \(W\) node is what we are trying to get “correct”. By looking at those \(X_i,Y_i\) pairs we refine our estimate of \(W\). I am leaving out the details of what “refining” the estimate means; we will come back to that.

OK, we talked about refining \(W\). What does that look like in practice? In the Bayes setting, we condition a prior \(p(w)\) on the observed data \(\mathcal{D}\), to obtain the posterior distribution \(p(w|\mathcal{D})=p(w|X_1, X_2, X_3, Y_1, Y_2, Y_3)\). I am pretty sure we can alternatively do this as a frequentist thing, but TBH I am just not smart enough; the phrasing gets very weird, and we need to use all kinds of circumlocutions so it is not very clear, at least for my modest-sized brain what the hell is going on.

According to Bayes’ rule, we can write

\[ p(w \mid x_1, x_2, x_3, y_1, y_2, y_3) = \frac{p(y_1, y_2, y_3 \mid x_1, x_2, x_3, w) \, p(w)}{p(y_1, y_2, y_3 \mid x_1, x_2, x_3)} \]

The numerator can be further factorized due to the conditional independencies implied by the graphical model:

\[ p(y_1, y_2, y_3 \mid x_1, x_2, x_3, w) \, p(w) = p(y_1 \mid x_1, w) \, p(y_2 \mid x_2, w) \, p(y_3 \mid x_3, w) \, p(w) \]

The denominator, which is the marginal likelihood, can be obtained by integrating out \(W\):

\[ p(y_1, y_2, y_3 \mid x_1, x_2, x_3) = \int p(y_1 \mid x_1, w) \, p(y_2 \mid x_2, w) \, p(y_3 \mid x_3, w) \, p(w) \, dw \]

Putting it all together, the posterior distribution of \(W\) is:

\[ \begin{aligned} p(w \mid x_1, x_2, x_3, y_1, y_2, y_3) &= \frac{p(y_1 \mid x_1, w) \, p(y_2 \mid x_2, w) \, p(y_3 \mid x_3, w) \, p(w)}{\int p(y_1 \mid x_1, w) \, p(y_2 \mid x_2, w) \, p(y_3 \mid x_3, w) \, p(w) \, dw}\\ &= \frac{p(y_1 \mid x_1, w) \, p(y_2 \mid x_2, w) \, p(y_3 \mid x_3, w) \, p(w)}{\int p(y_1 \mid x_1, w) \, p(y_2 \mid x_2, w) \, p(y_3 \mid x_3, w) \, p(w) \, dw} \end{aligned} \]

This equation represents the posterior distribution of \(W\) in terms of the densities of the observed and latent variables, conditioned on the observed data.

This is misleading, since I said before we wanted a \(W\) that gave us good predictions for \(Y_*|X_*\). Maybe \(W\) is just a nuisance parameter and we don’t really care about it as long as we get good predictions for \(Y_*\). Maybe we actually want to solve a problem like this, and that \(W\) is a nuisance variable:

In ML, that is usually what we are doing; \(W\) is just some parameters rather than being something with intrinsic, semi-universal meaning, like the speed of light or the boiling point of lead in a vacuum. Our true target is to get this \(p(y_*|x_*,\mathcal{D})\), which we give the special name, posterior predictive distribution.

Let us replay all that conditioning mathematics, but make the target \(p(y_*|x_*,\mathcal{D})\), the posterior \(Y_*\) given \(X_*\), and the observed data \(X_1, Y_1, \ldots,\), i.e.

\[ p(y_* \mid x_*, w, x_1, y_1, \ldots, ) \]

Since information about \(Y_i\) and \(X_i\) about \(Y_*\) is mediated wholly through \(W\), we might like to think about that calculation in terms of a posterior \(W\),

\[ p(y_* \mid x_*, (W|\mathcal{D}))=:p(y_* \mid x_*, x_1, y_1, \ldots) \]

Hold on to that thought, because the idea of dealing with “conditioned” variables turns out to take us somewhere interesting.

If we want to integrate out \(W\) to get only the marginal posterior predictive distribution, we do this:

\[ p(y_* \mid x_*, x_1, y_1, \ldots) = \int p(y_* \mid x_*, w) \, p(w \mid x_1, y_1, \ldots) \, dw \]

Where \(p(w \mid x_1, y_1, \ldots, )\) is the posterior distribution of \(W\) given the observed data, which can be computed using Bayes’ rule as we did above.

I’m getting tired of writing out the indexed \(X_i,Y_i\) pairs, so let’s just write \(\mathcal{D}\) for the observed data, and \(X_*,Y_*\) for the unobserved data, and use plate notation to write a graph that automatically indexes over these variables:

OK, now in a Bayes setting we talk about the marginal of interest. The red line denotes the marginal of interest.

2 What is unsupervised learning?

As seen in unconditional generative modelling. Now we don’t assume that we will observe a special \(X_*\) and find \(Y_*\), but rather we want to know something about the distribution of both jointly

In fact, we are treating \(X\) and \(Y\) symmetrically now, so we might as well concatenate them both into \(X\):

3 What are inverse problems?

I observe an output. What is my “posterior retrodictive”? That is, can I estimate what caused the current observation, given my knowledge of similar problems?

This is a very common problem in science, and I do a bit of it myself. See inverse problems for a notebook on that, especially functional inverse problems.

4 Reality gaps

One problem that pops up often in AI for science is the so-called reality gap problem Figure 8. There are a few interpretations/definitions of the reality gap problem, but I’d argue most of the ones I have seen are mild variants of the one I give here.

The basic setting is as follows: We have plentiful data from a simulator which predicts output \(\widehat{Y}\) given input \(\widehat{X}\). We believe that if we can get a good model of the simulator, we can use it to predict the output \(Y_*\) given input \(X_*\). But we believe the simulator is miscalibrated to the world in some sense. Maybe it doesn’t include all the physics or there is some unobserved noise perturbing the “true” system. So we also want to add influence from the real data observations \(X_n,Y_n\). But of course, if it’s real data we don’t have noiseless observations of \(X_n,Y_n\). We have noisy observations \(\tilde{X}_n,\tilde{Y}_n\), because we had to measure that data in the real world, and we have some noise in our measurements. Can we use the noisy observations of the real data to help us learn a better approximation of the simulator, so it can also predict the output given noisy inputs?

This is all terribly abstract, and possibly better explained with a concrete use case:

Suppose I have a great simulator of the weather, but it’s too expensive to run. So I train a neural network to approximate that great simulator, and now it is cheaper to run, but I still don’t know the weather, because both the weather simulator and the neural network can only predict the weather perfectly if they know everything about the world; every little eddy and vortex in the atmosphere. And I cannot know that. But I can look at satellite photos and rain radar etc. These are noisy observations of the weather. Can I use those to make a noisy neural network that predicts the weather better? Boom! An emulation problem.

I solved one of these with GEnBP.

Figure 8: The reality gap. Given an emulator with parameters \(W\) trained off a combination of noisy observations \((\widehat{X}_n,\widehat{Y}_n)\) of ground truth \((X_n,Y_n)\) pairs, and synthetic observations \((\widehat{X}_n,\widehat{Y}_n)\), can we predict the value of output \(Y_*\) given noisy observations \(\tilde{X}_*\) of its input \(X_*\)?

5 Hierarchical models

Hierarchical models. We observe things only indirectly; what can we know about deeply hidden values Figure 9? These models are beloved of social scientists and econometricians.

Figure 9: One of many hierarchical models. I indirectly observe something of interest \(Y\), how can I learn about it? Estimating stuff like intelligence or criminality often ends up looking like this if you think about it.

6 Causally valid inference

TBD

7 Factor graphs

Causal models are great, but exact calculations using them are typically painful to calculate. Usually we make do with variational approximations to the true target. The natural graph type for these questions ends up being factor graphs which conveniently encode some useful approximate marginalisation algorithms.

The way that these work is we rewrite every conditional probability in the directed graph as a factor:

\[ p(x_1|x_2) \to f_{12}(x_1, x_2) \]

Now, we introduce a new set of nodes, the factors, which are the \(f_{ij}\), and we connect them to the implicated variable nodes.

Why? Everyone asks me why. My answer for now is that it comes out easier that way so suck it up. The factor graph is a representation of the calculations we need to do, not the interpretation that we make. I wish I had a deeper answer.

Here is the supervised \(W\) inference problem as a factor graph:

Here is that inverse problem (Figure 7) as a factor graph.

We can spend a lot of time circumlocuting our concerns about when we can actually update over such a graph, i.e. when we can actually do the calculations. The answer is complicated, but we are neural network people, so we’ll typically just YOLO it and see what happens.

7.1 BP as generalized conditioning

👷

7.2 BP as variational approximation

Our goal in variational inference is to find an equation in beliefs, which are approximate marginals, and hope that we can calculate them in such a way that they end up good approximations for true marginals.

We want to represent every quantity of interest in terms of marginals at each node, and then solve for those marginals. If we can find some fixed point iteration that is local in some sense, and promises convergence to something useful, then we can declare victory.

8 How about neural networks?

Great question.

9 To touch upon

Dense non-causal connections
re-factorisation
inference estimate
(approximate) independence and identifiability are simpler as variational approximation problems

10 Incoming

Reinforcement learning is supervised learning on optimized data

11 References

Akbayrak, Bocharov, and de Vries. 2021. “Extended Variational Message Passing for Automated Approximate Bayesian Inference.” Entropy.

Attias. 1999. “Inferring Parameters and Structure of Latent Variable Models by Variational Bayes.” In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. UAI’99.

Blei, Kucukelbir, and McAuliffe. 2017. “Variational Inference: A Review for Statisticians.” Journal of the American Statistical Association.

Cox, van de Laar, and de Vries. 2019. “A Factor Graph Approach to Automated Design of Bayesian Signal Processing Algorithms.” International Journal of Approximate Reasoning.

Dauwels. 2007. “On Variational Message Passing on Factor Graphs.” In 2007 IEEE International Symposium on Information Theory.

Jordan. 2004. “Graphical Models.” Statistical Science.

Jordan, Ghahramani, Jaakkola, et al. 1999. “An Introduction to Variational Methods for Graphical Models.” Machine Learning.

Kschischang, Frey, and Loeliger. 2001. “Factor Graphs and the Sum-Product Algorithm.” IEEE Transactions on Information Theory.

Kuck, Chakraborty, Tang, et al. 2020. “Belief Propagation Neural Networks.” arXiv:2007.00295 [Cs, Stat].

Li, and Turner. 2016. “Rényi Divergence Variational Inference.” In Advances in Neural Information Processing Systems.

Malioutov, Johnson, and Willsky. 2006. “Walk-Sums and Belief Propagation in Gaussian Graphical Models.” Journal of Machine Learning Research.

Minka. 2001. “Expectation Propagation for Approximate Bayesian Inference.” In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01.

Naesseth, Lindsten, and Schön. 2014. “Sequential Monte Carlo for Graphical Models.” In Advances in Neural Information Processing Systems.

Nguyen, and Bonilla. 2014. “Automated Variational Inference for Gaussian Process Models.” In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1. NIPS’14.

Ortiz, Evans, and Davison. 2021. “A Visual Introduction to Gaussian Belief Propagation.” arXiv:2107.02308 [Cs].

Pearl. 2009. Causality: Models, Reasoning and Inference.

Roychowdhury, and Kulis. 2015. “Gamma Processes, Stick-Breaking, and Variational Inference.” In Artificial Intelligence and Statistics.

Satorras, and Welling. 2021. “Neural Enhanced Belief Propagation on Factor Graphs.” In.

Wainwright, and Jordan. 2008. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning.

Wand. 2017. “Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing.” Journal of the American Statistical Association.

Welling, Minka, and Teh. 2012. “Structured Region Graphs: Morphing EP into GBP.” In Uncertainty in Artificial Intelligence.

Wiegerinck, and Heskes. 2002. “Fractional Belief Propagation.” In Advances in Neural Information Processing Systems.

Wingate, and Weber. 2013. “Automated Variational Inference in Probabilistic Programming.” arXiv:1301.1299 [Cs, Stat].

Winn, and Bishop. 2005. “Variational Message Passing.” In Journal of Machine Learning Research.

Xing, Jordan, and Russell. 2003. “A Generalized Mean Field Algorithm for Variational Inference in Exponential Families.” In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. UAI’03.

Yedidia, Jonathan S., Freeman, and Weiss. 2000. “Generalized Belief Propagation.” In Proceedings of the 13th International Conference on Neural Information Processing Systems. NIPS’00.

Yedidia, J.S., Freeman, and Weiss. 2003. “Understanding Belief Propagation and Its Generalizations.” In Exploring Artificial Intelligence in the New Millennium.

———. 2005. “Constructing Free-Energy Approximations and Generalized Belief Propagation Algorithms.” IEEE Transactions on Information Theory.

Zhou. 2017. “Message Passing for Graphical Models.”