I’m thinking something through for myself. Details are absent right now. Twin to Causality+ML, perhaps.
Various ways of partitioning a vector of observed and unobserved variates, their relationships to graphical models and to the ML distinction between supervised and unsupervised learning, and the generalisations we might want to make.
What is supervised learning?
We look to Bayesian inference to solve problems with the following structure: I want to infer the probability of some noisy label $y$ given some predictor $x$, i.e. $p(y \mid x)$. Since this is data-driven learning, I suppose that I want to work out how to do this from a dataset of labelled examples, $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$. So in fact, I want to know $p(y \mid x, \mathcal{D})$.
We are using lazy Bayesian density notation where $p(x)$ is the density of some RV $X$. I feel ambivalent about this notation, but it does get the job done and soothes nervous ML conference reviewers. In the name of picking my battles, I run with it. We can throw out the densities later and go to nice clean unambiguous measures, but this helps exposition.
How do I compute this equation-in-densities? Let’s suppose that by some miracle it happens to be true that we know the generating process $y = f(x, \varepsilon; \theta)$, where $\varepsilon$ is some unobserved noise and $\theta$ is a finite vector of parameters.
I hope to find good values for $\theta$ such that $f(\cdot, \cdot; \theta)$ is a good model for the data, in that if we take our model out into the world and show it lots of different values for $x$, we will observe that the implied noise $\varepsilon$, pumped through $f$, describes the distribution of $y$.
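To make that concrete, here is a minimal sketch of sampling from such a generating process, assuming (purely for illustration) a linear $f$ with additive Gaussian noise; none of the names or values here are canonical:

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x, eps, theta):
    """A toy generating process y = f(x, eps; theta): linear with additive noise."""
    slope, intercept = theta
    return slope * x + intercept + eps

theta_true = (2.0, -1.0)             # the "true" parameters we hope to recover
x = rng.uniform(-3, 3, size=100)     # predictors
eps = rng.normal(0, 0.5, size=100)   # unobserved noise
y = f(x, eps, theta_true)            # noisy labels: our dataset D = {(x_i, y_i)}
```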
How do we find $\theta$? There is the graphical model formalism which (details to come, trust me) tells us how to infer stuff about $\theta$ from $\mathcal{D}$. Suppose $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$. Then the following graphical model describes our task:
The circled node is what we are trying to get “correct”. By looking at those $(x_i, y_i)$ pairs we refine our estimate of $\theta$. I am leaving out the details of what “refining” the estimate means; we will come back to that.
OK, we talked about refining $\theta$. What does that look like in practice? In the Bayes setting, we condition a prior $p(\theta)$ on the observed data $\mathcal{D}$ to obtain the posterior distribution $p(\theta \mid \mathcal{D})$. I am pretty sure we can alternatively do this as a frequentist thing, but TBH I am just not smart enough; the phrasing gets very weird and we need all kinds of circumlocutions, so it is not very clear, at least to my modest-sized brain, what the hell is going on.
According to Bayes’ rule, we can write
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}.$$
The numerator can be further factorized due to the conditional independencies implied by the graphical model:
$$p(\mathcal{D} \mid \theta)\, p(\theta) = p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta)\, p(x_i).$$
The denominator, which is the marginal likelihood, can be obtained by integrating out $\theta$:
$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, \mathrm{d}\theta.$$
Putting it all together (the $p(x_i)$ factors cancel between numerator and denominator), the posterior distribution of $\theta$ is:
$$p(\theta \mid \mathcal{D}) = \frac{p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta)}{\int p(\theta') \prod_{i=1}^{n} p(y_i \mid x_i, \theta')\, \mathrm{d}\theta'}.$$
This equation represents the posterior distribution of $\theta$ in terms of the densities of the observed and latent variables, conditioned on the observed data.
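For a model simple enough to be conjugate, this whole pipeline collapses to a few lines. Here is a sketch for Bayesian linear regression with known noise variance, where the posterior over $\theta$ is Gaussian in closed form (all names and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from the toy linear process above
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x - 1.0 + rng.normal(0, 0.5, size=100)

# Conjugate Bayesian linear regression with known noise variance sigma2:
# the prior theta ~ N(0, tau2 * I) makes the posterior Gaussian in closed form.
sigma2, tau2 = 0.25, 10.0
X = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
prec_post = X.T @ X / sigma2 + np.eye(2) / tau2  # posterior precision
cov_post = np.linalg.inv(prec_post)
mean_post = cov_post @ X.T @ y / sigma2          # posterior mean for (slope, intercept)
print(mean_post)                                 # close to (2.0, -1.0)
```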
This is misleading, since I said before we wanted a $\theta$ that gave us good predictions for $y$. Maybe $\theta$ is just a nuisance parameter and we don’t really care about it as long as we get good predictions for $y$. Maybe we actually want to solve a problem like this, where $\theta$ is a nuisance variable:
In ML, that is usually what we are doing; $\theta$ is just some parameters rather than something with intrinsic, semi-universal meaning, like the speed of light or the boiling point of lead in a vacuum. Our true target is this $p(y^* \mid x^*, \mathcal{D})$, which we give the special name posterior predictive distribution.
Let us replay all that conditioning mathematics, but make the target $y^*$, the posterior given $x^*$ and the observed data $\mathcal{D}$, i.e.
$$p(y^* \mid x^*, \mathcal{D}).$$
Since information about $\mathcal{D}$ and about $x^*$ is mediated wholly through $\theta$, we might like to think about that calculation in terms of a posterior $p(\theta \mid \mathcal{D})$,
$$p(y^*, \theta \mid x^*, \mathcal{D}) = p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D}).$$
Hold on to that thought, because the idea of dealing with “conditioned” variables turns out to take us somewhere interesting.
If we want to integrate $\theta$ out to get only the marginal posterior predictive distribution, we do this:
$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta,$$
where $p(\theta \mid \mathcal{D})$ is the posterior distribution of $\theta$ given the observed data, which can be computed using Bayes’ rule as we did above.
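In practice that integral is usually approximated by Monte Carlo: draw $\theta_s \sim p(\theta \mid \mathcal{D})$, then $y^*_s \sim p(y^* \mid x^*, \theta_s)$; the $y^*_s$ are draws from the posterior predictive. A sketch, with stand-in values for the posterior computed above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the Gaussian posterior over theta from the previous sketch.
mean_post = np.array([2.0, -1.0])
cov_post = 0.01 * np.eye(2)
sigma2 = 0.25

# Monte Carlo posterior predictive: theta_s ~ p(theta | D), then
# y*_s ~ p(y* | x*, theta_s); the draws marginalise theta by simulation.
x_star = 1.5
thetas = rng.multivariate_normal(mean_post, cov_post, size=5000)
y_draws = thetas[:, 0] * x_star + thetas[:, 1] \
    + rng.normal(0, np.sqrt(sigma2), size=5000)
print(y_draws.mean(), y_draws.std())  # predictive mean and spread at x*
```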
I’m getting tired of writing out the indexed pairs, so let’s just write $\mathbf{x} = (x_1, \dots, x_n)$ and $\mathbf{y} = (y_1, \dots, y_n)$ for the observed data, and $x^*, y^*$ for the unobserved data, and use plate notation to write a graph that automatically indexes over these variables:
OK, now in a Bayes setting we talk about the marginal of interest; the red line in the graph denotes it.
What is unsupervised learning?
As seen in unconditional generative modelling. Now we don’t assume that we will observe a special $x$ and find $p(y \mid x)$, but rather we want to know something about the distribution of $x$ and $y$ jointly, $p(x, y)$.
In fact, we are treating $x$ and $y$ symmetrically now, so we might as well concatenate them both into one variable $z = (x, y)$:
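A toy illustration of the symmetric view: fit a joint Gaussian to $z$ and note that the supervised conditional $p(y \mid x)$ falls back out of the joint by the usual Gaussian conditioning formula. Values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Unsupervised view: stack x and y into one vector z and model the joint.
x = rng.uniform(-3, 3, size=500)
y = 2.0 * x - 1.0 + rng.normal(0, 0.5, size=500)
z = np.column_stack([x, y])            # z = (x, y), treated symmetrically

mu = z.mean(axis=0)                    # fit a joint Gaussian to z
Sigma = np.cov(z, rowvar=False)

# Any conditional, e.g. p(y | x), now falls out of the joint:
# E[y | x] = mu_y + Sigma_yx / Sigma_xx * (x - mu_x).
cond_slope = Sigma[1, 0] / Sigma[0, 0]
print(cond_slope)                      # ≈ 2.0, recovering the regression slope
```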
What are inverse problems?
I observe an output. What is my “posterior retrodictive”? That is, can I estimate what caused the current observation, given my knowledge of similar problems?
This is a very common problem in science, and I do a bit of it myself. See inverse problems for a notebook on that, especially functional inverse problems.
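A minimal sketch of a posterior retrodictive, assuming a known (and entirely hypothetical) forward map $g$ and Gaussian observation noise, computed on a grid; note that the posterior over causes can be multimodal:

```python
import numpy as np

rng = np.random.default_rng(3)

# Posterior retrodictive on a grid: given a noisy output y_obs, which input
# x produced it? Assume a known forward map g and known noise level.
def g(x):
    return x ** 3 - x   # non-injective forward map: several x per y

y_obs = 0.3
sigma = 0.2
x_grid = np.linspace(-2, 2, 2001)
prior = np.exp(-0.5 * x_grid ** 2)                       # N(0, 1) prior on x
like = np.exp(-0.5 * ((y_obs - g(x_grid)) / sigma) ** 2)  # Gaussian likelihood
post = prior * like
post /= post.sum() * (x_grid[1] - x_grid[0])             # normalise on the grid
# The posterior is multimodal: several causes explain the same observation.
print(x_grid[np.argmax(post)])
```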
Reality gaps
One problem that pops up often in AI for science is the so-called reality gap problem (Figure 8). There are a few interpretations/definitions of the reality gap problem, but I’d argue most of the ones I have seen are mild variants of the one I give here.
The basic setting is as follows: we have plentiful data from a simulator which predicts output $y$ given input $x$. We believe that if we can get a good model of the simulator, we can use it to predict the output $y$ given input $x$. But we believe the simulator is miscalibrated to the world in some sense: maybe it doesn’t include all the physics, or there is some unobserved noise perturbing the “true” system. So we also want to add influence from real data observations. But of course, if it’s real data we don’t have noiseless observations of $x$ and $y$; we have noisy observations $\tilde{x}$ and $\tilde{y}$, because we had to measure that data in the real world, and we have some noise in our measurements. Can we use the noisy observations of the real data to help us learn a better approximation of the simulator, so it can also predict the output given noisy inputs?
This is all terribly abstract, and possibly better explained with a concrete use case:
Suppose I have a great simulator of the weather, but it’s too expensive to run. So I train a neural network to approximate that great simulator, and now it is cheaper to run, but I still don’t know the weather, because both the weather simulator and the neural network can only predict the weather perfectly if they know everything about the world; every little eddy and vortex in the atmosphere. And I cannot know that. But I can look at satellite photos and rain radar etc. These are noisy observations of the weather. Can I use those to make a noisy neural network that predicts the weather better? Boom! An emulation problem.
I solved one of these with GEnBP.
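For flavour, here is a crude toy version of the reality-gap setup: fit an emulator to plentiful simulator data, then learn an additive correction from a handful of noisy real measurements. This is a stand-in for fancier methods (GEnBP included), not a description of any of them:

```python
import numpy as np

rng = np.random.default_rng(4)

# A toy reality gap: the simulator is a biased version of the real process.
def simulator(x):    # cheap, plentiful, slightly wrong
    return 1.8 * x

def reality(x):      # expensive, observed only through noise
    return 2.0 * x + 0.3

x_sim = rng.uniform(-3, 3, size=1000)
y_sim = simulator(x_sim)
x_real = rng.uniform(-3, 3, size=20)
y_real = reality(x_real) + rng.normal(0, 0.5, size=20)  # noisy real measurements

# One crude fix: fit the emulator to the simulator, then learn an additive
# correction from the few noisy real points.
A = np.column_stack([x_sim, np.ones_like(x_sim)])
w_emul = np.linalg.lstsq(A, y_sim, rcond=None)[0]
resid = y_real - (w_emul[0] * x_real + w_emul[1])
B = np.column_stack([x_real, np.ones_like(x_real)])
w_corr = np.linalg.lstsq(B, resid, rcond=None)[0]
print(w_emul + w_corr)   # ≈ (2.0, 0.3): emulator plus reality correction
```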
Hierarchical models
We observe things only indirectly; what can we know about deeply hidden values (Figure 9)? These models are beloved of social scientists and econometricians.
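A classic minimal example is the two-level Gaussian hierarchy, where partial pooling shrinks each group’s estimate toward the grand mean; with known variances the posterior means are available in closed form (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two-level hierarchy: group means mu_g ~ N(mu0, tau^2), observations
# y ~ N(mu_g, sigma^2). We see only y; mu_g and mu0 are "deeply hidden".
mu0_true, tau, sigma = 0.0, 1.0, 0.5
n_groups, n_per = 8, 5
mu_g = rng.normal(mu0_true, tau, size=n_groups)
y = mu_g[:, None] + rng.normal(0, sigma, size=(n_groups, n_per))

# Closed-form partial pooling (known variances): each group mean shrinks
# toward the grand mean in proportion to how noisy its own data are.
ybar = y.mean(axis=1)
grand = ybar.mean()
shrink = (sigma**2 / n_per) / (sigma**2 / n_per + tau**2)
mu_hat = shrink * grand + (1 - shrink) * ybar
print(np.c_[mu_g, mu_hat])   # hidden group means vs. shrunken estimates
```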
Causally valid inference
TBD
Factor graphs
Causal models are great, but exact calculations over them are typically painful. Usually we make do with variational approximations to the true target. The natural graph type for these questions ends up being the factor graph, which conveniently encodes some useful approximate marginalisation algorithms.
The way that these work is we rewrite every conditional probability in the directed graph as a factor:
$$p(x_1, \dots, x_n) = \prod_{i} p(x_i \mid \operatorname{pa}(x_i)) = \prod_{j} f_j(\mathbf{x}_{S_j}),$$
where $\mathbf{x}_{S_j}$ is the subset of variables that factor $f_j$ touches.
Now we introduce a new set of nodes, the factors, which are the $f_j$, and we connect them to the implicated variable nodes.
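As a data structure this is almost embarrassingly simple. Here is an illustrative (not any particular library’s) encoding of the supervised model $p(\theta)\,p(y \mid x, \theta)$ as two factors attached to their variables:

```python
import numpy as np

# A factor graph as plain data: variable names, factor nodes f_j, and the
# edges between them. The joint is just the product of the factors.
factors = {
    "f_prior": {"vars": ["theta"],
                "fn": lambda theta: np.exp(-0.5 * theta**2)},   # p(theta)
    "f_lik":   {"vars": ["y", "x", "theta"],
                "fn": lambda y, x, theta:
                      np.exp(-0.5 * (y - theta * x)**2)},       # p(y | x, theta)
}

def joint(y, x, theta):
    """Evaluate the (unnormalised) joint as a product of factors."""
    return factors["f_prior"]["fn"](theta) * factors["f_lik"]["fn"](y, x, theta)

print(joint(y=1.0, x=0.5, theta=2.0))
```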
Why? Everyone asks me why. My answer for now is that it comes out easier that way, so suck it up. The factor graph is a representation of the calculations we need to do, not of the interpretation that we make. I wish I had a deeper answer.
Here is the supervised inference problem as a factor graph:
Here is that inverse problem (Figure 7) as a factor graph:
We can spend a lot of time circumlocuting our concerns about when we can actually update over such a graph, i.e. when we can actually do the calculations. The answer is complicated, but we are neural network people, so we’ll typically just YOLO it and see what happens.
BP as generalized conditioning
👷
BP as variational approximation
Our goal in variational inference is to find an equation in beliefs, which are approximate marginals, and hope that we can calculate them in such a way that they end up good approximations for true marginals.
We want to represent every quantity of interest in terms of marginals at each node, and then solve for those marginals. If we can find some fixed point iteration that is local in some sense, and promises convergence to something useful, then we can declare victory.
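Here is the sum-product version of that programme on the smallest interesting example, a three-variable chain, where one forward and one backward sweep of messages recovers the exact marginals (a sketch, not production BP):

```python
import numpy as np

# Sum-product BP on a chain x1 - x2 - x3 with binary states. On a tree,
# one forward and one backward sweep gives exact marginals ("beliefs").
phi1 = np.array([0.6, 0.4])                 # unary factor on x1
psi12 = np.array([[0.9, 0.1], [0.2, 0.8]])  # pairwise factor x1-x2
psi23 = np.array([[0.7, 0.3], [0.4, 0.6]])  # pairwise factor x2-x3

# Forward messages
m12 = phi1 @ psi12        # x1 -> x2: sum_x1 phi1(x1) psi12(x1, x2)
m23 = m12 @ psi23         # x2 -> x3
# Backward messages
m32 = psi23 @ np.ones(2)  # x3 -> x2: sum_x3 psi23(x2, x3)
m21 = psi12 @ m32         # x2 -> x1

# Beliefs = product of incoming messages (and unary factors), normalised
b1 = phi1 * m21; b1 /= b1.sum()
b2 = m12 * m32;  b2 /= b2.sum()
b3 = m23;        b3 /= b3.sum()

# Brute-force check against the true marginals of the joint
joint = phi1[:, None, None] * psi12[:, :, None] * psi23[None, :, :]
joint /= joint.sum()
print(np.allclose(b1, joint.sum((1, 2))),
      np.allclose(b2, joint.sum((0, 2))),
      np.allclose(b3, joint.sum((0, 1))))   # True True True
```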
To touch upon
- Dense non-causal connections
- re-factorisation
- inference estimate
- (approximate) independence and identifiability are simpler as variational approximation problems