Bayesian deep learning



Figure: Probably approximately a horse

WARNING: more than usually chaotic notes here

Bayesian inference for massively parameterised networks.

To learn:

  • marginal likelihood for model selection: how does it behave when the loss surface has many optima?

Closely related: Generative models, where we train a process to generate the phenomenon of interest.

Backgrounders

Radford Neal’s thesis (Neal 1996) is a foundational work on (asymptotically exact, MCMC-based) Bayesian inference for neural networks. Yarin Gal’s PhD thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout). Diederik P. Kingma’s thesis (Kingma 2017) is a blockbuster in this tradition.
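
For concreteness, here is a minimal sketch of the MC-dropout recipe that Gal’s thesis popularises: keep dropout switched on at prediction time and treat repeated stochastic forward passes as approximate posterior predictive samples. The architecture and numbers below are placeholders (the network is untrained), assuming PyTorch.

```python
import torch
from torch import nn

# Placeholder architecture; in practice this would be a trained network.
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)

x_test = torch.linspace(-3, 3, 100).unsqueeze(-1)

model.train()  # train() mode keeps the Dropout layers stochastic at prediction time
with torch.no_grad():
    preds = torch.stack([model(x_test) for _ in range(50)])

predictive_mean = preds.mean(dim=0)
predictive_std = preds.std(dim=0)  # crude proxy for epistemic uncertainty
```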

Alex Graves presented a poster of his paper (Graves 2011) on one of the simplest weight-uncertainty schemes for (recurrent) nets: a diagonal Gaussian variational posterior over the weights. There is a third-party quick-and-dirty implementation.
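
A minimal sketch of that diagonal-Gaussian idea, assuming PyTorch: each weight gets a variational mean and log standard deviation, weights are sampled afresh on each forward pass, and a KL penalty against a standard-normal prior is added to the loss. Graves’s paper uses a different gradient estimator; the reparameterized sampling below is just the simplest modern way to write it down, and all names here are illustrative.

```python
import torch
from torch import nn

class DiagonalGaussianLinear(nn.Module):
    """Linear layer with a factorised Gaussian posterior over its weights."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.w_log_sigma = nn.Parameter(torch.full((d_out, d_in), -3.0))
        self.b = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        # Sample weights by reparameterization: W = mu + sigma * eps.
        sigma = torch.exp(self.w_log_sigma)
        w = self.w_mu + sigma * torch.randn_like(sigma)
        return x @ w.T + self.b

    def kl(self):
        # KL( N(mu, sigma^2) || N(0, 1) ), summed over weights.
        sigma2 = torch.exp(2 * self.w_log_sigma)
        return 0.5 * (sigma2 + self.w_mu ** 2 - 1 - 2 * self.w_log_sigma).sum()
```

Training would then minimize the expected negative log likelihood of the data plus the summed kl() terms over layers, i.e. the variational free energy.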

One could refer to the 2019 NeurIPS Bayesian deep learning workshop site, which has some more modern positioning. There was also a tutorial in 2020 by Dustin Tran, Jasper Snoek and Balaji Lakshminarayanan: Practical Uncertainty Estimation & Out-of-Distribution Robustness in Deep Learning.

Generative methods are useful here, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.
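
As a reminder of the mechanism, here is the reparameterization trick in isolation: write a Gaussian latent draw as a deterministic function of the variational parameters plus parameter-free noise, so gradients of a Monte Carlo objective flow back into those parameters. Names below are illustrative.

```python
import torch

# Hypothetical encoder outputs for one data point.
mu = torch.zeros(8, requires_grad=True)
log_sigma = torch.zeros(8, requires_grad=True)

# z ~ N(mu, sigma^2), written as a deterministic transform of fixed noise.
eps = torch.randn(8)
z = mu + torch.exp(log_sigma) * eps

loss = (z ** 2).sum()  # stand-in for the decoder / ELBO term
loss.backward()
print(mu.grad, log_sigma.grad)  # gradients reach the variational parameters
```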

Sampling from the posterior

TBD

Stochastic Gradient Descent

I think this argument is already leveraged in Neal (1996), but see Mandt, Hoffman, and Blei (2017) for a highly developed version. Quoting their abstract (a toy illustration follows the list):

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results.

  1. We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions.
  2. We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models.
  3. We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly.
  4. We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates.
  5. Finally, we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
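
Here is the toy illustration promised above: constant-step SGD on a trivially simple conjugate model, with post-burn-in iterates read off as approximate posterior samples. This is a cartoon of point 1 only; matching the stationary distribution to the posterior properly requires tuning the step size as Mandt, Hoffman, and Blei (2017) describe, and the numbers below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: scalar mean theta, Gaussian likelihood with known sigma, flat prior.
N, sigma = 1000, 1.0
data = rng.normal(loc=2.0, scale=sigma, size=N)

def minibatch_grad(theta, batch_size=32):
    """Stochastic estimate of the full-data gradient of the negative log posterior."""
    batch = rng.choice(data, size=batch_size, replace=False)
    return N * (theta - batch.mean()) / sigma**2

# Constant-step SGD: after burn-in the iterates bounce around the mode;
# for a well-tuned step size their spread approximates the posterior spread.
eta, theta = 1e-4, 0.0
samples = []
for t in range(20_000):
    theta -= eta * minibatch_grad(theta)
    if t > 5_000:
        samples.append(theta)

print("SGD iterate mean/std:    ", np.mean(samples), np.std(samples))
print("exact posterior mean/std:", data.mean(), sigma / np.sqrt(N))
```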

Via NTK

How does this work? See He, Lakshminarayanan, and Teh (2020).

Ensemble methods

Deep learning has its own twists on model averaging and bagging: Neural ensembles.
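
A minimal sketch of the vanilla deep-ensemble recipe: train a handful of identically specified networks from independent random initializations and read uncertainty off their disagreement. Architecture, data and hyperparameters below are placeholders.

```python
import torch
from torch import nn

def make_net():
    return nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

# Toy 1-D regression data.
x = torch.linspace(-2, 2, 128).unsqueeze(-1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)

ensemble = []
for _ in range(5):  # independent random initializations
    net = make_net()
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    ensemble.append(net)

# Disagreement between members as an uncertainty estimate.
x_test = torch.linspace(-3, 3, 200).unsqueeze(-1)
with torch.no_grad():
    preds = torch.stack([net(x_test) for net in ensemble])
mean, spread = preds.mean(dim=0), preds.std(dim=0)
```

The spread between members is only a heuristic stand-in for posterior uncertainty, which is part of what He, Lakshminarayanan, and Teh (2020) above try to put on firmer footing.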

Practicalities

The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.

References

Baydin, Atılım Güneş, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, et al. 2019. “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale.” In. http://arxiv.org/abs/1907.03382.
Doerr, Andreas, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. 2018. “Probabilistic Recurrent State-Space Models.” January 31, 2018. http://arxiv.org/abs/1801.10395.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” November 30, 2020. http://arxiv.org/abs/2012.00152.
Dupont, Emilien, Arnaud Doucet, and Yee Whye Teh. 2019. “Augmented Neural ODEs.” April 2, 2019. http://arxiv.org/abs/1904.01681.
Dutordoir, Vincent, James Hensman, Mark van der Wilk, Carl Henrik Ek, Zoubin Ghahramani, and Nicolas Durrande. 2021. “Deep Neural Networks as Point Estimates for Deep Gaussian Processes.” May 10, 2021. http://arxiv.org/abs/2105.04504.
Eleftheriadis, Stefanos, Tom Nicholson, Marc Deisenroth, and James Hensman. 2017. “Identification of Gaussian Process State Space Models.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 5309–19. Curran Associates, Inc. http://papers.nips.cc/paper/7115-identification-of-gaussian-process-state-space-models.pdf.
Fabius, Otto, and Joost R. van Amersfoort. 2014. “Variational Recurrent Auto-Encoders.” In Proceedings of ICLR. http://arxiv.org/abs/1412.6581.
Foong, Andrew Y K, David R Burt, Yingzhen Li, and Richard E Turner. 2019. “Pathologies of Factorised Gaussian and MC Dropout Posteriors in Bayesian Neural Networks.” In 4th Workshop on Bayesian Deep Learning (NeurIPS 2019), 17.
Gal, Yarin. 2015. “Rapid Prototyping of Probabilistic Models: Emerging Challenges in Variational Inference.” In Advances in Approximate Bayesian Inference Workshop, NIPS.
———. 2016. “Uncertainty in Deep Learning.” University of Cambridge.
Gal, Yarin, and Zoubin Ghahramani. 2015a. “On Modern Deep Learning and Variational Inference.” In Advances in Approximate Bayesian Inference Workshop, NIPS.
———. 2015b. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16). http://arxiv.org/abs/1506.02142.
———. 2016a. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In 4th International Conference on Learning Representations (ICLR) Workshop Track. http://arxiv.org/abs/1506.02158.
———. 2016b. “Dropout as a Bayesian Approximation: Appendix.” May 25, 2016. http://arxiv.org/abs/1506.02157.
Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” May 22, 2017. http://arxiv.org/abs/1705.07832.
Garnelo, Marta, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. 2018. “Conditional Neural Processes.” July 4, 2018. https://arxiv.org/abs/1807.01613v1.
Giryes, R., G. Sapiro, and A. M. Bronstein. 2016. “Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?” IEEE Transactions on Signal Processing 64 (13): 3444–57. https://doi.org/10.1109/TSP.2016.2546221.
Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2348–56. NIPS’11. USA: Curran Associates Inc. https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
Gu, Shixiang, Zoubin Ghahramani, and Richard E Turner. 2015. “Neural Adaptive Sequential Monte Carlo.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2629–37. Curran Associates, Inc. http://papers.nips.cc/paper/5961-neural-adaptive-sequential-monte-carlo.pdf.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Hu, Zhiting, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2018. “On Unifying Deep Generative Models.” In. http://arxiv.org/abs/1706.00550.
Karl, Maximilian, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. 2016. “Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data.” In Proceedings of ICLR. http://arxiv.org/abs/1605.06432.
Kingma, Diederik P. 2017. “Variational Inference & Deep Learning: A New Synthesis.” https://www.dropbox.com/s/v6ua3d9yt44vgb3/cover_and_thesis.pdf?dl=0.
Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In ICLR 2014 Conference. http://arxiv.org/abs/1312.6114.
Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2020. “Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks.” July 17, 2020. http://arxiv.org/abs/2002.10118.
———. 2021. “Learnable Uncertainty Under Laplace Approximations.” June 7, 2021. http://arxiv.org/abs/2010.02720.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. “Deep Neural Networks as Gaussian Processes.” In ICLR. http://arxiv.org/abs/1711.00165.
Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. “Bayesian Sparsification of Recurrent Neural Networks.” In Workshop on Learning to Generate Natural Language. http://arxiv.org/abs/1708.00077.
Louizos, Christos, and Max Welling. 2016. “Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors.” In, 1708–16. http://arxiv.org/abs/1603.04733.
———. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR, 2218–27. http://proceedings.mlr.press/v70/louizos17a.html.
MacKay, David J. C. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
MacKay, David J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.” Neural Computation 4 (3): 448–72. https://doi.org/10.1162/neco.1992.4.3.448.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
Matthews, Alexander Graeme de Garis, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. 2018. “Gaussian Process Behaviour in Wide Deep Neural Networks.” August 16, 2018. http://arxiv.org/abs/1804.11271.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML. http://arxiv.org/abs/1701.05369.
Neal, Radford M. 1996. “Bayesian Learning for Neural Networks.” Secaucus, NJ, USA: Springer-Verlag New York, Inc. http://www.csri.utoronto.ca/~radford/ftp/thesis.pdf.
Peluchetti, Stefano, and Stefano Favaro. 2020. “Infinitely Deep Neural Networks as Diffusion Processes.” In International Conference on Artificial Intelligence and Statistics, 1126–36. PMLR. http://proceedings.mlr.press/v108/peluchetti20a.html.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: Max-Planck-Gesellschaft; MIT Press. http://www.gaussianprocess.org/gpml/.
Wen, Yeming, Dustin Tran, and Jimmy Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In ICLR. http://arxiv.org/abs/2002.06715.
