Probabilistic neural nets

Bayesian and other probabilistic inference in overparameterized ML



Inferring densities and distributions in a massively parameterised deep learning setting.

This is not intrinsically a Bayesian thing to do, but in practice much of the demand for probabilistic nets comes from a desire for Bayesian posterior inference over neural network weights. Bayesian inference is, however, not the only way to do uncertainty quantification.

Neural networks are very far from the simple exponential families where conjugate priors might help, so we typically rely upon approximations (or luck) to get at our true target of interest.

Closely related: Generative models where we train a process to generate a (possibly stochastic) phenomenon of interest.

Backgrounders

Radford Neal’s thesis (Neal 1996) is a foundational Bayesian use of neural networks in the wide NN and MCMC sampling settings. Diederik P. Kingma’s thesis is a blockbuster in the more recent variational tradition.

I found the poster for Alex Graves’ paper (Graves 2011), which applies about the simplest possible weight uncertainty for recurrent nets (a diagonal Gaussian distribution over the weights), elucidating. (There is a third-party quick-and-dirty implementation.)

For more modern positioning, see the 2019 NeurIPS Bayesian deep learning workshop site, or the 2020 tutorial by Dustin Tran, Jasper Snoek and Balaji Lakshminarayanan, Practical Uncertainty Estimation & Out-of-Distribution Robustness in Deep Learning.

Generative methods are useful here, e.g. the variational autoencoder and the affiliated reparameterization trick. Likelihood-free methods seem to be in the air too.

We are free to consider classic neural network inference as sort-of a special case of Bayes inference. Specifically, we interpret the loss function \(\mathcal{L}\) of a net \(f:\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^k\) as a negative log posterior over the weights, \[ \begin{aligned} \mathcal{L}(\theta) &:=-\sum_{i=1}^{m} \log p\left(y_{i} \mid f\left(x_{i} ; \theta\right)\right)-\log p(\theta) \\ &=-\log p(\theta \mid \mathcal{D})+\text{const}, \end{aligned} \] where the constant (the log evidence) does not depend on \(\theta\).

Obviously a few things are different here; the parameter vector \(\theta\) is not interpretable, so what do posterior distributions over it even mean? What are sensible priors? Choosing priors over by-design-uninterpretable parameters such as NN weights is a whole fraught thing that we will mostly ignore for now. Usually it is by default something like \[ p(\theta)=\mathcal{N}\left(0, \lambda^{-1} I\right) \] for want of a better idea. This ends up being equivalent to the “weight decay” regularisation.
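To make that last claim concrete, here is a minimal sketch (PyTorch; the net and \(\lambda\) are arbitrary placeholders) of the correspondence between the Gaussian prior and weight decay:

```python
import math
import torch

def neg_log_gaussian_prior(params, lam):
    """-log p(theta) for theta ~ N(0, lam^{-1} I): a quadratic penalty plus a constant."""
    theta = torch.cat([p.reshape(-1) for p in params])
    d = theta.numel()
    # -log N(theta; 0, lam^{-1} I) = (lam / 2) * ||theta||^2 + (d / 2) * log(2 * pi / lam)
    return 0.5 * lam * theta.pow(2).sum() + 0.5 * d * math.log(2 * math.pi / lam)

net = torch.nn.Sequential(torch.nn.Linear(3, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
lam = 1e-2
penalty = neg_log_gaussian_prior(net.parameters(), lam)
# Up to the additive constant, minimising (data loss + penalty) is the same as
# training with weight_decay=lam in a standard optimiser.
```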

Sweeping those qualms aside, we could do the usual stuff for Bayes inference, like considering the predictive posterior \[ p(y \mid x, \mathcal{D})=\int p(y \mid f(x ; \theta)) p(\theta \mid \mathcal{D}) d \theta \]
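If we did have draws \(\theta^{(s)} \sim p(\theta \mid \mathcal{D})\) — which is what most of the methods below try to supply — that integral could be approximated by plain Monte Carlo. A minimal sketch, with a hypothetical `net_fn(x, theta)` that evaluates the network at a given weight sample, here for a classification likelihood:

```python
import torch

def mc_predictive(net_fn, x, theta_samples):
    """Approximate p(y | x, D) ≈ (1/S) Σ_s p(y | f(x; θ_s)) by averaging
    the per-sample predictive distributions (here: class probabilities)."""
    probs = torch.stack([
        torch.softmax(net_fn(x, theta), dim=-1)  # p(y | f(x; θ_s)) for a classifier
        for theta in theta_samples
    ])
    return probs.mean(dim=0)  # Monte Carlo average over posterior samples
```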

Usually this turns out to be intractable to calculate in the very high-dimensional parameter spaces of an NN, so we summarise the posterior by a simple maximum a posteriori estimate \[ \theta_{\mathrm{MAP}}:=\operatorname{argmin}_{\theta} \mathcal{L}(\theta). \] In this case we have recovered the classic training of non-Bayes nets, but we have no notion of predictive uncertainty in that setting.

Usually the model will also have many symmetries, so we know that it has many optima, which makes approximations that hang everything off a particular mode, such as a MAP estimate, suspect.

Somewhere between the full belt-and-braces Bayes approach and the MAP point estimate there are various approximations to Bayes inference we might try.

🏗 To discuss: so many options for predictive uncertainty, but fewer for inverse uncertainty.

What follows is a non-exhaustive smörgåsbord of options for doing probabilistic inference in neural nets.

MC sampling of weights by low-rank Matheron updates

Needs a shorter name, but looks cool (Ritter et al. 2021).

microsoft/bayesianize: Bayesianize: A Bayesian neural network wrapper in pytorch
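For orientation, Matheron’s rule itself is just a trick for turning a sample from a joint Gaussian into a sample from a conditional Gaussian; Ritter et al. (2021) apply it with low-rank / inducing-weight structure, which I do not reproduce here. A generic sketch of the rule:

```python
import numpy as np

def matheron_conditional_sample(mu_a, mu_b, S_aa, S_ab, S_bb, b_obs, rng):
    """Draw a ~ p(a | b = b_obs) for jointly Gaussian (a, b) via Matheron's rule:
    sample (a0, b0) from the joint, then correct a0 using the observed b."""
    d_a = len(mu_a)
    joint_mu = np.concatenate([mu_a, mu_b])
    joint_S = np.block([[S_aa, S_ab], [S_ab.T, S_bb]])
    z = rng.multivariate_normal(joint_mu, joint_S)
    a0, b0 = z[:d_a], z[d_a:]
    # a0 + S_ab S_bb^{-1} (b_obs - b0) has exactly the conditional law p(a | b = b_obs)
    return a0 + S_ab @ np.linalg.solve(S_bb, b_obs - b0)
```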

Mixture density networks

Nothing to say for now but here are some recommendations I received about this classic (C. Bishop 1994) method.
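A minimal mixture-density-network head in PyTorch, as a sketch of the Bishop (1994) construction (the class name and sizes are mine): the network outputs mixture weights, means and scales for a Gaussian mixture over a scalar target, and the negative log likelihood is minimised as usual.

```python
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Map features to a K-component Gaussian mixture over a scalar target."""
    def __init__(self, in_dim, n_components=5):
        super().__init__()
        self.params = nn.Linear(in_dim, 3 * n_components)  # logits, means, log-scales
        self.K = n_components

    def forward(self, h):
        logits, mu, log_sigma = self.params(h).chunk(3, dim=-1)
        return logits, mu, log_sigma.exp()

    def neg_log_likelihood(self, h, y):
        logits, mu, sigma = self(h)
        log_w = torch.log_softmax(logits, dim=-1)
        comp = torch.distributions.Normal(mu, sigma)
        # log p(y|h) = logsumexp_k [ log w_k + log N(y; mu_k, sigma_k) ]
        log_p = torch.logsumexp(log_w + comp.log_prob(y.unsqueeze(-1)), dim=-1)
        return -log_p.mean()
```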

Variational autoencoders

See variational autoencoders.

Sampling via Monte Carlo

TBD. For now, if the number of parameters is smallish, see Hamiltonian Monte Carlo.

Stochastic Gradient Descent as MC inference

I have a vague memory that this argument is leveraged in Neal (1996)? But see the version in Mandt, Hoffman, and Blei (2017) for a highly developed modern take:

Stochastic Gradient Descent with a constant learning rate (constant SGD) simulates a Markov chain with a stationary distribution. With this perspective, we derive several new results.

  1. We show that constant SGD can be used as an approximate Bayesian posterior inference algorithm. Specifically, we show how to adjust the tuning parameters of constant SGD to best match the stationary distribution to a posterior, minimizing the Kullback-Leibler divergence between these two distributions.
  2. We demonstrate that constant SGD gives rise to a new variational EM algorithm that optimizes hyperparameters in complex probabilistic models.
  3. We also propose SGD with momentum for sampling and show how to adjust the damping coefficient accordingly.
  4. We analyze MCMC algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we quantify the approximation errors due to finite learning rates.
  5. Finally, we use the stochastic process perspective to give a short proof of why Polyak averaging is optimal. Based on this idea, we propose a scalable approximate MCMC algorithm, the Averaged Stochastic Gradient Sampler.
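A crude illustration of that perspective (not the carefully tuned version from the paper): run SGD with a constant learning rate past a burn-in period and keep thinned weight iterates as rough approximate posterior samples. All names and hyperparameters here are placeholders.

```python
import copy
import torch

def constant_sgd_samples(net, loss_fn, loader, lr=1e-2, burn_in=1000, n_samples=20, thin=100):
    """Collect thinned weight iterates of constant-learning-rate SGD as
    rough approximate posterior samples, in the spirit of Mandt et al. (2017)."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    samples, step = [], 0
    while len(samples) < n_samples:
        for x, y in loader:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()
            step += 1
            if step > burn_in and step % thin == 0:
                samples.append(copy.deepcopy(net.state_dict()))
                if len(samples) >= n_samples:
                    break
    return samples
```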

A popular recent version of this is the Stochastic Weight Averaging family (Izmailov et al. 2018, 2020; Maddox et al. 2019; Wilson and Izmailov 2020), which I am interested in. See Andrew G Wilson’s web page for a brief description of the sub methods here, since he seems to have been involved in all of them.

Laplace approximation

A Laplace approximation locally approximates the posterior using a Gaussian \[ p(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right). \] Such an approach is classic for neural nets (David J. C. MacKay 1992). There are many variants of this technique for different assumptions. Laplace approximations have the attractive feature of providing estimates for forward and inverse problems (Foong et al. 2019; Immer, Korzepa, and Bauer 2021) by leveraging the delta method.

The basic idea is that we hold \(x \in \mathbb{R}^{n}\) fixed and use the Jacobian matrix \(J(x):=\left.\nabla_{\theta} f(x ; \theta)\right|_{\theta_{\mathrm{MAP}}} \in \mathbb{R}^{d \times k}\) to linearise the network as \[ f(x ; \theta) \approx f\left(x ; \theta_{\mathrm{MAP}}\right)+J(x)^{\top}\left(\theta-\theta_{\mathrm{MAP}}\right), \] i.e. a first-order Taylor expansion about \(\theta_{\mathrm{MAP}}\). Under this approximation, since \(\theta\) is a posteriori distributed as \(\mathcal{N}\left(\theta_{\mathrm{MAP}}, \Sigma\right)\), the marginal distribution over the network output \(f(x)\) is also Gaussian, \[ f(x) \mid x, \mathcal{D} \sim \mathcal{N}\left(f\left(x ; \theta_{\mathrm{MAP}}\right), J(x)^{\top} \Sigma J(x)\right). \] For more on this, see C. M. Bishop (2006, eqs. 5.167, 5.188). It is essentially a gratis Laplace approximation in the sense that if I have fit the network I can already calculate those Jacobians, so I am probably one line of code away from getting some kind of uncertainty estimate. However, I have no particular guarantee that it is well calibrated, because the simplifications were chosen a priori and might not be appropriate to the combination of model and data that I actually have.
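A minimal sketch of that “one line of code” (well, a few) linearised predictive for a single input, assuming we already have some posterior covariance \(\Sigma\) over the flattened weights from a Laplace fit; here it is simply passed in, and the Jacobian is built row by row with ordinary reverse-mode autodiff:

```python
import torch

def linearised_laplace_predictive(net, x, Sigma):
    """Gaussian predictive N(f(x; θ_MAP), J Σ Jᵀ) from a first-order Taylor
    expansion of the net about the MAP weights (J here is k × d)."""
    params = [p for p in net.parameters() if p.requires_grad]
    f = net(x).squeeze(0)                         # output of shape (k,)
    rows = []
    for j in range(f.numel()):                    # one backward pass per output dimension
        grads = torch.autograd.grad(f[j], params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    J = torch.stack(rows)                         # (k, d) Jacobian at θ_MAP
    mean = f.detach()
    cov = J @ Sigma @ J.T                         # equals J(x)ᵀ Σ J(x) in the text's (d × k) convention
    return mean, cov
```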

Learnable Laplace approximations

Agustinus Kristiadi and team have created various methods for low-overhead neural uncertainty quantification via Laplace approximation that have greater flexibility for adaptively choosing the type and manner of approximation. See, e.g. Painless Uncertainty for Deep Learning and their papers (Kristiadi, Hein, and Hennig 2020, 2021b).

One interesting variant is that of Kristiadi, Hein, and Hennig (2021b), which makes the uncertainty learnable so that, for example, the predictive distribution can reflect uncertainty about outlier datapoints. They define an augmented Learnable Uncertainty Laplace Approximation (LULA) network \(\tilde{f}\) with extra parameters, \(\tilde{\theta}=(\theta_{\mathrm{MAP}}, \hat{\theta}).\)

Let \(f: \mathbb{R}^{n} \times \mathbb{R}^{d} \rightarrow \mathbb{R}^{k}\) be an \(L\)-layer neural network with MAP-trained parameters \(\theta_{\text{MAP}}\) and let \(\widetilde{f}: \mathbb{R}^{n} \times \mathbb{R}^{\widetilde{d}} \rightarrow \mathbb{R}^{k}\) along with \(\widetilde{\theta}_{\text{MAP}}\) be obtained by adding LULA units. Let \(q(\widetilde{\theta}):=\mathcal{N}\left(\tilde{\theta}_{\mathrm{MAP}}, \widetilde{\Sigma}\right)\) be the Laplace-approximated posterior and \(p\left(y \mid x, \mathcal{D} ; \widetilde{\theta}_{\mathrm{MAP}}\right)\) be the (approximate) predictive distribution under the LA. Furthermore, let us denote the dataset sampled i.i.d. from the data distribution as \(\mathcal{D}_{\text{in}}\) and that from some outlier distribution as \(\mathcal{D}_{\text{out}}\), and let \(H\) be the entropy functional. We construct the following loss function to induce high uncertainty on outliers while maintaining high confidence over the data (inliers): \[ \begin{array}{rl} \mathcal{L}_{\text{LULA}}\left(\widetilde{\theta}_{\text{MAP}}\right)&:=\frac{1}{\left|\mathcal{D}_{\text{in}}\right|} \sum_{x_{\text{in}} \in \mathcal{D}_{\text{in}}} H\left[p\left(y \mid x_{\text{in}}, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)\right] \\ &-\frac{1}{\left|\mathcal{D}_{\text{out}}\right|} \sum_{x_{\text{out}} \in \mathcal{D}_{\text{out}}} H\left[p\left(y \mid x_{\text{out}}, \mathcal{D} ; \widetilde{\theta}_{\text{MAP}}\right)\right] \end{array} \] and minimize it w.r.t. the free parameters \(\widehat{\theta}\).

I am assuming here that by the entropy functional they mean the entropy of the normal distribution, \[ H(\mathcal{N}(\mu, \Sigma)) = \frac{1}{2}\ln \left((2\pi \mathrm{e})^{k}\det \left(\Sigma\right)\right), \] but this looks expensive due to that determinant calculation in a (large) \(d\times d\) matrix. Or possibly they mean some general entropy with respect to some density \(p\), \[H(p)=\mathbb{E}_{p}\left[-\log p(x)\right],\] which I suppose one could estimate as \[H(p)\approx\frac{1}{N}\sum_{i=1}^N \left[-\log p(x_i)\right]\] without taking that normal Laplace approximation at this step, if we could find the density, and assuming the \(x_i\) were drawn from it.
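For what it is worth, the Gaussian entropy is cheap to compute stably via a Cholesky factor once the covariance is in hand, and the Monte Carlo estimator is just a sample average of negative log densities. Sketches of both, under the assumptions above:

```python
import numpy as np

def gaussian_entropy(Sigma):
    """H[N(mu, Sigma)] = 0.5 * log det(2 * pi * e * Sigma), via Cholesky for stability."""
    k = Sigma.shape[0]
    L = np.linalg.cholesky(Sigma)
    log_det = 2.0 * np.log(np.diag(L)).sum()
    return 0.5 * (k * np.log(2 * np.pi * np.e) + log_det)

def mc_entropy(log_density, samples):
    """Monte Carlo estimate H[p] ≈ -(1/N) Σ log p(x_i) for x_i drawn from p."""
    return -np.mean([log_density(x) for x in samples])
```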

The result is a slightly weird hybrid fitting procedure that requires two loss functions and which feels a little ad hoc, but maybe it works?

By stochastic weight averaging

A Bayesian extension of Stochastic Weight Averaging: fit a Gaussian to the trajectory of SGD iterates and use it as an approximate posterior (Izmailov et al. 2018, 2020; Maddox et al. 2019; Wilson and Izmailov 2020). A sketch follows.
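A rough sketch in the spirit of diagonal SWAG (Maddox et al. 2019): keep running first and second moments of periodically collected SGD iterates, then sample weights from the implied diagonal Gaussian. The low-rank covariance component and schedule details from the paper are omitted.

```python
import torch

class DiagonalSWAG:
    """Track first and second moments of flattened weight iterates and sample
    from the implied diagonal Gaussian N(mean, var) over the weights."""
    def __init__(self, d):
        self.n = 0
        self.mean = torch.zeros(d)
        self.sq_mean = torch.zeros(d)

    def collect(self, theta):          # call periodically during (constant-lr) SGD
        self.n += 1
        self.mean += (theta - self.mean) / self.n
        self.sq_mean += (theta ** 2 - self.sq_mean) / self.n

    def sample(self):
        var = (self.sq_mean - self.mean ** 2).clamp(min=1e-12)
        return self.mean + var.sqrt() * torch.randn_like(self.mean)
```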

Via random projections

I do not have a single paper about this, but I have seen random projection pop up as a piece of the puzzle in other methods. TBC.

In Gaussian process regression

See kernel learning.

Via measure transport

See reparameterization.

Via infinite-width random nets

See wide NN.

Via NTK

How does this work? He, Lakshminarayanan, and Teh (2020).

Ensemble methods

Deep learning has its own twists on model averaging and bagging: neural ensembles. Yarin Gal’s PhD thesis (Gal 2016) summarizes some implicit approximate approaches (e.g. the Bayesian interpretation of dropout), although dropout as he frames it has since become controversial as a means of posterior inference.
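The plainest ensemble recipe is the simplest of all these options: train a handful of nets from different random initialisations and average their predictive distributions. A minimal sketch, with hypothetical `make_net` and `train` helpers:

```python
import torch

def deep_ensemble_predict(make_net, train, x, n_members=5):
    """Train n_members independently initialised nets and average their
    softmax outputs as a (non-Bayesian) predictive distribution."""
    members = []
    for seed in range(n_members):
        torch.manual_seed(seed)        # different init (and data order) per member
        members.append(train(make_net()))
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in members])
    return probs.mean(dim=0)
```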

Practicalities

The computational toolsets for “neural” probabilistic programming and vanilla probabilistic programming are converging. See the tool listing under probabilistic programming.

References

Abbasnejad, Ehsan, Anthony Dick, and Anton van den Hengel. 2016. “Infinite Variational Autoencoder for Semi-Supervised Learning.” In Advances in Neural Information Processing Systems 29. http://arxiv.org/abs/1611.07800.
Archer, Evan, Il Memming Park, Lars Buesing, John Cunningham, and Liam Paninski. 2015. “Black Box Variational Inference for State Space Models.” arXiv:1511.07367 [stat], November. http://arxiv.org/abs/1511.07367.
Bao, Gang, Xiaojing Ye, Yaohua Zang, and Haomin Zhou. 2020. “Numerical Solution of Inverse Problems by Weak Adversarial Networks.” Inverse Problems 36 (11): 115003. https://doi.org/10.1088/1361-6420/abb447.
Baydin, Atılım Güneş, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, et al. 2019. “Etalumis: Bringing Probabilistic Programming to Scientific Simulators at Scale.” In arXiv:1907.03382 [cs, Stat]. http://arxiv.org/abs/1907.03382.
Bazzani, Loris, Lorenzo Torresani, and Hugo Larochelle. 2017. “Recurrent Mixture Density Network for Spatiotemporal Visual Attention,” 15.
Berg, Rianne van den, Leonard Hasenclever, Jakub M. Tomczak, and Max Welling. 2018. “Sylvester Normalizing Flows for Variational Inference.” In Uai18. http://arxiv.org/abs/1803.05649.
Bishop, Christopher. 1994. “Mixture Density Networks.” Microsoft Research, January. https://www.microsoft.com/en-us/research/publication/mixture-density-networks/.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. New York: Springer.
Bora, Ashish, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. 2017. “Compressed Sensing Using Generative Models.” In International Conference on Machine Learning, 537–46. http://arxiv.org/abs/1703.03208.
Bui, Thang D., Sujith Ravi, and Vivek Ramavajjala. 2017. “Neural Graph Machines: Learning Neural Networks Using Graphs.” arXiv:1703.04818 [cs], March. http://arxiv.org/abs/1703.04818.
Castro, Pablo de, and Tommaso Dorigo. 2019. “INFERNO: Inference-Aware Neural Optimisation.” Computer Physics Communications 244 (November): 170–79. https://doi.org/10.1016/j.cpc.2019.06.007.
Chen, Tian Qi, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. 2018. “Neural Ordinary Differential Equations.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 6572–83. Curran Associates, Inc. http://papers.nips.cc/paper/7892-neural-ordinary-differential-equations.pdf.
Cutajar, Kurt, Edwin V. Bonilla, Pietro Michiardi, and Maurizio Filippone. 2017. “Random Feature Expansions for Deep Gaussian Processes.” In PMLR. http://proceedings.mlr.press/v70/cutajar17a.html.
Damianou, Andreas, and Neil Lawrence. 2013. “Deep Gaussian Processes.” In Artificial Intelligence and Statistics, 207–15. http://proceedings.mlr.press/v31/damianou13a.html.
Dandekar, Raj, Karen Chung, Vaibhav Dixit, Mohamed Tarek, Aslan Garcia-Valadez, Krishna Vishal Vemula, and Chris Rackauckas. 2021. “Bayesian Neural Ordinary Differential Equations.” arXiv:2012.07244 [cs], March. http://arxiv.org/abs/2012.07244.
Dezfouli, Amir, and Edwin V. Bonilla. 2015. “Scalable Inference for Gaussian Process Models with Black-Box Likelihoods.” In Advances in Neural Information Processing Systems 28, 1414–22. NIPS’15. Cambridge, MA, USA: MIT Press. http://dl.acm.org/citation.cfm?id=2969239.2969397.
Doerr, Andreas, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. 2018. “Probabilistic Recurrent State-Space Models.” arXiv:1801.10395 [stat], January. http://arxiv.org/abs/1801.10395.
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” arXiv:2012.00152 [cs, Stat], November. http://arxiv.org/abs/2012.00152.
Dunlop, Matthew M., Mark A. Girolami, Andrew M. Stuart, and Aretha L. Teckentrup. 2018. “How Deep Are Deep Gaussian Processes?” Journal of Machine Learning Research 19 (1): 2100–2145. http://jmlr.org/papers/v19/18-015.html.
Dupont, Emilien, Arnaud Doucet, and Yee Whye Teh. 2019. “Augmented Neural ODEs.” arXiv:1904.01681 [cs, Stat], April. http://arxiv.org/abs/1904.01681.
Dutordoir, Vincent, James Hensman, Mark van der Wilk, Carl Henrik Ek, Zoubin Ghahramani, and Nicolas Durrande. 2021. “Deep Neural Networks as Point Estimates for Deep Gaussian Processes.” arXiv:2105.04504 [cs, Stat], May. http://arxiv.org/abs/2105.04504.
Eleftheriadis, Stefanos, Tom Nicholson, Marc Deisenroth, and James Hensman. 2017. “Identification of Gaussian Process State Space Models.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 5309–19. Curran Associates, Inc. http://papers.nips.cc/paper/7115-identification-of-gaussian-process-state-space-models.pdf.
Fabius, Otto, and Joost R. van Amersfoort. 2014. “Variational Recurrent Auto-Encoders.” In Proceedings of ICLR. http://arxiv.org/abs/1412.6581.
Figurnov, Mikhail, Shakir Mohamed, and Andriy Mnih. 2018. “Implicit Reparameterization Gradients.” In Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, 441–52. Curran Associates, Inc. http://papers.nips.cc/paper/7326-implicit-reparameterization-gradients.pdf.
Flunkert, Valentin, David Salinas, and Jan Gasthaus. 2017. “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.” arXiv:1704.04110 [cs, Stat], April. http://arxiv.org/abs/1704.04110.
Foong, Andrew Y. K., Yingzhen Li, José Miguel Hernández-Lobato, and Richard E. Turner. 2019. “‘In-Between’ Uncertainty in Bayesian Neural Networks.” arXiv:1906.11537 [cs, Stat], June. http://arxiv.org/abs/1906.11537.
Gal, Yarin. 2015. “Rapid Prototyping of Probabilistic Models: Emerging Challenges in Variational Inference.” In Advances in Approximate Bayesian Inference Workshop, NIPS.
———. 2016. “Uncertainty in Deep Learning.” University of Cambridge.
Gal, Yarin, and Zoubin Ghahramani. 2015a. “On Modern Deep Learning and Variational Inference.” In Advances in Approximate Bayesian Inference Workshop, NIPS.
———. 2015b. “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning.” In Proceedings of the 33rd International Conference on Machine Learning (ICML-16). http://arxiv.org/abs/1506.02142.
———. 2016a. “A Theoretically Grounded Application of Dropout in Recurrent Neural Networks.” In arXiv:1512.05287 [stat]. http://arxiv.org/abs/1512.05287.
———. 2016b. “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference.” In 4th International Conference on Learning Representations (ICLR) Workshop Track. http://arxiv.org/abs/1506.02158.
———. 2016c. “Dropout as a Bayesian Approximation: Appendix.” arXiv:1506.02157 [stat], May. http://arxiv.org/abs/1506.02157.
Gal, Yarin, Jiri Hron, and Alex Kendall. 2017. “Concrete Dropout.” arXiv:1705.07832 [stat], May. http://arxiv.org/abs/1705.07832.
Garnelo, Marta, Dan Rosenbaum, Chris J. Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo J. Rezende, and S. M. Ali Eslami. 2018. “Conditional Neural Processes.” arXiv:1807.01613 [cs, Stat], July, 10. https://arxiv.org/abs/1807.01613v1.
Garnelo, Marta, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Ali Eslami, and Yee Whye Teh. 2018. “Neural Processes,” July. https://arxiv.org/abs/1807.01622v1.
Gholami, Amir, Kurt Keutzer, and George Biros. 2019. “ANODE: Unconditionally Accurate Memory-Efficient Gradients for Neural ODEs.” arXiv:1902.10298 [cs], February. http://arxiv.org/abs/1902.10298.
Giryes, R., G. Sapiro, and A. M. Bronstein. 2016. “Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?” IEEE Transactions on Signal Processing 64 (13): 3444–57. https://doi.org/10.1109/TSP.2016.2546221.
Gourieroux, C., A. Monfort, and E. Renault. 1993. “Indirect Inference.” Journal of Applied Econometrics 8 (December): S85–118. http://www.jstor.org/stable/2285076.
Graves, Alex. 2011. “Practical Variational Inference for Neural Networks.” In Proceedings of the 24th International Conference on Neural Information Processing Systems, 2348–56. NIPS’11. USA: Curran Associates Inc. https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf.
———. 2013. “Generating Sequences With Recurrent Neural Networks.” arXiv:1308.0850 [cs], August. http://arxiv.org/abs/1308.0850.
Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. “Speech Recognition with Deep Recurrent Neural Networks.” In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. https://doi.org/10.1109/ICASSP.2013.6638947.
Gregor, Karol, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. 2015. “DRAW: A Recurrent Neural Network For Image Generation.” arXiv:1502.04623 [cs], February. http://arxiv.org/abs/1502.04623.
Gu, Shixiang, Zoubin Ghahramani, and Richard E Turner. 2015. “Neural Adaptive Sequential Monte Carlo.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2629–37. Curran Associates, Inc. http://papers.nips.cc/paper/5961-neural-adaptive-sequential-monte-carlo.pdf.
Gu, Shixiang, Sergey Levine, Ilya Sutskever, and Andriy Mnih. 2016. “MuProp: Unbiased Backpropagation for Stochastic Neural Networks.” In Proceedings of ICLR. https://arxiv.org/abs/1511.05176v3.
He, Bobby, Balaji Lakshminarayanan, and Yee Whye Teh. 2020. “Bayesian Deep Ensembles via the Neural Tangent Kernel.” In Advances in Neural Information Processing Systems. Vol. 33. https://proceedings.neurips.cc//paper_files/paper/2020/hash/0b1ec366924b26fc98fa7b71a9c249cf-Abstract.html.
Hoffman, Matthew, and David Blei. 2015. “Stochastic Structured Variational Inference.” In PMLR, 361–69. http://proceedings.mlr.press/v38/hoffman15.html.
Hu, Zhiting, Zichao Yang, Ruslan Salakhutdinov, and Eric P. Xing. 2018. “On Unifying Deep Generative Models.” In arXiv:1706.00550 [cs, Stat]. http://arxiv.org/abs/1706.00550.
Immer, Alexander, Maciej Korzepa, and Matthias Bauer. 2021. “Improving Predictions of Bayesian Neural Nets via Local Linearization.” In International Conference on Artificial Intelligence and Statistics, 703–11. PMLR. https://proceedings.mlr.press/v130/immer21a.html.
Izmailov, Pavel, Wesley J. Maddox, Polina Kirichenko, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2020. “Subspace Inference for Bayesian Deep Learning.” In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, 1169–79. PMLR. https://proceedings.mlr.press/v115/izmailov20a.html.
Izmailov, Pavel, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. “Averaging Weights Leads to Wider Optima and Better Generalization,” March. https://arxiv.org/abs/1803.05407v3.
Karl, Maximilian, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. 2016. “Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data.” In Proceedings of ICLR. http://arxiv.org/abs/1605.06432.
Kingma, Diederik P. 2017. “Variational Inference & Deep Learning: A New Synthesis.” https://www.dropbox.com/s/v6ua3d9yt44vgb3/cover_and_thesis.pdf?dl=0.
Kingma, Diederik P., Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. 2016. “Improving Variational Inference with Inverse Autoregressive Flow.” In Advances in Neural Information Processing Systems 29. Curran Associates, Inc. http://arxiv.org/abs/1606.04934.
Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In ICLR 2014 Conference. http://arxiv.org/abs/1312.6114.
Krauth, Karl, Edwin V. Bonilla, Kurt Cutajar, and Maurizio Filippone. 2016. “AutoGP: Exploring the Capabilities and Limitations of Gaussian Process Models.” In Uai17. http://arxiv.org/abs/1610.05392.
Krishnan, Rahul G., Uri Shalit, and David Sontag. 2015. “Deep Kalman Filters.” arXiv Preprint arXiv:1511.05121. https://arxiv.org/abs/1511.05121.
Kristiadi, Agustinus, Matthias Hein, and Philipp Hennig. 2020. “Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks.” In ICML 2020. http://arxiv.org/abs/2002.10118.
———. 2021a. “An Infinite-Feature Extension for Bayesian ReLU Nets That Fixes Their Asymptotic Overconfidence.” arXiv:2010.02709 [cs, Stat], May. http://arxiv.org/abs/2010.02709.
———. 2021b. “Learnable Uncertainty Under Laplace Approximations.” In Uncertainty in Artificial Intelligence. http://arxiv.org/abs/2010.02720.
Larsen, Anders Boesen Lindbo, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2015. “Autoencoding Beyond Pixels Using a Learned Similarity Metric.” arXiv:1512.09300 [cs, Stat], December. http://arxiv.org/abs/1512.09300.
Le, Tuan Anh, Atılım Güneş Baydin, and Frank Wood. 2017. “Inference Compilation and Universal Probabilistic Programming.” In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 54:1338–48. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR. http://arxiv.org/abs/1610.09900.
Le, Tuan Anh, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. 2017. “Auto-Encoding Sequential Monte Carlo.” arXiv Preprint arXiv:1705.10306. https://arxiv.org/abs/1705.10306.
Lee, Herbert K. H., Dave M. Higdon, Zhuoxin Bi, Marco A. R. Ferreira, and Mike West. 2002. “Markov Random Field Models for High-Dimensional Parameters in Simulations of Fluid Flow in Porous Media.” Technometrics 44 (3): 230–41. https://doi.org/10.1198/004017002188618419.
Lee, Jaehoon, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. 2018. “Deep Neural Networks as Gaussian Processes.” In ICLR. http://arxiv.org/abs/1711.00165.
Liu, Xiao, Kyongmin Yeo, and Siyuan Lu. 2020. “Statistical Modeling for Spatio-Temporal Data From Stochastic Convection-Diffusion Processes.” Journal of the American Statistical Association 0 (0): 1–18. https://doi.org/10.1080/01621459.2020.1863223.
Lobacheva, Ekaterina, Nadezhda Chirkova, and Dmitry Vetrov. 2017. “Bayesian Sparsification of Recurrent Neural Networks.” In Workshop on Learning to Generate Natural Language. http://arxiv.org/abs/1708.00077.
Louizos, Christos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. “Causal Effect Inference with Deep Latent-Variable Models.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 6446–56. Curran Associates, Inc. http://papers.nips.cc/paper/7223-causal-effect-inference-with-deep-latent-variable-models.pdf.
Louizos, Christos, and Max Welling. 2016. “Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors.” In arXiv Preprint arXiv:1603.04733, 1708–16. http://arxiv.org/abs/1603.04733.
———. 2017. “Multiplicative Normalizing Flows for Variational Bayesian Neural Networks.” In PMLR, 2218–27. http://proceedings.mlr.press/v70/louizos17a.html.
MacKay, David J. C. 1992. “A Practical Bayesian Framework for Backpropagation Networks.” Neural Computation 4 (3): 448–72. https://doi.org/10.1162/neco.1992.4.3.448.
MacKay, David J C. 2002. Information Theory, Inference & Learning Algorithms. Cambridge University Press.
Maddison, Chris J., Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. 2017. “Filtering Variational Objectives.” arXiv Preprint arXiv:1705.09279. https://arxiv.org/abs/1705.09279.
Maddox, Wesley, Timur Garipov, Pavel Izmailov, Dmitry Vetrov, and Andrew Gordon Wilson. 2019. “A Simple Baseline for Bayesian Uncertainty in Deep Learning,” February. https://arxiv.org/abs/1902.02476v2.
Mandt, Stephan, Matthew D. Hoffman, and David M. Blei. 2017. “Stochastic Gradient Descent as Approximate Bayesian Inference.” JMLR, April. http://arxiv.org/abs/1704.04289.
Matthews, Alexander Graeme de Garis, Mark Rowland, Jiri Hron, Richard E. Turner, and Zoubin Ghahramani. 2018. “Gaussian Process Behaviour in Wide Deep Neural Networks.” arXiv:1804.11271 [cs, Stat], August. http://arxiv.org/abs/1804.11271.
Matthews, Alexander Graeme de Garis, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. 2016. “GPflow: A Gaussian Process Library Using TensorFlow.” arXiv:1610.08733 [stat], October. http://arxiv.org/abs/1610.08733.
Molchanov, Dmitry, Arsenii Ashukha, and Dmitry Vetrov. 2017. “Variational Dropout Sparsifies Deep Neural Networks.” In Proceedings of ICML. http://arxiv.org/abs/1701.05369.
Neal, Radford M. 1996. “Bayesian Learning for Neural Networks.” Secaucus, NJ, USA: Springer-Verlag New York, Inc. http://www.csri.utoronto.ca/~radford/ftp/thesis.pdf.
Ngiam, Jiquan, Zhenghao Chen, Pang W. Koh, and Andrew Y. Ng. 2011. “Learning Deep Energy Models.” In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 1105–12. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Ngiam_557.pdf.
Oakley, Jeremy E., and Benjamin D. Youngman. 2017. “Calibration of Stochastic Computer Simulators Using Likelihood Emulation.” Technometrics 59 (1): 80–92. https://doi.org/10.1080/00401706.2015.1125391.
Papamakarios, George, Iain Murray, and Theo Pavlakou. 2017. “Masked Autoregressive Flow for Density Estimation.” In Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, 2338–47. Curran Associates, Inc. http://papers.nips.cc/paper/6828-masked-autoregressive-flow-for-density-estimation.pdf.
Partee, Sam, Michael Ringenburg, Benjamin Robbins, and Andrew Shao. 2019. “Model Parameter Optimization: ML-Guided Trans-Resolution Tuning of Physical Models.” In. Zenodo.
Peluchetti, Stefano, and Stefano Favaro. 2020. “Infinitely Deep Neural Networks as Diffusion Processes.” In International Conference on Artificial Intelligence and Statistics, 1126–36. PMLR. http://proceedings.mlr.press/v108/peluchetti20a.html.
Raissi, Maziar, P. Perdikaris, and George Em Karniadakis. 2019. “Physics-Informed Neural Networks: A Deep Learning Framework for Solving Forward and Inverse Problems Involving Nonlinear Partial Differential Equations.” Journal of Computational Physics 378 (February): 686–707. https://doi.org/10.1016/j.jcp.2018.10.045.
Rasmussen, Carl Edward, and Christopher K. I. Williams. 2006. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. Cambridge, Mass: MIT Press. http://www.gaussianprocess.org/gpml/.
Rezende, Danilo Jimenez, and Shakir Mohamed. 2015. “Variational Inference with Normalizing Flows.” In International Conference on Machine Learning, 1530–38. ICML’15. Lille, France: JMLR.org. http://arxiv.org/abs/1505.05770.
Rezende, Danilo J, Sébastien Racanière, Irina Higgins, and Peter Toth. 2019. “Equivariant Hamiltonian Flows.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS), 6.
Ritter, Hippolyt, Martin Kukla, Cheng Zhang, and Yingzhen Li. 2021. “Sparse Uncertainty Representation in Deep Learning with Inducing Weights.” arXiv:2105.14594 [cs, Stat], May. http://arxiv.org/abs/2105.14594.
Ruiz, Francisco J. R., Michalis K. Titsias, and David M. Blei. 2016. “The Generalized Reparameterization Gradient.” In Advances In Neural Information Processing Systems. http://arxiv.org/abs/1610.02287.
Ryder, Thomas, Andrew Golightly, A. Stephen McGough, and Dennis Prangle. 2018. “Black-Box Variational Inference for Stochastic Differential Equations.” arXiv:1802.03335 [stat], February. http://arxiv.org/abs/1802.03335.
Sanchez-Gonzalez, Alvaro, Victor Bapst, Peter Battaglia, and Kyle Cranmer. 2019. “Hamiltonian Graph Networks with ODE Integrators.” In Machine Learning and the Physical Sciences Workshop at the 33rd Conference on Neural Information Processing Systems (NeurIPS), 11.
Sigrist, Fabio, Hans R. Künsch, and Werner A. Stahel. 2015. “Stochastic Partial Differential Equation Based Modelling of Large Space-Time Data Sets.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 77 (1): 3–33. https://doi.org/10.1111/rssb.12061.
Tran, Dustin, Matthew D. Hoffman, Rif A. Saurous, Eugene Brevdo, Kevin Murphy, and David M. Blei. 2017. “Deep Probabilistic Programming.” In ICLR. http://arxiv.org/abs/1701.03757.
Tran, Dustin, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. 2016. “Edward: A Library for Probabilistic Modeling, Inference, and Criticism.” arXiv:1610.09787 [cs, Stat], October. http://arxiv.org/abs/1610.09787.
Wainwright, Martin, and Michael I Jordan. 2005. “A Variational Principle for Graphical Models.” In New Directions in Statistical Signal Processing. Vol. 155. MIT Press.
Wen, Yeming, Dustin Tran, and Jimmy Ba. 2020. “BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning.” In ICLR. http://arxiv.org/abs/2002.06715.
Wilson, Andrew Gordon, and Pavel Izmailov. 2020. “Bayesian Deep Learning and a Probabilistic Perspective of Generalization,” February. https://arxiv.org/abs/2002.08791v3.
Xu, Kailai, and Eric Darve. 2020. “ADCME: Learning Spatially-Varying Physical Fields Using Deep Neural Networks.” In arXiv:2011.11955 [cs, Math]. http://arxiv.org/abs/2011.11955.
Yang, Yunfei, Zhen Li, and Yang Wang. 2021. “On the Capacity of Deep Generative Networks for Approximating Distributions.” arXiv:2101.12353 [cs, Math, Stat], January. http://arxiv.org/abs/2101.12353.
Zeevi, Assaf J., and Ronny Meir. 1997. “Density Estimation Through Convex Combinations of Densities: Approximation and Estimation Bounds.” Neural Networks: The Official Journal of the International Neural Network Society 10 (1): 99–109. https://doi.org/10.1016/S0893-6080(96)00037-8.
