Bayes neural nets via subsetting weights

January 11, 2022 — December 17, 2024


Bayesian neural nets in which only some weights are treated as random variables while the rest are held fixed. This raises various difficulties: how do you update a fixed parameter? It sounds like a sparse Bayes problem, but where sparse Bayes auditions interpretable regressors for inclusion in the model, here we audition uninterpretable, unidentifiable weights for treatment as random variables; ultimately every weight stays in the model, either as a random variate or as a deterministic parameter.

Moving target alert! No one agrees what to call them. For now I use the emerging term pBNNs, a.k.a. “partial Bayesian neural networks” (Zhao et al. 2024), which seems acceptable.

1 Is this even principled?

At first glance, this sounds like a reasonable thing to do. But then you try to write down the equations and things look weird. How would we interpret the “posterior” of a fixed parameter? Presumably there is some kind of variational argument?

Try Sharma et al. (2022) for a start.

2 How to update a deterministic parameter?

From the perspective of Bayesian inference, a parameter we do not update has a point-mass prior, i.e. zero prior variance, and a zero-variance prior implies a zero-variance posterior: conditioning on data cannot move it. And yet we do update such parameters, by SGD. What does that mean? How can we make it statistically well-posed?
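One common way of making this concrete (not necessarily the only principled one) is to give the stochastic subset a variational posterior and treat the deterministic weights as point estimates, then optimize both by gradient ascent on a Monte Carlo ELBO, so the fixed weights are effectively learned by (approximate) marginal-likelihood maximization. A minimal JAX sketch of that recipe on toy data; the architecture, shapes, and step sizes are all illustrative, not anyone's canonical algorithm:

```python
# Partial BNN trained by maximizing a Monte Carlo ELBO:
# first layer deterministic (point estimate), last layer stochastic
# with a mean-field Gaussian variational posterior.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
x = jnp.linspace(-2.0, 2.0, 64)[:, None]
y = jnp.sin(3 * x) + 0.1 * jax.random.normal(key, x.shape)

H = 16  # hidden width (illustrative)

def net(x, W1, b1, w2):
    h = jnp.tanh(x @ W1 + b1)   # deterministic first layer
    return h @ w2               # stochastic last layer

def neg_elbo(params, key):
    W1, b1, mu, log_sigma = params
    sigma = jnp.exp(log_sigma)
    eps = jax.random.normal(key, mu.shape)
    w2 = mu + sigma * eps       # reparameterized sample of the random subset
    pred = net(x, W1, b1, w2)
    log_lik = -0.5 * jnp.sum((y - pred) ** 2) / 0.1**2
    # KL(q || N(0, I)) penalizes only the stochastic weights
    kl = 0.5 * jnp.sum(mu**2 + sigma**2 - 2 * log_sigma - 1.0)
    return -(log_lik - kl)

params = (
    0.1 * jax.random.normal(key, (1, H)),  # W1: deterministic
    jnp.zeros((H,)),                       # b1: deterministic
    jnp.zeros((H, 1)),                     # mu: variational mean
    -2.0 * jnp.ones((H, 1)),               # log_sigma: variational scale
)

grad = jax.jit(jax.grad(neg_elbo))
for step in range(2000):
    key, sub = jax.random.split(key)
    g = grad(params, sub)
    params = tuple(p - 1e-3 * gp for p, gp in zip(params, g))
```

Note that the deterministic weights receive no KL penalty; they are updated by exactly the same gradient step as the variational parameters, which is precisely the move whose justification is in question here.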

3 Last layer

The most famous special case. Not that interesting in itself, since it misses many phenomena of interest, but so tractable that it is a good place to start. See Bayes last layer.
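A reminder of why it is so tractable: once the feature layers are frozen, the last layer is just Bayesian linear regression on the penultimate-layer activations, which is conjugate. A toy NumPy sketch, where a fixed random feature map stands in for a pretrained network and the precisions alpha, beta are illustrative:

```python
# Bayesian last layer = conjugate linear regression on fixed features.
# Prior w ~ N(0, alpha^-1 I), likelihood y | x, w ~ N(phi(x) @ w, beta^-1).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)[:, None]
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

W1 = rng.standard_normal((1, 32))      # frozen "network" weights
phi = lambda x: np.tanh(x @ W1)        # fixed feature map, shape (n, 32)

alpha, beta = 1.0, 100.0               # prior and noise precisions
Phi = phi(x)
A = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi   # posterior precision
mean = beta * np.linalg.solve(A, Phi.T @ y)             # posterior mean of w

# Predictive mean and variance at new points
x_new = np.linspace(-3, 3, 5)[:, None]
Phi_new = phi(x_new)
pred_mean = Phi_new @ mean
pred_var = 1.0 / beta + np.einsum(
    "ij,jk,ik->i", Phi_new, np.linalg.inv(A), Phi_new
)
```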

4 Via sequential Monte Carlo?

Zhao et al. (2024) is an elegant paper showing how to train a pBNN using sequential Monte Carlo (SMC).
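For intuition only, here is the generic SMC idea applied to the stochastic subset: carry a particle ensemble over the random weights, reweight it as data arrives, and resample when the effective sample size collapses. This is a schematic sketch, not the Feynman-Kac construction of the paper; the deterministic weights are simply frozen here, whereas a full pBNN scheme would also optimize them (e.g. by gradient steps on a particle-averaged likelihood), and all shapes and thresholds are illustrative.

```python
# Bootstrap-style SMC over the stochastic (last-layer) weights of a toy pBNN.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)[:, None]
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

W1 = rng.standard_normal((1, 32))            # deterministic weights, frozen here
phi = lambda x: np.tanh(x @ W1)

P, noise = 256, 0.1                          # particle count, observation noise
particles = rng.standard_normal((P, 32, 1))  # prior draws of last-layer weights
logw = np.zeros(P)

for xb, yb in zip(np.split(x, 10), np.split(y, 10)):   # stream mini-batches
    preds = phi(xb) @ particles                         # (P, batch, 1)
    logw += -0.5 * np.sum((yb - preds) ** 2, axis=(1, 2)) / noise**2
    w = np.exp(logw - logw.max()); w /= w.sum()
    if 1.0 / np.sum(w**2) < P / 2:                      # ESS below threshold
        idx = rng.choice(P, size=P, p=w)                # multinomial resampling
        # crude jitter as a stand-in for a proper rejuvenation / move step
        particles = particles[idx] + 0.01 * rng.standard_normal(particles.shape)
        logw = np.zeros(P)

w = np.exp(logw - logw.max()); w /= w.sum()             # final normalized weights
posterior_mean_pred = np.einsum("p,pnk->nk", w, phi(x) @ particles)
```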

5 Via singular learning theory?

The connections are certainly suggestive. The interpretation from SLT, as far as my meagre understanding goes, would be slightly different: we would not necessarily be learning a model with some parameters fixed, but we might find that some parameters are locally unidentifiable, which sounds like it is potentially the converse. But the setting is so similar that it bears investigating. See singular learning theory.

6 Probabilistic weight tying

Probabilistic weight tying is possibly also, in effect, a form of pBNN. Rafael Oliveira has referred me to Roth and Pernkopf (2020) for some ideas on that theme.

7 References

Bhattacharya, Page, and Dunson. 2011. “Density Estimation and Classification via Bayesian Nonparametric Learning of Affine Subspaces.”
Chung, and Chung. 2014. “An Efficient Approach for Computing Optimal Low-Rank Regularized Inverse Matrices.” Inverse Problems.
Daxberger, Nalisnick, Allingham, et al. 2020. “Expressive yet Tractable Bayesian Deep Learning via Subnetwork Inference.” In.
Daxberger, Nalisnick, Allingham, et al. 2021. “Bayesian Deep Learning via Subnetwork Inference.” In Proceedings of the 38th International Conference on Machine Learning.
Durasov, Bagautdinov, Baque, et al. 2021. “Masksembles for Uncertainty Estimation.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Dusenberry, Jerfel, Wen, et al. 2020. “Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors.” In Proceedings of the 37th International Conference on Machine Learning.
Izmailov, Maddox, Kirichenko, et al. 2020. “Subspace Inference for Bayesian Deep Learning.” In Proceedings of The 35th Uncertainty in Artificial Intelligence Conference.
Kampen, Als, and Andersen. 2024. “Towards Scalable Bayesian Transformers: Investigating Stochastic Subset Selection for NLP.” In.
Ke, and Fan. 2022. “On the Optimization and Pruning for Bayesian Deep Learning.”
Kowal. 2022. “Bayesian Subset Selection and Variable Importance for Interpretable Prediction and Classification.”
Mahesh, Collins, Bonev, et al. 2024. “Huge Ensembles Part I: Design of Ensemble Weather Forecasts Using Spherical Fourier Neural Operators.”
Page, Bhattacharya, and Dunson. 2013. “Classification via Bayesian Nonparametric Learning of Affine Subspaces.” Journal of the American Statistical Association.
Roth, and Pernkopf. 2020. “Bayesian Neural Networks with Weight Sharing Using Dirichlet Processes.” IEEE Transactions on Pattern Analysis and Machine Intelligence.
Sharma, Farquhar, Nalisnick, et al. 2022. “Do Bayesian Neural Networks Need To Be Fully Stochastic?”
Spantini, Cui, Willcox, et al. 2017. “Goal-Oriented Optimal Approximations of Bayesian Linear Inverse Problems.” SIAM Journal on Scientific Computing.
Spantini, Solonen, Cui, et al. 2015. “Optimal Low-Rank Approximations of Bayesian Linear Inverse Problems.” SIAM Journal on Scientific Computing.
Thomas, You, Lin, et al. 2022. “Learning Subspaces of Different Dimensions.” Journal of Computational and Graphical Statistics.
Tran, M.-N., Nguyen, Nott, et al. 2019. “Bayesian Deep Net GLM and GLMM.” Journal of Computational and Graphical Statistics.
Tran, Ba-Hien, Rossi, Milios, et al. 2022. “All You Need Is a Good Functional Prior for Bayesian Deep Learning.” Journal of Machine Learning Research.
Zhao, Mair, Schön, et al. 2024. “On Feynman-Kac Training of Partial Bayesian Neural Networks.” In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics.