Annealing in inference

Tempering, cooling, Platt scaling…



Placeholder for a concept that has cropped up a few times of late. Informally, this is where we think about changing the “temperature” of a system whose energy is given by a (negative log-)probability density. In practice that means raising the density to a power, \(p(x)^{1/T}\), or equivalently multiplying the log-density by \(1/T\), and then renormalising.
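To make the “multiply in log space” idea concrete, here is a minimal sketch for a discrete distribution. The helper names `temper` and `entropy` and the three-point example distribution are illustrative assumptions, not anything from the papers cited here.

```python
import numpy as np

def temper(log_p, T):
    """Normalised log-probabilities of p**(1/T) for a discrete distribution."""
    scaled = np.asarray(log_p, dtype=float) / T   # raising p to 1/T = dividing log p by T
    scaled = scaled - scaled.max()                # shift for numerical stability
    return scaled - np.log(np.exp(scaled).sum())  # renormalise in log space

def entropy(log_p):
    """Shannon entropy (in nats) of a distribution given by its log-probabilities."""
    p = np.exp(log_p)
    return float(-(p * log_p).sum())

# Hypothetical example distribution.
log_p = np.log(np.array([0.5, 0.3, 0.2]))

cold = temper(log_p, T=0.5)   # T < 1 sharpens the distribution
same = temper(log_p, T=1.0)   # T = 1 leaves it unchanged
hot = temper(log_p, T=2.0)    # T > 1 flattens it

print(np.exp(cold), entropy(cold))
print(np.exp(same), entropy(same))
print(np.exp(hot), entropy(hot))
```

Lowering the temperature concentrates mass on the mode and raising it pushes the distribution towards uniform, which is the sense in which a posterior with \(T<1\) is “cold”.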

To say more we need to get more precise. There are a few related concepts in here, I think.

Recycling data

TBC

Tempered and cold posteriors

Wenzel et al. (2020) argue, in the context of Bayesian NNs:

…[W]e demonstrate that predictive performance is improved significantly through the use of a “cold posterior” that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate the hypotheses through experiments.

Much debate was sparked. See Aitchison (2020), Adlam, Snoek, and Smith (2020), Noci et al. (2021), Izmailov et al. (2021). They also draw a parallel to Masegosa (2020), which looks somewhat interesting.

Aitchison (2020) introduces the machinery:

Tempered (e.g. Zhang et al. 2018) and cold (Wenzel et al. 2020) posteriors differ slightly in how they apply the temperature parameter. For cold posteriors, we scale the whole posterior, whereas tempering is a method typically applied in variational inference, and corresponds to scaling the likelihood but not the prior,

\[ \begin{aligned} \log \mathrm{P}_{\text{cold}}(\theta \mid X, Y) & =\frac{1}{T} \log \mathrm{P}(X, Y \mid \theta)+\frac{1}{T} \log \mathrm{P}(\theta)+\text{const} \\ \log \mathrm{P}_{\text{tempered}}(\theta \mid X, Y) & =\frac{1}{\lambda} \log \mathrm{P}(X, Y \mid \theta)+\log \mathrm{P}(\theta)+\text{const.} \end{aligned} \]

While cold posteriors are typically used in SGLD, tempered posteriors are usually targeted by variational methods. In particular, variational methods apply temperature scaling to the KL-divergence between the approximate posterior, \(\mathrm{Q}(\theta)\), and prior,

\[ \mathcal{L}=\mathbb{E}_{\mathrm{Q}(\theta)}[\log \mathrm{P}(X, Y \mid \theta)]-\lambda \mathrm{D}_{\mathrm{KL}}(\mathrm{Q}(\theta) \| \mathrm{P}(\theta)) . \]

Note that the only difference between cold and tempered posteriors is whether we scale the prior, and if we have Gaussian priors over the parameters (the usual case in Bayesian neural networks), this scaling can be absorbed into the prior variance,

\[ \frac{1}{T} \log \mathrm{P}_{\text{cold}}(\theta)=-\frac{1}{2 T \sigma_{\text{cold}}^2} \sum_i \theta_i^2+\text{const}=-\frac{1}{2 \sigma_{\text{tempered}}^2} \sum_i \theta_i^2+\text{const}=\log \mathrm{P}_{\text{tempered}}(\theta) , \]

in which case, \(\sigma_{\text{cold}}^2=\sigma_{\text{tempered}}^2 / T\), so the tempered posteriors we discuss are equivalent to cold posteriors with rescaled prior variances.
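One step that the quoted passage leaves implicit is why a \(\lambda\)-weighted KL term targets the tempered posterior. Dividing the objective by \(\lambda>0\) and collecting terms gives

\[ \frac{\mathcal{L}}{\lambda}=\mathbb{E}_{\mathrm{Q}(\theta)}\left[\frac{1}{\lambda} \log \mathrm{P}(X, Y \mid \theta)+\log \mathrm{P}(\theta)-\log \mathrm{Q}(\theta)\right]=-\mathrm{D}_{\mathrm{KL}}\left(\mathrm{Q}(\theta) \| \mathrm{P}_{\text{tempered}}(\theta \mid X, Y)\right)+\text{const}, \]

so maximising \(\mathcal{L}\) over \(\mathrm{Q}\) is the same as minimising the KL divergence to the tempered posterior.

The Gaussian-prior equivalence can also be checked numerically. The following is a rough sketch under assumed ingredients: a toy linear-regression likelihood with unit noise variance, an isotropic Gaussian prior, and \(\lambda = T\). The function names (`log_lik`, `log_prior`, `log_cold`, `log_tempered`) and the data are hypothetical, chosen only for illustration. If the two unnormalised log-posteriors describe the same distribution, their difference should be constant in \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data, used only to have a concrete likelihood.
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20)

def log_lik(theta, noise_var=1.0):
    """Gaussian log-likelihood of y given X and theta."""
    resid = y - X @ theta
    return -0.5 * np.sum(resid**2) / noise_var - 0.5 * len(y) * np.log(2 * np.pi * noise_var)

def log_prior(theta, var):
    """Isotropic zero-mean Gaussian log-prior with variance `var`."""
    return -0.5 * np.sum(theta**2) / var - 0.5 * len(theta) * np.log(2 * np.pi * var)

def log_cold(theta, T, prior_var):
    """Cold posterior: likelihood and prior are both scaled by 1/T."""
    return (log_lik(theta) + log_prior(theta, prior_var)) / T

def log_tempered(theta, lam, prior_var):
    """Tempered posterior: only the likelihood is scaled by 1/lambda."""
    return log_lik(theta) / lam + log_prior(theta, prior_var)

T = 0.25
var_cold = 1.0
var_tempered = T * var_cold   # i.e. var_cold = var_tempered / T

thetas = rng.normal(size=(5, 3))

# Matched prior variances: the gap should not depend on theta.
gaps = np.array([log_cold(t, T, var_cold) - log_tempered(t, T, var_tempered) for t in thetas])

# Mismatched prior variances: the gap should vary with theta.
gaps_bad = np.array([log_cold(t, T, var_cold) - log_tempered(t, T, var_cold) for t in thetas])

print("matched:   ", gaps)
print("mismatched:", gaps_bad)
```

With the rescaled variance the gap is the same at every test point, so the two densities agree up to normalisation; without the rescaling the gap changes with \(\sum_i \theta_i^2\).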

Gaussian in particular

TBC

References

Adlam, Ben, Jasper Snoek, and Samuel L. Smith. 2020. “Cold Posteriors and Aleatoric Uncertainty.” arXiv.
Aitchison, Laurence. 2020. “A Statistical Theory of Cold Posteriors in Deep Neural Networks.”
Barbier, Jean. 2015. “Statistical Physics and Approximate Message-Passing Algorithms for Sparse Linear Estimation Problems in Signal Processing and Coding Theory.” arXiv:1511.01650 [Cs, Math], November.
Feng, Yu, and Yuhai Tu. 2021. “The Inverse Variance–Flatness Relation in Stochastic Gradient Descent Is Critical for Finding Flat Minima.” Proceedings of the National Academy of Sciences 118 (9): e2015617118.
Ge, Rong, Holden Lee, and Andrej Risteski. 2020. “Simulated Tempering Langevin Monte Carlo II: An Improved Proof Using Soft Markov Chain Decomposition.” arXiv:1812.00793 [Cs, Math, Stat], September.
Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6): 721–41.
Goffe, William L., Gary D. Ferrier, and John Rogers. 1994. “Global Optimization of Statistical Functions with Simulated Annealing.” Journal of Econometrics 60 (1-2): 65–99.
Griffin, Jim, Krys Latuszynski, and Mark Steel. 2019. “In Search of Lost (Mixing) Time: Adaptive Markov Chain Monte Carlo Schemes for Bayesian Variable Selection with Very Large p.” arXiv:1708.05678 [Stat], May.
Guo, Chuan, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. “On Calibration of Modern Neural Networks.” arXiv.
Izmailov, Pavel, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Wilson. 2021. “What Are Bayesian Neural Network Posteriors Really Like?” In Proceedings of the 38th International Conference on Machine Learning, 4629–40. PMLR.
Kroese, Dirk P., Zdravko I. Botev, Thomas Taimre, and Radislav Vaisman. 2019. Mathematical and Statistical Methods for Data Science and Machine Learning. First edition. Chapman & Hall/CRC Machine Learning & Pattern Recognition. Boca Raton: CRC Press.
Masegosa, Andrés R. 2020. “Learning Under Model Misspecification: Applications to Variational and Ensemble Methods.” In Proceedings of the 34th International Conference on Neural Information Processing Systems, 5479–91. NIPS’20. Red Hook, NY, USA: Curran Associates Inc.
Miyahara, Hideyuki, Koji Tsumura, and Yuki Sughiyama. 2016. “Relaxation of the EM Algorithm via Quantum Annealing for Gaussian Mixture Models.” In arXiv:1701.03268 [Cond-Mat, Physics:quant-Ph, Stat], 4674–79.
Nabarro, Seth, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, and Laurence Aitchison. 2022. “Data Augmentation in Bayesian Neural Networks and the Cold Posterior Effect.” In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, 1434–44. PMLR.
Neal, Radford M. 1993. “Probabilistic Inference Using Markov Chain Monte Carlo Methods.” Technical Report CRGTR-93-1. Toronto, Canada: Department of Computer Science, University of Toronto.
———. 1998. “Annealed Importance Sampling.” arXiv.
Noci, Lorenzo, Kevin Roth, Gregor Bachmann, Sebastian Nowozin, and Thomas Hofmann. 2021. “Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect.” In Advances in Neural Information Processing Systems, 34:12738–48. Curran Associates, Inc.
Robert, Christian P., Víctor Elvira, Nick Tawn, and Changye Wu. 2018. “Accelerating MCMC Algorithms.” WIREs Computational Statistics 10 (5): e1435.
Roberts, Gareth O., and Jeffrey S. Rosenthal. 2014. “Minimising MCMC Variance via Diffusion Limits, with an Application to Simulated Tempering.” Annals of Applied Probability 24 (1): 131–49.
Seifert, Udo. 2012. “Stochastic Thermodynamics, Fluctuation Theorems and Molecular Machines.” Reports on Progress in Physics 75 (12): 126001.
Skilling, John. 2006. “Nested Sampling for General Bayesian Computation.” Bayesian Analysis 1 (4): 833–59.
Syed, Saifuddin, Alexandre Bouchard-Côté, George Deligiannidis, and Arnaud Doucet. 2020. “Non-Reversible Parallel Tempering: A Scalable Highly Parallel MCMC Scheme.” arXiv:1905.02939 [Stat], November.
Welling, Max, and Yee Whye Teh. 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” In Proceedings of the 28th International Conference on International Conference on Machine Learning, 681–88. ICML’11. Madison, WI, USA: Omnipress.
Wenzel, Florian, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. 2020. “How Good Is the Bayes Posterior in Deep Neural Networks Really?” In Proceedings of the 37th International Conference on Machine Learning, 119:10248–59. PMLR.
Zanella, Giacomo, and Gareth Roberts. 2019. “Scalable Importance Tempering and Bayesian Variable Selection.” Journal of the Royal Statistical Society Series B: Statistical Methodology 81 (3): 489–517.
Zhang, Guodong, Shengyang Sun, David Duvenaud, and Roger Grosse. 2018. “Noisy Natural Gradient as Variational Inference.” In Proceedings of the 35th International Conference on Machine Learning, 5852–61. PMLR.
