# Annealing in inference

Tempering, cooling, Platt scaling…

September 30, 2020 — September 4, 2023

Placeholder for a concept that has cropped up a few times of late. Informally, this is where we think about changing the “temperature” of a system whose energy is given by a certain (log-)probability density. In practice that means raising the density to a power or, equivalently, multiplying the log-density by a constant.
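To make the "raise to a power, or multiply in log space" idea concrete, here is a minimal sketch on a three-state discrete distribution. The function name `temper` and the example probabilities are my own illustrative choices, and I assume the convention that temperature \(T\) means the power \(1/T\), so \(T<1\) is "cold" and \(T>1\) is "hot".

```python
import numpy as np

def temper(log_p, T):
    """Normalised log-probabilities of the density p**(1/T).

    Hypothetical helper: divides the log-density by T, then renormalises.
    """
    scaled = np.asarray(log_p, dtype=float) / T
    # Subtract a numerically stable log-sum-exp so the result sums to 1.
    m = scaled.max()
    return scaled - (m + np.log(np.exp(scaled - m).sum()))

p = np.array([0.7, 0.2, 0.1])   # illustrative distribution
log_p = np.log(p)

same = np.exp(temper(log_p, T=1.0))  # T = 1 leaves p unchanged
cold = np.exp(temper(log_p, T=0.5))  # T < 1 concentrates mass on the mode
hot = np.exp(temper(log_p, T=2.0))   # T > 1 flattens towards uniform
```

Under these assumptions the cold version puts more mass on the most probable state than `p` does, and the hot version puts less.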

To say more we need to get more precise. There are a few related concepts in here, I think.

## 1 Recycling data

TBC

## 2 Tempered and cold posteriors

Wenzel et al. (2020) argue, in the context of Bayesian NNs:

…[W]e demonstrate that predictive performance is improved significantly through the use of a “cold posterior” that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate the hypotheses through experiments.

This sparked much debate; see Aitchison (2020), Adlam, Snoek, and Smith (2020), Noci et al. (2021), and Izmailov et al. (2021). They also draw a parallel to Masegosa (2020), which looks somewhat interesting.

Aitchison (2020) introduces the machinery:

Tempered (e.g. Zhang et al. 2018) and cold (Wenzel et al. 2020) posteriors differ slightly in how they apply the temperature parameter. For cold posteriors, we scale the whole posterior, whereas tempering is a method typically applied in variational inference, and corresponds to scaling the likelihood but not the prior,

\[
\begin{aligned}
\log \mathrm{P}_{\text{cold}}(\theta \mid X, Y) &= \frac{1}{T} \log \mathrm{P}(X, Y \mid \theta) + \frac{1}{T} \log \mathrm{P}(\theta) + \text{const} \\
\log \mathrm{P}_{\text{tempered}}(\theta \mid X, Y) &= \frac{1}{\lambda} \log \mathrm{P}(X, Y \mid \theta) + \log \mathrm{P}(\theta) + \text{const.}
\end{aligned}
\]

While cold posteriors are typically used in SGLD, tempered posteriors are usually targeted by variational methods. In particular, variational methods apply temperature scaling to the KL-divergence between the approximate posterior, \(\mathrm{Q}(\theta)\), and prior,

\[
\mathcal{L} = \mathbb{E}_{\mathrm{Q}(\theta)}[\log \mathrm{P}(X, Y \mid \theta)] - \lambda \mathrm{D}_{\mathrm{KL}}(\mathrm{Q}(\theta) \| \mathrm{P}(\theta)).
\]

Note that the only difference between cold and tempered posteriors is whether we scale the prior, and if we have Gaussian priors over the parameters (the usual case in Bayesian neural networks), this scaling can be absorbed into the prior variance,

\[
\frac{1}{T} \log \mathrm{P}_{\text{cold}}(\theta) = -\frac{1}{2 T \sigma_{\text{cold}}^2} \sum_i \theta_i^2 + \text{const} = -\frac{1}{2 \sigma_{\text{tempered}}^2} \sum_i \theta_i^2 + \text{const} = \log \mathrm{P}_{\text{tempered}}(\theta),
\]

in which case, \(\sigma_{\text{cold}}^2 = \sigma_{\text{tempered}}^2 / T\), so the tempered posteriors we discuss are equivalent to cold posteriors with rescaled prior variances.
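The claim that a Gaussian prior lets us absorb the temperature into the prior variance can be checked numerically. The following is a rough sketch, not anything from Aitchison's paper: I assume a toy one-parameter model with a Gaussian likelihood of known noise scale and a zero-mean Gaussian prior, drop all additive constants, and set \(\lambda = T\). All function names and numeric values are hypothetical.

```python
import numpy as np

def log_lik(theta, y, noise_sd=1.0):
    # Gaussian log-likelihood of data y for each candidate mean in theta,
    # up to an additive constant.
    resid = y[:, None] - theta[None, :]
    return -0.5 * np.sum(resid**2, axis=0) / noise_sd**2

def log_prior(theta, prior_var):
    # Zero-mean Gaussian log-prior, up to an additive constant.
    return -0.5 * theta**2 / prior_var

def log_cold(theta, y, T, prior_var):
    # Cold posterior: likelihood AND prior are both scaled by 1/T.
    return (log_lik(theta, y) + log_prior(theta, prior_var)) / T

def log_tempered(theta, y, lam, prior_var):
    # Tempered posterior: only the likelihood is scaled, by 1/lambda.
    return log_lik(theta, y) / lam + log_prior(theta, prior_var)

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=20)      # synthetic data, illustrative only
theta = np.linspace(-3.0, 3.0, 201)    # grid of parameter values

T = 0.25
var_tempered = 2.0
var_cold = var_tempered / T            # the rescaling from the quoted passage

# With the rescaled prior variance the two log-densities coincide.
diff = log_cold(theta, y, T, var_cold) - log_tempered(theta, y, T, var_tempered)

# With the SAME prior variance they differ by a term quadratic in theta.
mismatch = log_cold(theta, y, T, var_tempered) - log_tempered(theta, y, T, var_tempered)
```

In this sketch `diff` is zero up to floating-point error, while `mismatch` varies across the grid, which is the sense in which the two constructions differ only through the effective prior variance.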

## 3 Gaussian in particular

TBC

## 4 References

*arXiv:1511.01650 [Cs, Math]*.

*Proceedings of the National Academy of Sciences*.

*arXiv:1812.00793 [Cs, Math, Stat]*.

*IEEE Transactions on Pattern Analysis and Machine Intelligence*.

*Journal of Econometrics*.

*arXiv:1708.05678 [Stat]*.

*Proceedings of the 38th International Conference on Machine Learning*.

*Mathematical and Statistical Methods for Data Science and Machine Learning*. Chapman & Hall/CRC Machine Learning & Pattern Recognition.

*Proceedings of the 34th International Conference on Neural Information Processing Systems*. NIPS’20.

*arXiv:1701.03268 [Cond-Mat, Physics:quant-Ph, Stat]*.

*Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence*.

*Advances in Neural Information Processing Systems*.

*WIREs Computational Statistics*.

*Annals of Applied Probability*.

*Reports on Progress in Physics*.

*Bayesian Analysis*.

*arXiv:1905.02939 [Stat]*.

*Proceedings of the 28th International Conference on Machine Learning*. ICML’11.

*Proceedings of the 37th International Conference on Machine Learning*.

*Journal of the Royal Statistical Society Series B: Statistical Methodology*.

*Proceedings of the 35th International Conference on Machine Learning*.