Modern SGD algorithms are often of the “adaptive” flavour, which means that the learning rate is adaptively tuned for each parameter during the learning process, putting them somewhere between first order and second order.
Justifications for the adaptation are some mixture of theoretical, empirical, and most of all economical: many very big companies are built on the most famous method, Adam, so it must be pretty good. Sebastian Ruder maintained an overview of gradient descent optimization algorithms over 2016–2020, which is my favourite introduction to the topic. Read that before hoping to get anything new from me.
There are explicitly 2nd-order flavours (Shampoo springs to mind).
1 Adam
The Adam optimizer (Kingma and Ba 2017) updates parameters using estimates of first and second moments of the gradients. As the most popular optimizer in deep learning, it is worth understanding, but by the same token, analyzed to oblivion. There are many competent introductions to Adam. Here are two:
- Adam - Cornell University Computational Optimization Open Textbook
- Rahul Agarwal, Complete Guide to the Adam Optimization Algorithm
The update rule for the parameter $\theta$ at time $t$ is

$$\theta_{t+1} = \theta_t - \frac{\eta_t}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$

(with the square root and division taken elementwise), where

- $\eta_t$ is the learning rate at time $t$,
- $\hat{m}_t$ is the exponentially decaying average of past gradients (first moment estimate),
- $\hat{v}_t$ is the exponentially decaying average of past squared gradients (second moment estimate),
- $\epsilon$ is a small constant to prevent division by zero.
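For concreteness, here is a minimal NumPy sketch of that update. It is illustrative rather than a reference implementation; `adam_step` is a made-up helper name, and the hyperparameter defaults are the usual ones from Kingma and Ba (2017).

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters `theta` given gradient `g`.

    `m` and `v` are the running first and second moment estimates from the
    previous step; `t` is the 1-indexed step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g        # first moment: decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimise the toy quadratic f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 1001):
    g = theta                              # gradient of the toy quadratic at the current iterate
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.05)
print(theta)  # should end up near [0, 0]
```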
This is the workhorse of modern ML. Much has been written about it. Do I have anything new to add? Honestly, I do not know, because there is so much written about it that search through it is overwhelming, so I write out the bits I need here.
We can interpret the Adam update as solving a regularized least-squares problem at each time step. Specifically, we consider the following optimization problem:
$$\theta_{t+1} = \operatorname*{arg\,min}_{\theta}\; \tfrac{1}{2}\, (\theta - \theta_t)^\top A_t\, (\theta - \theta_t) + \eta_t\, g_t^\top (\theta - \theta_t)$$

where

- $A_t$ is a positive definite matrix (regularization term),
- $g_t$ is the gradient of the loss function at $\theta_t$.
- The first term penalizes deviations from the current parameter estimate $\theta_t$.
- The second term encourages movement in the direction of the negative gradient.

To find the minimizer, set the gradient with respect to $\theta$ to zero:

$$A_t (\theta - \theta_t) + \eta_t\, g_t = 0.$$

Solving for $\theta$ gives

$$\theta_{t+1} = \theta_t - \eta_t\, A_t^{-1} g_t.$$
Adam uses some extra tricks I have not mentioned here, like bias correction terms for $\hat{m}_t$ and $\hat{v}_t$.
In the Adam optimizer, the parameters are updated using the first moment estimate $\hat{m}_t$ in place of the raw gradient, and a preconditioner built from the second moment estimate $\hat{v}_t$:

- Replace $g_t$ with $\hat{m}_t$ (assuming $\hat{m}_t \approx g_t$).
- Let $A_t = \operatorname{diag}\!\left(\sqrt{\hat{v}_t} + \epsilon\right)$.

With these substitutions, the update becomes

$$\theta_{t+1} = \theta_t - \frac{\eta_t}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t,$$

which matches the Adam update rule.
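A quick numerical sanity check that the quadratic subproblem with $A_t = \operatorname{diag}\!\left(\sqrt{\hat{v}_t} + \epsilon\right)$ and gradient $\hat{m}_t$ has the Adam step as its minimiser (a sketch; the stand-in values for $\hat{m}_t$ and $\hat{v}_t$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_t = rng.normal(size=d)             # current iterate
m_hat = rng.normal(size=d)               # stand-in for the first moment estimate
v_hat = rng.uniform(0.1, 2.0, size=d)    # stand-in for the second moment estimate
lr, eps = 1e-3, 1e-8

A = np.diag(np.sqrt(v_hat) + eps)        # the positive definite regulariser A_t

# Minimiser of 0.5*(theta - theta_t)^T A (theta - theta_t) + lr * m_hat^T (theta - theta_t):
theta_subproblem = theta_t - lr * np.linalg.solve(A, m_hat)

# Closed-form Adam step:
theta_adam = theta_t - lr * m_hat / (np.sqrt(v_hat) + eps)

print(np.allclose(theta_subproblem, theta_adam))  # True
```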
1.1 As a Bayes procedure 1
Least squares is only ever one Bayesian saying “well, actually…” away from being a Gaussian posterior update.
Suppose the parameter vector at time $t$ has a Gaussian prior,

$$\theta \sim \mathcal{N}(\theta_t, \Sigma_t)$$

where

- $\theta_t$ is the mean of the prior distribution,
- $\Sigma_t$ is the covariance matrix of the prior, reflecting our confidence in the current estimate $\theta_t$.

The likelihood function is based on observing a gradient $g_t$ at $\theta_t$,

$$p(g_t \mid \theta) \propto \exp\!\left(-\eta_t\, g_t^\top (\theta - \theta_t)\right),$$

where the likelihood encourages movement in the direction of the negative gradient. The resulting log-posterior is

$$\log p(\theta \mid g_t) = -\tfrac{1}{2}\, (\theta - \theta_t)^\top \Sigma_t^{-1} (\theta - \theta_t) - \eta_t\, g_t^\top (\theta - \theta_t) + \text{const},$$

and its maximiser (which, the posterior being Gaussian, is also the posterior mean) is

$$\theta_{t+1} = \theta_t - \eta_t\, \Sigma_t\, g_t,$$

where $\Sigma_t = A_t^{-1}$.
This matches the update rule derived from the regularized least-squares problem and the Adam optimizer. Let’s consider when the likelihood function $p(g_t \mid \theta) \propto \exp\!\left(-\eta_t\, g_t^\top (\theta - \theta_t)\right)$ might be a meaningful choice. In those terms, what likelihood function have we just assumed? We need to find a likelihood function whose negative log contributes only the linear term $\eta_t\, g_t^\top \theta$. However, a standard Gaussian likelihood leads to a quadratic term in $\theta$, not a purely linear one.
Let’s define a synthetic observation

$$y_t = H_t \theta + \varepsilon_t,$$

where

- $H_t$ is a matrix we will define,
- $\varepsilon_t$ is Gaussian noise: $\varepsilon_t \sim \mathcal{N}(0, R_t)$.

Our goal is to choose $y_t$ and $H_t$ such that the negative log-likelihood corresponds to the linear term in our optimization problem.

The likelihood function is

$$p(y_t \mid \theta) \propto \exp\!\left(-\tfrac{1}{2}\, (y_t - H_t \theta)^\top R_t^{-1} (y_t - H_t \theta)\right).$$

The negative log-likelihood (up to a constant) is:

$$\tfrac{1}{2}\, \theta^\top H_t^\top R_t^{-1} H_t\, \theta - y_t^\top R_t^{-1} H_t\, \theta.$$
If we fix $H_t$ and $R_t$ and choose $y_t$ so that $H_t^\top R_t^{-1} y_t = -\eta_t\, g_t$, the linear term becomes $\eta_t\, g_t^\top \theta$, as required. If we assume that the quadratic term $H_t^\top R_t^{-1} H_t$ is negligible relative to the prior precision $\Sigma_t^{-1}$, we may drop it. This is an assumption that gives us a linear term in $\theta$ and nothing else from the likelihood.

To find the posterior mean, which is also the posterior maximum, we maximise the log-posterior

$$\log p(\theta \mid y_t) = -\tfrac{1}{2}\, (\theta - \theta_t)^\top \Sigma_t^{-1} (\theta - \theta_t) - \eta_t\, g_t^\top \theta + \text{const},$$

which gives

$$\theta_{t+1} = \theta_t - \eta_t\, \Sigma_t\, g_t.$$

Recall that $\Sigma_t = A_t^{-1} = \operatorname{diag}\!\left(\sqrt{\hat{v}_t} + \epsilon\right)^{-1}$, so with $g_t$ replaced by $\hat{m}_t$ this reproduces the Adam update once more.
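Here is a small numerical check of that claim, maximising the log-posterior directly with SciPy (a sketch under the assumptions above: diagonal $\Sigma_t = A_t^{-1}$ and the linear log-likelihood term; the variable names and stand-in values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
d = 4
theta_t = rng.normal(size=d)                    # prior mean (current iterate)
g_t = rng.normal(size=d)                        # observed gradient
v_hat = rng.uniform(0.1, 2.0, size=d)           # stand-in second moment estimate
lr, eps = 1e-2, 1e-8

Sigma = np.diag(1.0 / (np.sqrt(v_hat) + eps))   # prior covariance Sigma_t = A_t^{-1}
Prec = np.linalg.inv(Sigma)                     # prior precision

def neg_log_posterior(theta):
    prior_term = 0.5 * (theta - theta_t) @ Prec @ (theta - theta_t)
    lik_term = lr * g_t @ (theta - theta_t)     # the (improper) linear log-likelihood term
    return prior_term + lik_term

theta_map = minimize(neg_log_posterior, theta_t, method="BFGS",
                     options={"gtol": 1e-10}).x
theta_closed_form = theta_t - lr * Sigma @ g_t

print(np.allclose(theta_map, theta_closed_form, atol=1e-6))  # True
```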
While that was coherent, I am not sure I learned anything. This still seems arbitrary; for one thing, the Adam update uses a square root of a variance matrix, which does not naturally arise here.
1.2 As a Bayes procedure 2
Let us try something different and see if we can get a more plausible rationale. We propose that the uncertainty in the parameters is proportional to the square root of the gradient variance,

$$\Sigma_t \propto \operatorname{diag}\!\left(\sqrt{\hat{v}_t}\right).$$

This means that the standard deviation (uncertainty) of each parameter $\theta_i$ scales as $\hat{v}_{t,i}^{1/4}$. As before, the prior is

$$\theta \sim \mathcal{N}(\theta_t, \Sigma_t),$$

where

- $\theta_t$ is the mean of the prior distribution (current estimate),
- $\Sigma_t$ is the prior covariance matrix, reflecting our uncertainty about $\theta$.

We set the prior covariance matrix to

$$\Sigma_t = \eta_t^2\, \operatorname{diag}\!\left(\sqrt{\hat{v}_t}\right),$$

where $\eta_t$ sets the overall scale. For the likelihood, take a synthetic observation of where a plain averaged-gradient step would land,

$$y_t = \theta_t - \eta_t\, \hat{m}_t = \theta + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, R_t),$$

where the observation noise inherits the gradient’s second moment,

$$R_t = \eta_t^2\, \operatorname{diag}\!\left(\hat{v}_t\right),$$

where we have used the second moment estimate a second time, now as a proxy for the gradient noise. The Gaussian posterior mean is then

$$\theta_{t+1} = \theta_t + \Sigma_t (\Sigma_t + R_t)^{-1} (y_t - \theta_t).$$

Given that $y_t - \theta_t = -\eta_t\, \hat{m}_t$, the update is a per-coordinate shrinkage of the averaged-gradient step. Assuming the observation noise dominates the prior uncertainty, so that $\Sigma_t + R_t \approx R_t$, the gain simplifies to $\Sigma_t R_t^{-1} = \operatorname{diag}\!\left(\sqrt{\hat{v}_t}\right)^{-1}$. Substituting,

$$\theta_{t+1} \approx \theta_t - \eta_t\, \operatorname{diag}\!\left(\sqrt{\hat{v}_t}\right)^{-1} \hat{m}_t = \theta_t - \frac{\eta_t}{\sqrt{\hat{v}_t}}\, \hat{m}_t.$$

This matches the Adam update rule (neglecting the small constant $\epsilon$).
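A quick numerical illustration of that approximation (a sketch under the covariance choices above, $\Sigma_t = \eta_t^2 \operatorname{diag}(\sqrt{\hat{v}_t})$ and $R_t = \eta_t^2 \operatorname{diag}(\hat{v}_t)$, with arbitrary stand-in values):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
theta_t = rng.normal(size=d)
m_hat = rng.normal(size=d)                # stand-in first moment estimate
v_hat = rng.uniform(5.0, 50.0, size=d)    # stand-in second moment estimate (large, so sqrt(v_hat) >> 1)
lr = 1e-2

Sigma = lr ** 2 * np.sqrt(v_hat)          # prior variance per coordinate, proportional to sqrt(v_hat)
R = lr ** 2 * v_hat                       # pseudo-observation noise variance, proportional to v_hat
y = theta_t - lr * m_hat                  # pseudo-observation: a plain averaged-gradient step

# Exact Gaussian posterior mean (everything diagonal, so the gain is coordinate-wise):
gain = Sigma / (Sigma + R)
theta_exact = theta_t + gain * (y - theta_t)

# Approximate update when R dominates Sigma: gain ~= Sigma / R = 1 / sqrt(v_hat)
theta_adamish = theta_t - lr * m_hat / np.sqrt(v_hat)   # the Adam-style step (eps neglected)

print(np.max(np.abs(theta_exact - theta_adamish)))      # the gap shrinks as sqrt(v_hat) grows
```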
By setting the parameter uncertainty proportional to the square root of the gradient variance, we adjust the parameter updates based on how much we “trust” the gradient information:
- High gradient variance ($\hat{v}_{t,i}$ large): indicates high uncertainty in the gradient estimate for parameter $\theta_i$, leading to a small update due to the larger $\sqrt{\hat{v}_{t,i}}$ in the denominator.
- Low gradient variance ($\hat{v}_{t,i}$ small): indicates more reliable gradient information, leading to a larger update for that parameter.
The coincidence of the prior noise happening to be proportional to the square root of the gradient variance still feels a little bit arbitrary. We are using those 2nd moment estimates twice, once to invent the prior and then again for likelihood, and in a strange way.
1.3 As a Bayes procedure 3
OK, that was weirder than I expected, especially because of that danged square root. It can be eliminated, but with a more involved method (Lin et al. 2024). Maybe we should follow M. E. Khan and Rue (2024) and interpret it as a natural gradient method. TBC
1.4 As a gradient flow
See gradient flows
1.5 AdamW
TBD.
2 Nadam
TBD
3 Adagrad
TBD
4 RMSprop
TBD
5 As a natural gradient method
6 Sparse variants
TBD