Evolution strategies
2026-02-16 — 2026-02-18
Wherein a neural net is trained without backprop by Gaussian perturbations, fitness differences, and antithetic pairs, the same minibatch being shared across a population to temper variance.
A nature-inspired, population-based strategy for training neural nets without backprop, even though we usually assume that scale can only be achieved via backprop.
For the basics, see Evolution Strategies for beginners.
1 A running example
We’re reusing the running example from the ES-for-beginners post. Let us consider the challenge of training a tiny neural net without backprop.
We have a simple classifier \(f_\theta(x)\) (say, a two-layer MLP). The standard supervised objective is:
\[ L(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(f_\theta(x), y)\big]. \]
Backpropagation gives us \(\nabla_\theta L(\theta)\). ES assumes we won’t use that, and instead treats “run the model forward, then compute the loss” as a black box we can call:
- Input: the parameters \(\theta\)
- Output: a scalar score (loss, reward, accuracy, etc.)
To match the usual ES maximization setup, we define a fitness function \(F(\theta)\) to maximize, for example:
\[ F(\theta) = -L(\theta). \]
The ES loop looks like this:
- Sample random perturbations \(\varepsilon_i\).
- Evaluate fitness on the perturbed parameters \(\theta + \sigma \varepsilon_i\).
- Combine the results to update \(\theta\).
Here \(\sigma>0\) is the “noise scale” (how far we probe). A standard choice is Gaussian perturbations (why is it always Gaussian? It makes it feel like we’re doing SGLD):
\[ \varepsilon \sim \mathcal{N}(0, I). \]
Define the smoothed objective that ES targets:
\[ J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma \varepsilon)\big]. \]
Backprop optimizes \(F(\theta)\) directly (when it can). ES typically optimizes a Gaussian-smoothed version \(J(\theta)\).
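The loop above can be sketched end-to-end on a toy problem. The combination rule below is the standard score-function estimator (centred fitness, as in Salimans et al.), and the quadratic fitness is a stand-in for \(-L(\theta)\) rather than an actual MLP; all names and hyperparameters here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fitness to maximize: a quadratic bowl standing in for -L(theta).
theta_star = np.array([1.0, -2.0, 0.5])

def fitness(theta):
    return -np.sum((theta - theta_star) ** 2)

theta = np.zeros(3)
sigma, lr, N = 0.1, 0.05, 64   # noise scale, step size, population size

for step in range(300):
    eps = rng.standard_normal((N, theta.size))   # epsilon_i ~ N(0, I)
    F = np.array([fitness(theta + sigma * e) for e in eps])
    F = F - F.mean()                             # centre fitness (baseline)
    g = (eps.T @ F) / (N * sigma)                # ES estimate of grad J(theta)
    theta += lr * g                              # ascend the smoothed objective
```

After a few hundred iterations `theta` sits close to `theta_star`, despite never computing an analytic gradient — every update came from forward evaluations alone.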
2 Large data variants
So far this seems reasonable, but it also seems to assume full-batch losses rather than stochastic mini-batches, which, if the success of SGD is any guide, are the secret to NN scaling. In the basic version it sounds like every fitness evaluation needs the loss over the full dataset, which is a non-starter.
What does ES do instead to scale to large data? What are the analogues of mini-batches and SGD updates?
I asked an LLM to soothe my qualms, and it did, citing (Lenc et al. 2019; Salimans et al. 2017):
In practice you do the same thing you do with SGD: replace the true objective (an expectation over the data distribution) with a stochastic estimate from a minibatch, and treat the resulting extra noise as part of your gradient-estimation noise.
Write the population objective as an expectation over data \(z\sim \mathcal{D}\) (example \(z=(x,y)\)):
\[ F(\theta) \;=\; \mathbb{E}_{z\sim\mathcal{D}}\big[f(\theta;z)\big] \]
(For supervised learning you can think \(f(\theta;z)=-\ell(f_\theta(x),y)\).)
Given a minibatch \(B=\{z_j\}_{j=1}^b\), define the minibatch fitness
\[ \hat F(\theta;B) \;=\; \frac{1}{b}\sum_{j=1}^b f(\theta;z_j). \]
Now you just plug \(\hat F\) into the standard ES estimator. With antithetic pairs:
\[ \hat g(\theta;B)=\frac{1}{2N\sigma}\sum_{i=1}^N \Big(\hat F(\theta+\sigma\varepsilon_i;B)-\hat F(\theta-\sigma\varepsilon_i;B)\Big)\,\varepsilon_i, \qquad \varepsilon_i\sim\mathcal N(0,I). \]
and update \(\theta\leftarrow \theta + \eta\,\hat g\).
Key practical detail: use the same minibatch \(B\) for every member of the population within an iteration, and for both \(+\varepsilon_i\) and \(-\varepsilon_i\). This “common random numbers” idea at the data sampling level makes the difference \(\hat F(\theta+\sigma\varepsilon_i;B)-\hat F(\theta-\sigma\varepsilon_i;B)\) cancel a lot of minibatch noise and makes antithetic sampling actually do its job. If each perturbation sees a different minibatch, we inject extra variance that antithetic pairing cannot cancel.
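This variance-reduction claim is easy to check numerically. In the sketch below (a toy per-example fitness \(f(\theta;z)=-(\theta-z)^2\) with scalar \(\theta\); everything here is an illustrative assumption, not from the cited papers), we hold one perturbation fixed and compare the variance of the antithetic difference when both sides share a minibatch versus when each side draws its own:

```python
import numpy as np

rng = np.random.default_rng(1)

# Minibatch fitness for the toy per-example fitness f(theta; z) = -(theta - z)^2.
def fhat(theta, batch):
    return -np.mean((theta - batch) ** 2)

theta, sigma, b = 0.3, 0.1, 32
eps = 1.0  # fix one perturbation direction; only the data sampling varies

shared, independent = [], []
for _ in range(2000):
    B = rng.standard_normal(b)
    B2 = rng.standard_normal(b)
    # Common random numbers: same batch B on both sides of the pair.
    shared.append(fhat(theta + sigma * eps, B) - fhat(theta - sigma * eps, B))
    # Fresh batch per side: minibatch noise no longer cancels.
    independent.append(fhat(theta + sigma * eps, B) - fhat(theta - sigma * eps, B2))
```

The shared-batch differences come out with far smaller variance (roughly an order of magnitude here), because batch-dependent terms common to \(+\varepsilon\) and \(-\varepsilon\) subtract away exactly.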
A minimal “SGD-like” recipe:
- Sample a minibatch \(B\).
- Sample perturbations \(\{\varepsilon_i\}_{i=1}^N\) (and use antithetic pairs).
- Evaluate \(\hat F(\theta\pm\sigma\varepsilon_i;B)\) in parallel.
- Form \(\hat g(\theta;B)\), optionally normalize/center fitness within the population.
- Update \(\theta\).
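The recipe above can be written out for a small synthetic regression problem (a linear model rather than an MLP, to keep the sketch short; names and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data: y = X @ w_true + noise.
d, n = 5, 1024
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)

def minibatch_fitness(theta, Xb, yb):
    # \hat F(theta; B): negative mean-squared error on the minibatch.
    return -np.mean((Xb @ theta - yb) ** 2)

theta = np.zeros(d)
sigma, lr, N, b = 0.05, 0.1, 32, 64

for step in range(400):
    idx = rng.integers(0, n, size=b)          # 1. sample a minibatch B
    Xb, yb = X[idx], y[idx]
    eps = rng.standard_normal((N, d))         # 2. sample perturbations
    diffs = np.array([                        # 3. evaluate antithetic pairs,
        minibatch_fitness(theta + sigma * e, Xb, yb)      #    same B both sides
        - minibatch_fitness(theta - sigma * e, Xb, yb)
        for e in eps
    ])
    g = (eps.T @ diffs) / (2 * N * sigma)     # 4. form g_hat(theta; B)
    theta += lr * g                           # 5. update theta
```

The per-iteration evaluations at step 3 are embarrassingly parallel across the population, which is exactly where distributed ES implementations spend their budget.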
This looks like SGD, but the gradient is estimated by perturb-and-evaluate rather than backprop. The minibatch adds stochasticity exactly the way it does in SGD: it doesn’t “break” the algorithm, but it affects the variance and therefore the batch size / population size you need for stable progress.
3 ES at very large scale
Evolution Strategies at the Hyperscale (Sarkar et al. 2025) pushes this really far, and I’m pretty fascinated by what the authors pulled off. More soon.
