Evolution strategies for neural nets

2026-02-16 — 2026-02-16

Wherein neural nets are trained without backprop, by Gaussian perturbations and antithetic pairs; a smoothed fitness is maximized through many forward passes and only scalar reports.

Bayes
dynamical systems
likelihood free
linear algebra
machine learning
Monte Carlo
neural nets
nonparametric
particle
probability
sciml
signal processing
sparser than thou
statistics
statmech
uncertainty
distributed
optimization
probabilistic algorithms
Figure 1

A nature-inspired ensemble strategy that trains neural nets with evolution-like updates, even though we usually assume that scale can only be achieved via backprop.

1 Evolution Strategies for beginners

If we’ve trained neural networks with backprop, we’re used to this workflow:

  1. Pick a loss \(L(\theta)\) for parameters \(\theta\).
  2. Compute \(\nabla_\theta L(\theta)\) by differentiating through the network.
  3. Update \(\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\).

Evolution Strategies are an alternative way of handling step 2, with the remarkable possibility of working when backprop is painful or impossible: the system is non-differentiable, gradients are too noisy, the “model” includes discrete choices or simulator code, or we want to optimize something we only observe as an outcome (a score, a reward, an accuracy).

The neat trick is that ES can still look “gradient-like” in the end: we perturb parameters, evaluate the perturbed models, and combine those evaluations into an update that points (in expectation) in a direction of improvement. This family of techniques was developed in parallel with backprop for much of neural-network history, before backprop became dominant in the 2000s (Beyer 1995; Beyer and Schwefel 2002; Rechenberg 1978).

2 A running example

Let us consider the challenge of training a tiny neural net without backprop.

Suppose we have a simple classifier \(f_\theta(x)\) (say, a two-layer MLP). The standard supervised objective is:

\[ L(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(f_\theta(x), y)\big]. \]

Backprop gives us \(\nabla_\theta L(\theta)\). ES assumes we won’t use that, and instead treats “run the model forward and compute the loss” as a black box we can call:

  • Input: parameters \(\theta\)
  • Output: a scalar score (loss, reward, accuracy, etc.)

To match the usual ES maximization framing, we define a fitness function \(F(\theta)\) to maximize, for example:

\[ F(\theta) = -L(\theta). \]

The ES loop is as follows:

  1. Sample random perturbations \(\varepsilon_i\).
  2. Evaluate fitness at the perturbed parameters \(\theta + \sigma \varepsilon_i\).
  3. Combine the results into an update for \(\theta\).

Here \(\sigma>0\) is the “noise scale” (how far we probe).

3 Basic version

A standard choice is Gaussian perturbations (why is it always Gaussian? Makes it look like SGLD):

\[ \varepsilon \sim \mathcal{N}(0, I). \]

Define a smoothed objective:

\[ J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma \varepsilon)\big]. \]

Backprop optimizes \(F(\theta)\) directly (when it can). ES typically optimizes a Gaussian-smoothed version \(J(\theta)\).

This is reminiscent of NN training by Kalman filters, which also optimizes a smoothed objective, but with a fixed update ensemble.

That smoothing is advertised to have many benefits: it makes discontinuities and jagged landscapes more manageable, at the cost of a bias controlled by \(\sigma\).
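A one-dimensional toy example of what the smoothing buys us: take \(F\) to be a hard step. Then

\[ F(\theta) = \mathbf{1}\{\theta > 0\} \quad\Longrightarrow\quad J(\theta) = \mathbb{E}_{\varepsilon}\big[\mathbf{1}\{\theta + \sigma\varepsilon > 0\}\big] = \Phi(\theta/\sigma), \]

where \(\Phi\) is the standard normal CDF; the discontinuous step has become a smooth ramp whose sharpness is set by \(\sigma\).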

Fun fact: \(J(\theta)\) has a gradient even if \(F\) isn’t differentiable, and we can even write it without differentiating through \(F\):

\[ \nabla_\theta J(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\varepsilon}\big[F(\theta + \sigma \varepsilon)\,\varepsilon\big]. \]

This is a score-function / log-derivative trick specialised to Gaussians. We can read it as:

  • “If perturbations along direction \(\varepsilon\) tend to increase fitness, we move \(\theta\) along direction \(\varepsilon\).”
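For completeness, the derivation is one line of the Gaussian score-function trick:

\[ \nabla_\theta J(\theta) = \nabla_\theta \int F(x)\,\mathcal{N}(x;\theta,\sigma^2 I)\,dx = \int F(x)\,\frac{x-\theta}{\sigma^2}\,\mathcal{N}(x;\theta,\sigma^2 I)\,dx = \frac{1}{\sigma}\,\mathbb{E}_{\varepsilon}\big[F(\theta + \sigma \varepsilon)\,\varepsilon\big], \]

using \(\nabla_\theta \mathcal{N}(x;\theta,\sigma^2 I) = \frac{x-\theta}{\sigma^2}\,\mathcal{N}(x;\theta,\sigma^2 I)\) and then substituting \(x = \theta + \sigma\varepsilon\).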

A Monte Carlo estimate with \(N\) samples is:

\[ \hat{g} = \frac{1}{N\sigma}\sum_{i=1}^N F(\theta + \sigma \varepsilon_i)\,\varepsilon_i, \qquad \theta \leftarrow \theta + \eta \hat{g}. \]
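A minimal numpy sketch of this estimator and update, assuming a `fitness_fn` that maps a flat parameter vector to a scalar (names and hyperparameters here are illustrative, not prescriptive):

```python
import numpy as np

def es_gradient_estimate(theta, fitness_fn, n_samples=64, sigma=0.1, rng=None):
    """Monte Carlo estimate of grad J(theta) using only forward evaluations.

    theta      : flat parameter vector (np.ndarray)
    fitness_fn : maps a parameter vector to a scalar fitness F(theta)
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))          # epsilon_i ~ N(0, I)
    f = np.array([fitness_fn(theta + sigma * e) for e in eps])  # F(theta + sigma * epsilon_i)
    return (f[:, None] * eps).sum(axis=0) / (n_samples * sigma)

# One ascent step on the smoothed objective J:
# theta = theta + lr * es_gradient_estimate(theta, fitness_fn)
```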

4 Variance reduction by antithetic sampling

A variance reduction trick we can almost always use is antithetic pairs. For each \(\varepsilon_i\), evaluate both \(\theta+\sigma\varepsilon_i\) and \(\theta-\sigma\varepsilon_i\). Then take the difference:

\[ \hat{g}_{\pm} = \frac{1}{2N\sigma}\sum_{i=1}^N \big(F(\theta + \sigma \varepsilon_i) - F(\theta - \sigma \varepsilon_i)\big)\,\varepsilon_i. \]

Intuition: symmetric noise cancels a lot of baseline effects and reduces estimator variance.
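In code, the antithetic version is a small change to the sketch above: evaluate each \(\varepsilon_i\) at both signs and weight by the difference (again just a sketch, with `fitness_fn` assumed as before):

```python
import numpy as np

def es_gradient_antithetic(theta, fitness_fn, n_pairs=32, sigma=0.1, rng=None):
    """Antithetic ES gradient estimate: 2 * n_pairs forward passes, no backprop."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pairs, theta.size))
    f_plus = np.array([fitness_fn(theta + sigma * e) for e in eps])
    f_minus = np.array([fitness_fn(theta - sigma * e) for e in eps])
    # Symmetric differences cancel the baseline F(theta) and much of its noise.
    return ((f_plus - f_minus)[:, None] * eps).sum(axis=0) / (2 * n_pairs * sigma)
```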

If we want a “drop-in” recipe to replace gradient descent:

  • Replace “backward pass” with “evaluate \(2N\) forward passes under perturbed weights”.
  • Replace “gradient tensor” with “weighted sum of perturbation tensors”.

What stays the same:

  • We still run forward passes.
  • We still optimize a scalar objective.
  • We still do iterative updates.

What changes:

  1. No computational graph / no chain rule. Our model can include discrete ops, integer arithmetic, if-statements, simulators, non-differentiable reward shaping—anything.
  2. We pay in evaluations. ES typically needs a population of evaluations per update to get a decent gradient estimate.
  3. Parallelism looks different. Each population member is independent: evaluate on different devices, report back one scalar fitness each, combine centrally (or via all-reduce on scalars).

A useful way to frame it:

  • Backprop cost is “one forward + one backward” (plus distributed gradient comms).
  • ES cost is “many forwards, no backward” (plus scalar comms).
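To make the “scalar comms” point concrete, here is a sketch of the shared-seed trick in the spirit of Salimans et al. (2017): each worker evaluates one perturbation and reports only a seed and a scalar fitness, and the perturbations are reconstructed locally from the seeds. Function names and the communication layer are placeholders.

```python
import numpy as np

def worker_evaluate(theta, fitness_fn, seed, sigma):
    """Run on one worker: evaluate a single perturbation, return only scalars."""
    eps = np.random.default_rng(seed).standard_normal(theta.size)
    return seed, fitness_fn(theta + sigma * eps)

def combine_reports(theta, reports, sigma, lr):
    """Run centrally (or identically on every worker): rebuild eps from seeds."""
    g_hat = np.zeros_like(theta)
    for seed, fitness in reports:
        eps = np.random.default_rng(seed).standard_normal(theta.size)
        g_hat += fitness * eps
    return theta + lr * g_hat / (len(reports) * sigma)
```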

So ES tends to make sense when:

  • forward passes are cheap or massively parallelizable,
  • backward passes are awkward/expensive/impossible,
  • the objective is noisy or outcome-only,
  • we want robustness to nasty landscapes (e.g., long-horizon RL, recurrent instability).

And ES tends to be a bad deal when:

  • gradients are available and stable,
  • we’re compute-limited and can’t parallelize lots of forwards,
  • we need the sample efficiency of gradient methods.

5 Practical ES details we need to not get wrecked by variance

An LLM advises me that I’ll need to stabilize variance using some tricks:

  1. Fitness normalization / shaping. Replace raw \(F_i\) with normalized scores (subtract mean and divide by std), or rank-transform them. This reduces sensitivity to outliers and scale; see the sketch after this list.
  2. Common random numbers. Use the same minibatch/data for evaluating the population in one generation so that differences are due to parameters, not data noise. This sounds like a spanner in the works to me; see my argument with an LLM under large data variants below.
  3. Control variates / baselines. The antithetic form is one; subtracting the mean fitness is another.
  4. Step-size and \(\sigma\) scheduling. \(\sigma\) is both the probe scale and the smoothing radius, so it affects both bias and variance:
  • ES is optimizing \(J(\theta)=\mathbb{E}[F(\theta+\sigma\varepsilon)]\).
  • If \(\sigma\) is large, we are not optimizing the original objective tightly; we are optimizing a blurred version of it.
  • If \(\sigma\) is tiny, the estimator variance can blow up unless \(N\) is large.
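The fitness-shaping sketch promised in item 1: a centred rank transform, which makes the update depend only on the ordering of the fitnesses and so is robust to outliers and scale (a sketch, not a canonical implementation):

```python
import numpy as np

def centered_ranks(fitnesses):
    """Map raw fitnesses to evenly spaced values in [-0.5, 0.5] by rank."""
    fitnesses = np.asarray(fitnesses, dtype=np.float64)
    ranks = np.empty_like(fitnesses)
    ranks[np.argsort(fitnesses)] = np.arange(len(fitnesses))
    return ranks / (len(fitnesses) - 1) - 0.5

# Use the shaped scores in place of the raw F_i in the ES estimator:
# shaped = centered_ranks(f)
```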

This all seems reasonable, but it seems to presume full-batch losses rather than stochastic mini-batches, and mini-batch SGD is the secret to NN scaling. What does ES do instead?

6 Large data variants

Minibatches, SGD-analogues.

Above it sounds like we need the full data likelihood, which is a non-starter.

I asked an LLM to soothe my qualms in this section, which it did, citing (Lenc et al. 2019; Salimans et al. 2017).

It doesn’t. In practice you do the same thing you do with SGD: replace the true objective (an expectation over the data distribution) with a stochastic estimate from a minibatch, and treat the resulting extra noise as part of your gradient-estimation noise.

Write the population objective as an expectation over data \(z\sim \mathcal{D}\) (example \(z=(x,y)\)):

\[ F(\theta) \;=\; \mathbb{E}_{z\sim\mathcal{D}}\big[f(\theta;z)\big] \]

(For supervised learning you can think \(f(\theta;z)=-\ell(f_\theta(x),y)\).)

Given a minibatch \(B=\{z_j\}_{j=1}^b\), define the minibatch fitness

\[ \hat F(\theta;B) \;=\; \frac{1}{b}\sum_{j=1}^b f(\theta;z_j). \]

Now you just plug \(\hat F\) into the standard ES estimator. With antithetic pairs:

\[ \hat g(\theta;B)=\frac{1}{2N\sigma}\sum_{i=1}^N \Big(\hat F(\theta+\sigma\varepsilon_i;B)-\hat F(\theta-\sigma\varepsilon_i;B)\Big)\,\varepsilon_i, \qquad \varepsilon_i\sim\mathcal N(0,I). \]

and update \(\theta\leftarrow \theta + \eta\,\hat g\).

Key practical detail: use the same minibatch \(B\) for every member of the population within an iteration, and for both \(+\varepsilon_i\) and \(-\varepsilon_i\). This “common random numbers” idea at the data sampling level makes the difference \(\hat F(\theta+\sigma\varepsilon_i;B)-\hat F(\theta-\sigma\varepsilon_i;B)\) cancel a lot of minibatch noise and makes antithetic sampling actually do its job. If each perturbation sees a different minibatch, we inject extra variance that antithetic pairing cannot cancel.

A minimal “SGD-like” recipe:

  1. Sample a minibatch \(B\).
  2. Sample perturbations \(\{\varepsilon_i\}_{i=1}^N\) (and use antithetic pairs).
  3. Evaluate \(\hat F(\theta\pm\sigma\varepsilon_i;B)\) in parallel.
  4. Form \(\hat g(\theta;B)\), optionally normalize/center fitness within the population.
  5. Update \(\theta\).

This looks like SGD, but the gradient is estimated by perturb-and-evaluate rather than backprop. The minibatch adds stochasticity exactly the way it does in SGD: it doesn’t “break” the algorithm, but it affects the variance and therefore the batch size / population size you need for stable progress.
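A minimal sketch of one such step, combining the pieces above (the same minibatch for every population member and both signs, antithetic pairs, population-level score normalization). Here `fitness_fn(theta, batch)` is assumed to return the minibatch fitness \(\hat F(\theta;B)\); all names and hyperparameters are illustrative.

```python
import numpy as np

def es_minibatch_step(theta, fitness_fn, batch, n_pairs=32, sigma=0.02, lr=0.01, rng=None):
    """One SGD-like ES update using a single shared minibatch."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pairs, theta.size))
    # Common random numbers: the same `batch` for every +/- perturbation.
    f_plus = np.array([fitness_fn(theta + sigma * e, batch) for e in eps])
    f_minus = np.array([fitness_fn(theta - sigma * e, batch) for e in eps])
    scores = f_plus - f_minus
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # optional shaping
    g_hat = (scores[:, None] * eps).sum(axis=0) / (2 * n_pairs * sigma)
    return theta + lr * g_hat
```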

7 ES at very large scale

Evolution Strategies at the Hyperscale (Sarkar et al. 2025) pushes this all very far and I’m quite fascinated by it.

8 Incoming

9 References

Beyer. 1995. “Toward a Theory of Evolution Strategies: Self-Adaptation.” Evolutionary Computation.
Beyer, and Schwefel. 2002. “Evolution Strategies – A Comprehensive Introduction.” Natural Computing.
Bown, and Lexer. 2006. “Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance.” In Applications of Evolutionary Computing. Lecture Notes in Computer Science 3907.
Floreano, and Mattiussi. 2008. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies (Intelligent Robotics and Autonomous Agents).
Hansen. 2023. “The CMA Evolution Strategy: A Tutorial.”
Hansen, and Ostermeier. 2001. “Completely Derandomized Self-Adaptation in Evolution Strategies.” Evolutionary Computation.
Lenc, Elsen, Schaul, et al. 2019. “Non-Differentiable Supervised Learning with Evolution Strategies and Hybrid Methods.”
Rechenberg. 1978. “Evolutionsstrategien.” In Simulationsmethoden in Der Medizin Und Biologie. Medizinische Informatik Und Statistik.
Salimans, Ho, Chen, et al. 2017. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.”
Sarkar, Fellows, Duque, et al. 2025. “Evolution Strategies at the Hyperscale.”
Vanchurin, Wolf, Katsnelson, et al. 2021. “Towards a Theory of Evolution as Multilevel Learning.”
Whitley, Starkweather, and Bogart. 1990. “Genetic Algorithms and Neural Networks: Optimizing Connections and Connectivity.” Parallel Computing.