Evolution strategies for neural nets

2026-02-16 — 2026-02-16

Wherein neural nets are trained without backprop, by Gaussian perturbations and antithetic pairs; a smoothed fitness is maximized through many forward passes and only scalar reports.

Bayes
dynamical systems
likelihood free
linear algebra
machine learning
Monte Carlo
neural nets
nonparametric
particle
probability
sciml
signal processing
sparser than thou
statistics
statmech
uncertainty
distributed
optimization
probabilistic algorithms
Figure 1

A nature-inspired ensemble strategy that trains neural nets with evolution-like updates, even though we usually assume that scale can only be achieved via backprop.

1 Evolution Strategies for beginners

If we’ve trained neural networks with backprop, we’re used to this workflow:

  1. Pick a loss \(L(\theta)\) for parameters \(\theta\).
  2. Compute \(\nabla_\theta L(\theta)\) by differentiating through the network.
  3. Update \(\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\).

Evolution Strategies are an alternative way of handling step 2, with the remarkable possibility of working when backprop is painful or impossible: the system is non-differentiable, gradients are too noisy, the “model” includes discrete choices or simulator code, or we want to optimize something we only observe as an outcome (a score, a reward, an accuracy).

The neat trick is that ES can still look “gradient-like” in the end: we perturb parameters, evaluate the perturbed models, and combine those evaluations into an update that points (in expectation) in a direction of improvement. This family of techniques was developed in parallel with backprop for much of neural-network history, before backprop became dominant in the 2000s (Beyer 1995; Beyer and Schwefel 2002; Rechenberg 1978).

2 A running example

Let us consider the challenge of training a tiny neural net without backprop.

Suppose we have a simple classifier \(f_\theta(x)\) (say, a two-layer MLP). The standard supervised objective is:

\[ L(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(f_\theta(x), y)\big]. \]

Backprop gives us \(\nabla_\theta L(\theta)\). ES assumes we won’t use that, and instead treats “run the model forward and compute the loss” as a black box we can call:

  • Input: parameters \(\theta\)
  • Output: a scalar score (loss, reward, accuracy, etc.)

To match the usual ES maximization framing, we define a fitness function \(F(\theta)\) to maximize, for example:

\[ F(\theta) = -L(\theta). \]

The ES loop is as follows:

  1. Sample random perturbations \(\varepsilon_i\).
  2. Evaluate fitness at the perturbed parameters \(\theta + \sigma \varepsilon_i\).
  3. Combine the results into an update for \(\theta\).

Here \(\sigma>0\) is the “noise scale” (how far we probe).

3 Basic version

A standard choice is Gaussian perturbations (why is it always Gaussian? Makes it look like SGLD):

\[ \varepsilon \sim \mathcal{N}(0, I). \]

Define a smoothed objective:

\[ J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma \varepsilon)\big]. \]

Backprop optimizes \(F(\theta)\) directly (when it can). ES typically optimizes a Gaussian-smoothed version \(J(\theta)\).

This is reminiscent of NN training by Kalman filters, which also optimizes a smoothed objective, but with a fixed update ensemble.

That smoothing is advertised to have many benefits: it makes discontinuities and jagged landscapes more manageable, at the cost of a bias controlled by \(\sigma\).
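A one-dimensional toy example of what the smoothing buys us: take \(F\) to be a hard step. Then

\[ F(\theta) = \mathbf{1}\{\theta > 0\} \quad\Longrightarrow\quad J(\theta) = \mathbb{E}_{\varepsilon}\big[\mathbf{1}\{\theta + \sigma\varepsilon > 0\}\big] = \Phi(\theta/\sigma), \]

where \(\Phi\) is the standard normal CDF; the discontinuous step has become a smooth ramp whose sharpness is set by \(\sigma\).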

Fun fact: \(J(\theta)\) has a gradient even if \(F\) isn’t differentiable, and we can even write it without differentiating through \(F\):

\[ \nabla_\theta J(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\varepsilon}\big[F(\theta + \sigma \varepsilon)\,\varepsilon\big]. \]

This is a score-function / log-derivative trick specialised to Gaussians. We can read it as:

  • “If perturbations along direction \(\varepsilon\) tend to increase fitness, we move \(\theta\) along direction \(\varepsilon\).”
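For completeness, the derivation is one line of the Gaussian score-function trick:

\[ \nabla_\theta J(\theta) = \nabla_\theta \int F(x)\,\mathcal{N}(x;\theta,\sigma^2 I)\,dx = \int F(x)\,\frac{x-\theta}{\sigma^2}\,\mathcal{N}(x;\theta,\sigma^2 I)\,dx = \frac{1}{\sigma}\,\mathbb{E}_{\varepsilon}\big[F(\theta + \sigma \varepsilon)\,\varepsilon\big], \]

using \(\nabla_\theta \mathcal{N}(x;\theta,\sigma^2 I) = \frac{x-\theta}{\sigma^2}\,\mathcal{N}(x;\theta,\sigma^2 I)\) and then substituting \(x = \theta + \sigma\varepsilon\).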

A Monte Carlo estimate with \(N\) samples is:

\[ \hat{g} = \frac{1}{N\sigma}\sum_{i=1}^N F(\theta + \sigma \varepsilon_i)\,\varepsilon_i, \qquad \theta \leftarrow \theta + \eta \hat{g}. \]
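A minimal numpy sketch of this estimator and update, assuming a `fitness_fn` that maps a flat parameter vector to a scalar (names and hyperparameters here are illustrative, not prescriptive):

```python
import numpy as np

def es_gradient_estimate(theta, fitness_fn, n_samples=64, sigma=0.1, rng=None):
    """Monte Carlo estimate of grad J(theta) using only forward evaluations.

    theta      : flat parameter vector (np.ndarray)
    fitness_fn : maps a parameter vector to a scalar fitness F(theta)
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))          # epsilon_i ~ N(0, I)
    f = np.array([fitness_fn(theta + sigma * e) for e in eps])  # F(theta + sigma * epsilon_i)
    return (f[:, None] * eps).sum(axis=0) / (n_samples * sigma)

# One ascent step on the smoothed objective J:
# theta = theta + lr * es_gradient_estimate(theta, fitness_fn)
```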

4 Variance reduction by antithetic sampling

A variance reduction trick we can almost always use is antithetic pairs. For each \(\varepsilon_i\), evaluate both \(\theta+\sigma\varepsilon_i\) and \(\theta-\sigma\varepsilon_i\). Then take the difference:

\[ \hat{g}_{\pm} = \frac{1}{2N\sigma}\sum_{i=1}^N \big(F(\theta + \sigma \varepsilon_i) - F(\theta - \sigma \varepsilon_i)\big)\,\varepsilon_i. \]

Intuition: symmetric noise cancels a lot of baseline effects and reduces estimator variance.
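In code, the antithetic version is a small change to the sketch above: evaluate each \(\varepsilon_i\) at both signs and weight by the difference (again just a sketch, with `fitness_fn` assumed as before):

```python
import numpy as np

def es_gradient_antithetic(theta, fitness_fn, n_pairs=32, sigma=0.1, rng=None):
    """Antithetic ES gradient estimate: 2 * n_pairs forward passes, no backprop."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pairs, theta.size))
    f_plus = np.array([fitness_fn(theta + sigma * e) for e in eps])
    f_minus = np.array([fitness_fn(theta - sigma * e) for e in eps])
    # Symmetric differences cancel the baseline F(theta) and much of its noise.
    return ((f_plus - f_minus)[:, None] * eps).sum(axis=0) / (2 * n_pairs * sigma)
```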

If we want a “drop-in” recipe to replace gradient descent:

  • Replace “backward pass” with “evaluate \(2N\) forward passes under perturbed weights”.
  • Replace “gradient tensor” with “weighted sum of perturbation tensors”.

What stays the same:

  • We still run forward passes.
  • We still optimize a scalar objective.
  • We still do iterative updates.

What changes:

  1. No computational graph / no chain rule. Our model can include discrete ops, integer arithmetic, if-statements, simulators, non-differentiable reward shaping—anything.
  2. We pay in evaluations. ES typically needs a population of evaluations per update to get a decent gradient estimate.
  3. Parallelism looks different. Each population member is independent: evaluate on different devices, report back one scalar fitness each, combine centrally (or via all-reduce on scalars).

A useful way to frame it:

  • Backprop cost is “one forward + one backward” (plus distributed gradient comms).
  • ES cost is “many forwards, no backward” (plus scalar comms).
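To make the “scalar comms” point concrete, here is a sketch of the shared-seed trick in the spirit of Salimans et al. (2017): each worker evaluates one perturbation and reports only a seed and a scalar fitness, and the perturbations are reconstructed locally from the seeds. Function names and the communication layer are placeholders.

```python
import numpy as np

def worker_evaluate(theta, fitness_fn, seed, sigma):
    """Run on one worker: evaluate a single perturbation, return only scalars."""
    eps = np.random.default_rng(seed).standard_normal(theta.size)
    return seed, fitness_fn(theta + sigma * eps)

def combine_reports(theta, reports, sigma, lr):
    """Run centrally (or identically on every worker): rebuild eps from seeds."""
    g_hat = np.zeros_like(theta)
    for seed, fitness in reports:
        eps = np.random.default_rng(seed).standard_normal(theta.size)
        g_hat += fitness * eps
    return theta + lr * g_hat / (len(reports) * sigma)
```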

So ES tends to make sense when:

  • forward passes are cheap or massively parallelizable,
  • backward passes are awkward/expensive/impossible,
  • the objective is noisy or outcome-only,
  • we want robustness to nasty landscapes (e.g., long-horizon RL, recurrent instability).

And ES tends to be a bad deal when:

  • gradients are available and stable,
  • we’re compute-limited and can’t parallelize lots of forwards,
  • we need the sample efficiency of gradient methods.

5 Practical ES details we need to not get wrecked by variance

An LLM advises me that I’ll need to stabilize variance using some tricks:

  1. Fitness normalization / shaping. Replace raw \(F_i\) with normalized scores (subtract mean and divide by std), or rank-transform them. This reduces sensitivity to outliers and scale; see the sketch after this list.
  2. Common random numbers. Use the same minibatch/data for evaluating the population in one generation so that differences are due to parameters, not data noise. This sounds like a spanner in the works to me; see my argument with an LLM under large data variants below.
  3. Control variates / baselines. The antithetic form is one; subtracting the mean fitness is another.
  4. Step-size and \(\sigma\) scheduling. \(\sigma\) is both the probe scale and the smoothing radius, so it affects both bias and variance:
  • ES is optimizing \(J(\theta)=\mathbb{E}[F(\theta+\sigma\varepsilon)]\).
  • If \(\sigma\) is large, we are not optimizing the original objective tightly; we are optimizing a blurred version of it.
  • If \(\sigma\) is tiny, the estimator variance can blow up unless \(N\) is large.
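The fitness-shaping sketch promised in item 1: a centred rank transform, which makes the update depend only on the ordering of the fitnesses and so is robust to outliers and scale (a sketch, not a canonical implementation):

```python
import numpy as np

def centered_ranks(fitnesses):
    """Map raw fitnesses to evenly spaced values in [-0.5, 0.5] by rank."""
    fitnesses = np.asarray(fitnesses, dtype=np.float64)
    ranks = np.empty_like(fitnesses)
    ranks[np.argsort(fitnesses)] = np.arange(len(fitnesses))
    return ranks / (len(fitnesses) - 1) - 0.5

# Use the shaped scores in place of the raw F_i in the ES estimator:
# shaped = centered_ranks(f)
```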

This all seems reasonable, but it seems to presume full-batch losses rather than stochastic mini-batches, and mini-batch SGD is the secret to NN scaling. What does ES do instead?

6 Large data variants

Minibatches, SGD-analogues.

Above it sounds like we need the full data likelihood, which is a non-starter.

I asked an LLM to soothe my qualms in this section, which it did, citing (Lenc et al. 2019; Salimans et al. 2017).

It doesn’t. In practice you do the same thing you do with SGD: replace the true objective (an expectation over the data distribution) with a stochastic estimate from a minibatch, and treat the resulting extra noise as part of your gradient-estimation noise.

Write the population objective as an expectation over data \(z\sim \mathcal{D}\) (example \(z=(x,y)\)):

\[ F(\theta) \;=\; \mathbb{E}_{z\sim\mathcal{D}}\big[f(\theta;z)\big] \]

(For supervised learning you can think \(f(\theta;z)=-\ell(f_\theta(x),y)\).)

Given a minibatch \(B=\{z_j\}_{j=1}^b\), define the minibatch fitness

\[ \hat F(\theta;B) \;=\; \frac{1}{b}\sum_{j=1}^b f(\theta;z_j). \]

Now you just plug \(\hat F\) into the standard ES estimator. With antithetic pairs:

\[ \hat g(\theta;B)=\frac{1}{2N\sigma}\sum_{i=1}^N \Big(\hat F(\theta+\sigma\varepsilon_i;B)-\hat F(\theta-\sigma\varepsilon_i;B)\Big)\,\varepsilon_i, \qquad \varepsilon_i\sim\mathcal N(0,I). \]

and update \(\theta\leftarrow \theta + \eta\,\hat g\).

Key practical detail: use the same minibatch \(B\) for every member of the population within an iteration, and for both \(+\varepsilon_i\) and \(-\varepsilon_i\). This “common random numbers” idea at the data sampling level makes the difference \(\hat F(\theta+\sigma\varepsilon_i;B)-\hat F(\theta-\sigma\varepsilon_i;B)\) cancel a lot of minibatch noise and makes antithetic sampling actually do its job. If each perturbation sees a different minibatch, we inject extra variance that antithetic pairing cannot cancel.

A minimal “SGD-like” recipe:

  1. Sample a minibatch \(B\).
  2. Sample perturbations \(\{\varepsilon_i\}_{i=1}^N\) (and use antithetic pairs).
  3. Evaluate \(\hat F(\theta\pm\sigma\varepsilon_i;B)\) in parallel.
  4. Form \(\hat g(\theta;B)\), optionally normalize/center fitness within the population.
  5. Update \(\theta\).

This looks like SGD, but the gradient is estimated by perturb-and-evaluate rather than backprop. The minibatch adds stochasticity exactly the way it does in SGD: it doesn’t “break” the algorithm, but it affects the variance and therefore the batch size / population size you need for stable progress.
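A minimal sketch of one such step, combining the pieces above (the same minibatch for every population member and both signs, antithetic pairs, population-level score normalization). Here `fitness_fn(theta, batch)` is assumed to return the minibatch fitness \(\hat F(\theta;B)\); all names and hyperparameters are illustrative.

```python
import numpy as np

def es_minibatch_step(theta, fitness_fn, batch, n_pairs=32, sigma=0.02, lr=0.01, rng=None):
    """One SGD-like ES update using a single shared minibatch."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pairs, theta.size))
    # Common random numbers: the same `batch` for every +/- perturbation.
    f_plus = np.array([fitness_fn(theta + sigma * e, batch) for e in eps])
    f_minus = np.array([fitness_fn(theta - sigma * e, batch) for e in eps])
    scores = f_plus - f_minus
    scores = (scores - scores.mean()) / (scores.std() + 1e-8)  # optional shaping
    g_hat = (scores[:, None] * eps).sum(axis=0) / (2 * n_pairs * sigma)
    return theta + lr * g_hat
```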

7 ES at very large scale

Evolution Strategies at the Hyperscale (Sarkar et al. 2025) pushes this all very far and I’m quite fascinated by it.

8 Incoming

9 References

Beyer. 1995. “Toward a Theory of Evolution Strategies: Self-Adaptation.” Evolutionary Computation.
Beyer, and Schwefel. 2002. “Evolution Strategies – A Comprehensive Introduction.” Natural Computing.
Bown, and Lexer. 2006. “Continuous-Time Recurrent Neural Networks for Generative and Interactive Musical Performance.” In Applications of Evolutionary Computing. Lecture Notes in Computer Science 3907.
Floreano, and Mattiussi. 2008. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies (Intelligent Robotics and Autonomous Agents).
Hansen. 2023. “The CMA Evolution Strategy: A Tutorial.”
Hansen, and Ostermeier. 2001. “Completely Derandomized Self-Adaptation in Evolution Strategies.” Evolutionary Computation.
Lenc, Elsen, Schaul, et al. 2019. “Non-Differentiable Supervised Learning with Evolution Strategies and Hybrid Methods.”
Rechenberg. 1978. “Evolutionsstrategien.” In Simulationsmethoden in Der Medizin Und Biologie. Medizinische Informatik Und Statistik.
Salimans, Ho, Chen, et al. 2017. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.”
Sarkar, Fellows, Duque, et al. 2025. “Evolution Strategies at the Hyperscale.”
Vanchurin, Wolf, Katsnelson, et al. 2021. “Towards a Theory of Evolution as Multilevel Learning.”
Whitley, Starkweather, and Bogart. 1990. “Genetic Algorithms and Neural Networks: Optimizing Connections and Connectivity.” Parallel Computing.