Evolution strategies

Slimmest possible implementation of “evolutionary” optimization

2026-02-16 — 2026-02-18

Wherein a Gaussian-smoothed fitness is sought without backpropagation, by many parallel forward evaluations with antithetic perturbations, and updates are formed from scalar rewards.

Bayes
dynamical systems
likelihood free
linear algebra
machine learning
Monte Carlo
neural nets
nonparametric
particle
probability
sciml
signal processing
sparser than thou
statistics
statmech
uncertainty
distributed
optimization
probabilistic algorithms

Evolution Strategies are a nature-inspired ensemble approach to optimization. I ignored this family for ages, but it includes many interesting and suggestive ideas. It’s also not much like other, more earnest attempts to copy evolutionary behaviour, such as the genetic programming approach to symbolic regression, which has not been that useful. Instead, it’s perhaps best thought of as a clever Monte Carlo method for optimization. It resembles Stein variational gradient descent if you squint at it funny.

ES (Evolution Strategies) is a relatively generic optimization method, but let’s talk about it in terms of neural networks, because that is what I’m thinking about, and also what the elderly literature needs updating for.

1 Evolution Strategies for beginners

If we’ve trained neural networks with backprop, we’re used to this workflow:

  1. Pick a loss \(L(\theta)\) for parameters \(\theta\).
  2. Compute \(\nabla_\theta L(\theta)\) by differentiating through the network.
  3. Update \(\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\).

Evolution Strategies are an alternative way of handling step 2, with the remarkable possibility of working when backprop is painful or impossible: when the system is non-differentiable, gradients are too noisy, the “model” includes discrete choices or simulator code, or we want to optimize something we only observe as an outcome (a score, a reward, an accuracy).

The neat trick is that ES can still look “gradient-like” in the end: we perturb parameters, evaluate the perturbed models, and combine those evaluations into an update that points (in expectation) in a direction of improvement. This technique was used in parallel with backprop for much of the field’s history, before backprop became dominant in the 2000s (Beyer 1995; Beyer and Schwefel 2002; Rechenberg 1978).

2 A running example

Let us consider the challenge of training a tiny neural net without backprop.

Suppose we have a simple classifier \(f_\theta(x)\) (say, a two-layer MLP). The standard supervised objective is:

\[ L(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(f_\theta(x), y)\big]. \]

Backpropagation gives us \(\nabla_\theta L(\theta)\). ES assumes we won’t use it, and instead treats “run the model forward, then compute the loss” as a black box we can call:

  • Input: the parameters \(\theta\)
  • Output: a scalar score (loss, reward, accuracy, etc.)

To match the usual ES maximization framing, we define a fitness function \(F(\theta)\) that we maximize; for example:

\[ F(\theta) = -L(\theta). \]
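To make the later snippets concrete, here is a minimal numpy sketch of such a black-box fitness: a two-layer MLP evaluated forward-only on toy data, returning the mean log-likelihood as \(F(\theta)\). Everything here (shapes, the toy data, the flatten/unflatten convention) is an arbitrary choice for illustration, not any particular library’s API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a tiny two-layer MLP, purely as a stand-in "black box".
X = rng.normal(size=(256, 10))                    # 256 examples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # an easy binary label
D_IN, D_HID, D_OUT = 10, 16, 2
N_PARAMS = D_IN * D_HID + D_HID + D_HID * D_OUT + D_OUT

def unflatten(theta):
    """Slice a flat parameter vector into the MLP's weights and biases."""
    i = 0
    W1 = theta[i:i + D_IN * D_HID].reshape(D_IN, D_HID); i += D_IN * D_HID
    b1 = theta[i:i + D_HID]; i += D_HID
    W2 = theta[i:i + D_HID * D_OUT].reshape(D_HID, D_OUT); i += D_HID * D_OUT
    b2 = theta[i:i + D_OUT]
    return W1, b1, W2, b2

def fitness(theta, X=X, y=y):
    """Forward pass only: mean log-likelihood, i.e. F(theta) = -L(theta)."""
    W1, b1, W2, b2 = unflatten(theta)
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)                 # stabilize softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return logp[np.arange(len(y)), y].mean()
```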

The ES loop goes like this:

  1. Sample random perturbations \(\varepsilon_i\).
  2. Evaluate fitness at the perturbed parameters \(\theta + \sigma \varepsilon_i\).
  3. Combine the results into an update for \(\theta\).

Here \(\sigma>0\) is the “noise scale” (how far we probe).

3 Basic version

A standard choice is Gaussian perturbations (why is it always Gaussian, anyway? It feels like SGLD):

\[ \varepsilon \sim \mathcal{N}(0, I). \]

Let’s define the smoothed objective:

\[ J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma \varepsilon)\big]. \]

Backprop optimizes \(F(\theta)\) directly (when it can). ES typically optimizes a Gaussian-smoothed version, \(J(\theta)\).

We’ve seen this before in NN training by Kalman filters, which also optimizes a smoothed objective, but with a fixed update ensemble.

We often point to that smoothing as having lots of benefits—it makes discontinuities and jagged landscapes more manageable, at the cost of a bias we control with \(\sigma\).

Fun fact: \(J(\theta)\) has a gradient even if \(F\) isn’t differentiable, and we can even write it without differentiating through \(F\):

\[ \nabla_\theta J(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\varepsilon}\big[F(\theta + \sigma \varepsilon)\,\varepsilon\big]. \]

This is the score-function / log-derivative (likelihood-ratio) trick specialized to Gaussians. We can read this as:

  • “If perturbations in direction \(\varepsilon\) tend to increase fitness, we move \(\theta\) in direction \(\varepsilon\).”

A Monte Carlo estimate with \(N\) samples is:

\[ \hat{g} = \frac{1}{N\sigma}\sum_{i=1}^N F(\theta + \sigma \varepsilon_i)\,\varepsilon_i, \qquad \theta \leftarrow \theta + \eta \hat{g}. \]
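A direct numpy translation of that estimator and update, reusing the fitness sketch above; \(\sigma\), \(N\), \(\eta\), and the step count are arbitrary illustration values. As written (no baseline, no antithetic pairs) the estimate is quite noisy, which is exactly what the variance-reduction tricks below address.

```python
def es_gradient(theta, sigma=0.1, n=100, rng=rng):
    """Monte Carlo estimate: (1 / (N*sigma)) * sum_i F(theta + sigma*eps_i) * eps_i."""
    eps = rng.normal(size=(n, theta.size))
    F = np.array([fitness(theta + sigma * e) for e in eps])
    return (eps * F[:, None]).mean(axis=0) / sigma

theta = 0.1 * rng.normal(size=N_PARAMS)
eta = 0.05
for step in range(100):
    theta = theta + eta * es_gradient(theta)   # gradient *ascent* on the smoothed fitness
```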

4 Variance reduction by antithetic sampling

A variance-reduction trick we can almost always use is antithetic pairs. For each \(\varepsilon_i\), evaluate both \(\theta+\sigma\varepsilon_i\) and \(\theta-\sigma\varepsilon_i\). Then take the difference between them:

\[ \hat{g}_{\pm} = \frac{1}{2N\sigma}\sum_{i=1}^N \big(F(\theta + \sigma \varepsilon_i) - F(\theta - \sigma \varepsilon_i)\big)\,\varepsilon_i. \]

Intuition: symmetric noise cancels a lot of baseline effects and reduces estimator variance.
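The same estimator with antithetic pairs, again reusing fitness from above; note that n now counts pairs, so each call costs \(2N\) forward passes.

```python
def es_gradient_antithetic(theta, sigma=0.1, n=50, rng=rng):
    """(1 / (2*N*sigma)) * sum_i (F(theta + sigma*eps_i) - F(theta - sigma*eps_i)) * eps_i."""
    eps = rng.normal(size=(n, theta.size))
    diffs = np.array([fitness(theta + sigma * e) - fitness(theta - sigma * e)
                      for e in eps])
    return (eps * diffs[:, None]).mean(axis=0) / (2 * sigma)
```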

If we want a drop-in recipe for gradient descent:

  • Replace “backward pass” with “evaluate \(2N\) forward passes with perturbed weights”.
  • Replace “gradient tensor” with “weighted sum of perturbation tensors”.

What stays the same:

  • We still run forward passes.
  • We still optimize a scalar objective.
  • We still do iterative updates.

What changes:

  1. No computational graph / no chain rule. Our model can include discrete ops, integer arithmetic, if statements, simulators, non-differentiable reward shaping—anything.
  2. We pay in evaluations. ES typically needs a population of evaluations per update to get a decent gradient estimate.
  3. Parallelism looks different. Each population member is independent: evaluate on different devices, report back one scalar fitness value each, combine centrally (or via all-reduce on scalars).

A useful way to frame it:

  • Backprop cost is “one forward + one backward” (plus distributed gradient comms).
  • ES cost is “many forwards, no backward” (plus scalar comms; one common seed-sharing pattern for those comms is sketched below).
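One common way to realise the “scalar comms” part, sketched here on my own initiative rather than quoted from anything above, is to share only seeds and fitnesses: each worker regenerates its perturbation from a known seed, and the coordinator rebuilds every \(\varepsilon\) the same way, so nothing bigger than a scalar ever crosses the network.

```python
def worker_evaluate(theta, sigma, seed):
    """On a worker: regenerate this worker's eps from its seed, return only a scalar."""
    eps = np.random.default_rng(seed).normal(size=theta.size)
    return fitness(theta + sigma * eps)

def coordinator_update(theta, sigma, eta, seeds, scores):
    """Centrally: rebuild each eps from its seed and combine the scalars into an update."""
    g = np.zeros_like(theta)
    for seed, F in zip(seeds, scores):
        eps = np.random.default_rng(seed).normal(size=theta.size)
        g += F * eps
    return theta + eta * g / (len(seeds) * sigma)

# In a real deployment the comprehension below would be a scatter/gather across
# devices; sequential here just to show that only scalars move around.
seeds = list(range(100))
scores = [worker_evaluate(theta, 0.1, s) for s in seeds]
theta = coordinator_update(theta, 0.1, 0.05, seeds, scores)
```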

So ES tends to make sense when

  • forward passes are cheap or massively parallelizable,
  • backward passes are awkward/expensive/impossible,
  • the objective is noisy or outcome-only,
  • we want robustness to nasty landscapes (e.g., long-horizon RL, recurrent instability).

And ES tends to be a bad deal when

  • gradients are available and stable,
  • we’re compute-limited and can’t parallelize lots of forwards,
  • we need the sample efficiency of gradient methods.

5 Practical ES details

An LLM advises me that I’ll need to stabilize variance using some (standard) tricks:

  1. Fitness normalization / shaping. Replace raw \(F_i\) with normalized scores (subtract mean and divide by std), or rank-transform them. This reduces sensitivity to outliers and scale; see the sketch after this list.
  2. Common random numbers. Use the same minibatch/data for evaluating the population in one generation so that differences are due to parameters, not data noise. This sounds like a spanner in the works for me — see my argument with an LLM under large data.
  3. Control variates / baselines. The antithetic form is one; subtracting the mean fitness is another.
  4. Step size and \(\sigma\) scheduling. \(\sigma\) is both a smoothing radius and a probe scale, so it affects both the bias and the variance of the update.
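A sketch of the first two tricks (rank-based shaping and common random numbers), reusing the toy fitness above. With rank shaping the update is no longer an unbiased estimate of \(\nabla J\); that is the usual trade-off.

```python
def centered_ranks(F):
    """Map fitnesses to evenly spaced values in [-0.5, 0.5] by rank."""
    ranks = np.empty(len(F))
    ranks[np.argsort(F)] = np.arange(len(F))
    return ranks / (len(F) - 1) - 0.5

def es_gradient_shaped(theta, sigma=0.1, n=50, batch=64, rng=rng):
    # Common random numbers: one shared minibatch per generation, so fitness
    # differences reflect the perturbations rather than data noise.
    idx = rng.choice(len(X), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    eps = rng.normal(size=(n, theta.size))
    F = np.array([fitness(theta + sigma * e, Xb, yb) for e in eps])
    w = centered_ranks(F)      # shaping: insensitive to the scale of F and to outliers
    return (eps * w[:, None]).mean(axis=0) / sigma
```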

So this \(\sigma\) is a free parameter. What does it do?

  • ES is optimizing \(J(\theta)=\mathbb{E}[F(\theta+\sigma\varepsilon)]\).
  • If \(\sigma\) is large, we are not optimizing the original objective tightly; we are optimizing a blurred version of it.
  • If \(\sigma\) is tiny, the estimator variance can blow up unless \(N\) is large (see the toy demonstration below).
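A quick 1-D check of the variance half of that trade-off. For the quadratic \(F(t)=-t^2\) the smoothed gradient equals the true gradient (\(-2\) at \(t=1\)) for every \(\sigma\), so any spread below is pure estimator noise; watch it blow up as \(\sigma\) shrinks with \(N\) fixed.

```python
import numpy as np

def toy_grad(sigma, n, rng):
    """Plain (un-baselined) ES gradient estimate for F(t) = -t**2 at t = 1."""
    eps = rng.normal(size=n)
    F = -(1.0 + sigma * eps) ** 2
    return (F * eps).mean() / sigma

for sigma in (0.01, 0.1, 1.0):
    ests = [toy_grad(sigma, 100, np.random.default_rng(s)) for s in range(200)]
    print(f"sigma={sigma:5.2f}  mean={np.mean(ests):+6.2f}  sd={np.std(ests):6.2f}")
```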

6 ES meets Bayes

ES looks “Bayesian” in the sense that it maintains and updates a probability distribution. And it reuses many classic Bayesian sampling tricks such as quasi-Langevin dynamics. Let us make the correspondence more precise, though.

  1. Bayesian optimization (BO) puts a Bayesian posterior on the unknown objective function \(f(\cdot)\) (often a GP), then chooses evaluation points via an acquisition rule balancing exploration/exploitation (Brochu, Cora, and de Freitas 2010).
  2. Evolution Strategies (ES) (and especially NES/CMA-ES-style methods) put a distribution on candidate solutions \(\theta\) (or search steps), and update that distribution using fitness evaluations. There is typically no posterior over the function \(f\) itself.

What follows are some ways the “Bayes connection” has been made precise in the literature.

6.1 What is the “smoothing / search-device” distribution?

In vanilla ES (and in “variational optimization”), we introduce an auxiliary distribution over parameters and optimize the expected objective under that distribution. This distribution is not a Bayesian posterior but rather is something we choose as part of the optimizer.

Let \(F(\theta)\) be the fitness we want to maximize (e.g. negative loss). Pick a parametric search distribution \[ q_\phi(\theta) \] and define the ES/VO objective \[ J(\phi) \;=\; \mathbb{E}_{\theta\sim q_\phi}\big[F(\theta)\big]. \]

This distribution plays two roles:

  • Search device: sampling \(\theta\sim q_\phi\) generates the “population” of candidate networks we evaluate.
  • Smoothing device: \(J(\phi)\) is a smoothed version of \(F\), because it averages \(F\) locally according to \(q_\phi\).

The common “vanilla ES” choice is an isotropic Gaussian centred on a mean parameter vector \(\mu\): \[ q_\phi(\theta)=\mathcal{N}(\theta;\mu,\sigma^2 I), \quad \phi=(\mu,\sigma). \] Then the smoothed objective, as a function of \(\mu\), is: \[ \tilde F(\mu) \;=\; \mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\big[F(\mu+\sigma\varepsilon)\big] \;=\; (F * \mathcal{N}(0,\sigma^2 I))(\mu), \] i.e., a convolution with a Gaussian kernel: bigger \(\sigma\) means heavier smoothing.

\(q_\phi\) is an algorithmic knob (where we probe, how widely we average), not a belief distribution.

6.1.1 Contrast with Bayesian posterior inference

In Bayesian inference, we specify a generative model and get a posterior \[ p(\theta\mid D) \propto p(D\mid \theta)\,p(\theta), \] which represents uncertainty about \(\theta\) given data.

In ES/VO we do not posit a likelihood \(p(D\mid\theta)\) (or a prior) and we do not try to approximate \(p(\theta\mid D)\). We just choose \(q_\phi\) so that:

  • sampling from it gives us candidate parameters to test, and
  • adjusting \(\phi\) increases \(\mathbb{E}_{\theta\sim q_\phi}[F(\theta)]\).

So \(q_\phi\) is a proposal/search distribution (and a smoothing kernel), not a posterior approximation.

If we later repurpose ES/NES to optimize an ELBO for variational Bayes, then we really do have a posterior target—but that’s a different use case than “ES as an optimizer.”

6.2 ES as KL optimization in distribution space

A clean formalism is to treat ES as optimizing over a parametric search distribution \(p_\phi(\theta)\) rather than over a single point \(\theta\). We define the expected fitness \[ J(\phi) = \mathbb{E}_{\theta \sim p_\phi}[F(\theta)] \] and then update \(\phi\) to increase \(J(\phi)\). Natural Evolution Strategies (NES) use the natural gradient for this update, which makes the step invariant to reparametrizations of \(\phi\) (Wierstra et al. 2014).

The Information-Geometric Optimization (IGO) framework generalizes this: it defines a canonical natural-gradient flow for any smooth parametric family \(p_\phi\), and recovers NES and variants of CMA-ES as special cases (Ollivier et al. 2017).

The “Bayesian flavour” here is geometric: we can view each update as nudging the distribution a small amount in KL/Fisher distance while improving expected (quantile-shaped) fitness. That is mathematically close to the trust-region and natural-gradient methods used in variational inference, even though we’re aiming for “good solutions” rather than a data posterior.
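A minimal sketch of the mean-only case under the isotropic Gaussian family, keeping \(\sigma\) fixed: the Fisher information for \(\mu\) is \(I/\sigma^2\), so the natural gradient is \(\sigma^2\) times the vanilla gradient, i.e. \(\sigma\) times the utility-weighted mean of the perturbations. This is my own simplification for illustration; full NES and CMA-ES also adapt the covariance (Wierstra et al. 2014; Hansen and Ostermeier 2001). It reuses centered_ranks and fitness from the earlier sketches.

```python
def nes_mean_update(mu, sigma=0.1, n=50, eta=0.1, rng=rng):
    """One mean-only NES step under N(mu, sigma^2 I) with sigma held fixed.
    Fisher information for mu is I / sigma^2, so the natural gradient is
    sigma^2 * (vanilla gradient), estimated here as sigma * mean(u_k * eps_k)."""
    eps = rng.normal(size=(n, mu.size))
    F = np.array([fitness(mu + sigma * e) for e in eps])
    u = centered_ranks(F)                               # rank-based utilities
    natural_grad = sigma * (eps * u[:, None]).mean(axis=0)
    return mu + eta * natural_grad
```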

6.3 ES as “variational optimization”

Another explicit bridge is variational optimization (VO): we replace a hard (possibly non-differentiable) objective \(F(\theta)\) with its expectation under a noise distribution, \[ \tilde F(\mu) = \mathbb{E}_{\varepsilon \sim q} [F(\mu + \sigma \varepsilon)], \] and then optimize \(\tilde F\). This is exactly the Gaussian smoothing view of ES, and we can write the gradient using a score-function identity (no backprop through \(F\)) (Staines and Barber 2012).
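Under the Gaussian choice \(q=\mathcal{N}(\mu,\sigma^2 I)\), both gradients have score-function estimators: the \(\mu\) one is the familiar identity above, and the \(\sigma\) one follows from \(\partial_\sigma \log q = (\lVert\varepsilon\rVert^2 - d)/\sigma\). A sketch with a mean-fitness baseline, reusing fitness from above; parametrizing by \(\log\sigma\) keeps \(\sigma\) positive.

```python
def vo_gradients(mu, log_sigma, n=50, rng=rng):
    """Score-function gradients of J(mu, sigma) = E[F(mu + sigma*eps)]
    w.r.t. mu and log(sigma), for an isotropic Gaussian search distribution."""
    sigma = np.exp(log_sigma)
    d = mu.size
    eps = rng.normal(size=(n, d))
    F = np.array([fitness(mu + sigma * e) for e in eps])
    Fc = F - F.mean()                           # baseline: subtract mean fitness to cut variance
    grad_mu = (eps * Fc[:, None]).mean(axis=0) / sigma
    # d/d(sigma) log q = (||eps||^2 - d) / sigma, and d/d(log sigma) = sigma * d/d(sigma)
    grad_log_sigma = (Fc * ((eps ** 2).sum(axis=1) - d)).mean()
    return grad_mu, grad_log_sigma
```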

Why it looks like Bayes: VO and variational inference both optimize an objective defined by an expectation under a parameterized distribution, and both naturally invoke score-function and natural-gradient machinery. The difference is what we’re approximating: in VO/ES we’re not approximating a posterior \(p(\theta\mid \text{data})\); we’re using a distribution as a smoothing / search device.

6.4 Using NES/ES as a tool for Bayesian posterior inference

The most literal connection is using ES/NES as the gradient estimator inside variational Bayesian inference. Here the objective is explicitly Bayesian (e.g. the ELBO), but gradients may be hard to get (non-reparameterizable families, awkward simulators, etc.). Amin (2023) proposes an NES-based black-box estimator for stochastic variational inference (including sequential settings), positioning NES as the workhorse for optimizing a variational posterior when standard low-variance gradients aren’t available.

7 Incoming

8 References

Amin. 2023. “Natural Evolution Strategies as a Black Box Estimator for Stochastic Variational Inference.”
Benhamou, Saltiel, Vérel, et al. 2019. “BCMA-ES: A Bayesian Approach to CMA-ES.”
Beyer. 1995. “Toward a Theory of Evolution Strategies: Self-Adaptation.” Evolutionary Computation.
Beyer, and Schwefel. 2002. “Evolution Strategies – A Comprehensive Introduction.” Natural Computing.
Brochu, Cora, and de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.”
Floreano, and Mattiussi. 2008. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies (Intelligent Robotics and Autonomous Agents).
Gallagher, Wood, Keith, et al. 2007. “Bayesian Inference in Estimation of Distribution Algorithms.” In Proceedings of the IEEE Congress on Evolutionary Computation (CEC).
Hansen. 2023. “The CMA Evolution Strategy: A Tutorial.”
Hansen, and Ostermeier. 2001. “Completely Derandomized Self-Adaptation in Evolution Strategies.” Evolutionary Computation.
Lenc, Elsen, Schaul, et al. 2019. “Non-Differentiable Supervised Learning with Evolution Strategies and Hybrid Methods.”
Ollivier, Arnold, Auger, et al. 2017. “Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles.” Journal of Machine Learning Research.
Rechenberg. 1978. “Evolutionsstrategien.” In Simulationsmethoden in Der Medizin Und Biologie. Medizinische Informatik Und Statistik.
Staines, and Barber. 2012. “Variational Optimization.”
Vanchurin, Wolf, Katsnelson, et al. 2021. “Towards a Theory of Evolution as Multilevel Learning.”
Whitley, Starkweather, and Bogart. 1990. “Genetic Algorithms and Neural Networks: Optimizing Connections and Connectivity.” Parallel Computing.
Wierstra, Schaul, Glasmachers, et al. 2014. “Natural Evolution Strategies.” Journal of Machine Learning Research.