Evolution strategies

Slimmest possible implementation of “evolutionary” optimization

2026-02-16 — 2026-02-18

Wherein a Gaussian-smoothed fitness is sought without backpropagation, by many parallel forward evaluations with antithetic perturbations, and updates are formed from scalar rewards.

Bayes
dynamical systems
likelihood free
linear algebra
machine learning
Monte Carlo
neural nets
nonparametric
particle
probability
sciml
signal processing
sparser than thou
statistics
statmech
uncertainty
distributed
optimization
probabilistic algorithms

Evolution Strategies are a nature-inspired ensemble approach to optimization. I ignored this family for ages, but it includes many interesting and suggestive ideas. It’s also not much like other, more earnest attempts to copy evolutionary behaviour, such as the genetic programming approach to symbolic regression, which has not been that useful. Instead, it’s perhaps best thought of as a clever Monte Carlo method for optimization. It resembles Stein variational gradient descent if you squint at it funny.

ES (Evolution Strategies) is a relatively generic optimization method, but let’s talk about it in terms of neural networks, because that is what I’m thinking about, and also what the elderly literature needs updating for.

1 Evolution Strategies for beginners

If we’ve trained neural networks with backprop, we’re used to this workflow:

  1. Pick a loss \(L(\theta)\) for parameters \(\theta\).
  2. Compute \(\nabla_\theta L(\theta)\) by differentiating through the network.
  3. Update \(\theta \leftarrow \theta - \eta \nabla_\theta L(\theta)\).

Evolution Strategies are an alternative way of handling step 2, with the remarkable possibility of working when backprop is painful or impossible: when the system is non-differentiable, gradients are too noisy, the “model” includes discrete choices or simulator code, or we want to optimize something we only observe as an outcome (a score, a reward, an accuracy).

The neat trick is that ES can still look “gradient-like” in the end: we perturb parameters, evaluate the perturbed models, and combine those evaluations into an update that points (in expectation) in a direction of improvement. This technique was used in parallel with backprop for much of the field’s history, before backprop became dominant in the 2000s (Beyer 1995; Beyer and Schwefel 2002; Rechenberg 1978).

2 A running example

Let us consider the challenge of training a tiny neural net without backprop.

Suppose we have a simple classifier \(f_\theta(x)\) (say, a two-layer MLP). The standard supervised objective is:

\[ L(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell(f_\theta(x), y)\big]. \]

Backpropagation gives us \(\nabla_\theta L(\theta)\). ES assumes we won’t use it, and instead treats “run the model forward, then compute the loss” as a black box we can call:

  • Input: the parameters \(\theta\)
  • Output: a scalar score (loss, reward, accuracy, etc.)

To match the usual ES maximization framing, we define a fitness function \(F(\theta)\) that we maximize; for example:

\[ F(\theta) = -L(\theta). \]
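To make the later snippets concrete, here is a minimal numpy sketch of such a black-box fitness: a two-layer MLP evaluated forward-only on toy data, returning the mean log-likelihood as \(F(\theta)\). Everything here (shapes, the toy data, the flatten/unflatten convention) is an arbitrary choice for illustration, not any particular library’s API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a tiny two-layer MLP, purely as a stand-in "black box".
X = rng.normal(size=(256, 10))                    # 256 examples, 10 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # an easy binary label
D_IN, D_HID, D_OUT = 10, 16, 2
N_PARAMS = D_IN * D_HID + D_HID + D_HID * D_OUT + D_OUT

def unflatten(theta):
    """Slice a flat parameter vector into the MLP's weights and biases."""
    i = 0
    W1 = theta[i:i + D_IN * D_HID].reshape(D_IN, D_HID); i += D_IN * D_HID
    b1 = theta[i:i + D_HID]; i += D_HID
    W2 = theta[i:i + D_HID * D_OUT].reshape(D_HID, D_OUT); i += D_HID * D_OUT
    b2 = theta[i:i + D_OUT]
    return W1, b1, W2, b2

def fitness(theta, X=X, y=y):
    """Forward pass only: mean log-likelihood, i.e. F(theta) = -L(theta)."""
    W1, b1, W2, b2 = unflatten(theta)
    h = np.tanh(X @ W1 + b1)
    logits = h @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)                 # stabilize softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return logp[np.arange(len(y)), y].mean()
```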

The ES loop goes like this:

  1. Sample random perturbations \(\varepsilon_i\).
  2. Evaluate fitness at the perturbed parameters \(\theta + \sigma \varepsilon_i\).
  3. Combine the results into an update for \(\theta\).

Here \(\sigma>0\) is the “noise scale” (how far we probe).

3 Basic version

A standard choice is Gaussian perturbations (why is it always Gaussian, anyway? It feels like SGLD):

\[ \varepsilon \sim \mathcal{N}(0, I). \]

Let’s define the smoothed objective:

\[ J(\theta) = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}\big[F(\theta + \sigma \varepsilon)\big]. \]

Backprop optimizes \(F(\theta)\) directly (when it can). ES typically optimizes a Gaussian-smoothed version, \(J(\theta)\).

We’ve seen this before in NN training by Kalman filters, which also optimizes a smoothed objective, but with a fixed update ensemble.

We often point to that smoothing as having lots of benefits—it makes discontinuities and jagged landscapes more manageable, at the cost of a bias we control with \(\sigma\).

Fun fact: \(J(\theta)\) has a gradient even if \(F\) isn’t differentiable, and we can even write it without differentiating through \(F\):

\[ \nabla_\theta J(\theta) = \frac{1}{\sigma}\,\mathbb{E}_{\varepsilon}\big[F(\theta + \sigma \varepsilon)\,\varepsilon\big]. \]

This is the score-function / log-derivative (likelihood-ratio) trick specialized to Gaussians. We can read this as:

  • “If perturbations in direction \(\varepsilon\) tend to increase fitness, we move \(\theta\) in direction \(\varepsilon\).”

A Monte Carlo estimate with \(N\) samples is:

\[ \hat{g} = \frac{1}{N\sigma}\sum_{i=1}^N F(\theta + \sigma \varepsilon_i)\,\varepsilon_i, \qquad \theta \leftarrow \theta + \eta \hat{g}. \]
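A direct numpy translation of that estimator and update, reusing the fitness sketch above; \(\sigma\), \(N\), \(\eta\), and the step count are arbitrary illustration values. As written (no baseline, no antithetic pairs) the estimate is quite noisy, which is exactly what the variance-reduction tricks below address.

```python
def es_gradient(theta, sigma=0.1, n=100, rng=rng):
    """Monte Carlo estimate: (1 / (N*sigma)) * sum_i F(theta + sigma*eps_i) * eps_i."""
    eps = rng.normal(size=(n, theta.size))
    F = np.array([fitness(theta + sigma * e) for e in eps])
    return (eps * F[:, None]).mean(axis=0) / sigma

theta = 0.1 * rng.normal(size=N_PARAMS)
eta = 0.05
for step in range(100):
    theta = theta + eta * es_gradient(theta)   # gradient *ascent* on the smoothed fitness
```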

4 Variance reduction by antithetic sampling

A variance-reduction trick we can almost always use is antithetic pairs. For each \(\varepsilon_i\), evaluate both \(\theta+\sigma\varepsilon_i\) and \(\theta-\sigma\varepsilon_i\). Then take the difference between them:

\[ \hat{g}_{\pm} = \frac{1}{2N\sigma}\sum_{i=1}^N \big(F(\theta + \sigma \varepsilon_i) - F(\theta - \sigma \varepsilon_i)\big)\,\varepsilon_i. \]

Intuition: symmetric noise cancels a lot of baseline effects and reduces estimator variance.
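The same estimator with antithetic pairs, again reusing fitness from above; note that n now counts pairs, so each call costs \(2N\) forward passes.

```python
def es_gradient_antithetic(theta, sigma=0.1, n=50, rng=rng):
    """(1 / (2*N*sigma)) * sum_i (F(theta + sigma*eps_i) - F(theta - sigma*eps_i)) * eps_i."""
    eps = rng.normal(size=(n, theta.size))
    diffs = np.array([fitness(theta + sigma * e) - fitness(theta - sigma * e)
                      for e in eps])
    return (eps * diffs[:, None]).mean(axis=0) / (2 * sigma)
```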

If we want a drop-in recipe for gradient descent:

  • Replace “backward pass” with “evaluate \(2N\) forward passes with perturbed weights”.
  • Replace “gradient tensor” with “weighted sum of perturbation tensors”.

What stays the same:

  • We still run forward passes.
  • We still optimize a scalar objective.
  • We still do iterative updates.

What changes:

  1. No computational graph / no chain rule. Our model can include discrete ops, integer arithmetic, if statements, simulators, non-differentiable reward shaping—anything.
  2. We pay in evaluations. ES typically needs a population of evaluations per update to get a decent gradient estimate.
  3. Parallelism looks different. Each population member is independent: evaluate on different devices, report back one scalar fitness value each, combine centrally (or via all-reduce on scalars).

A useful way to frame it:

  • Backprop cost is “one forward + one backward” (plus distributed gradient comms).
  • ES cost is “many forwards, no backward” (plus scalar comms; one common seed-sharing pattern for those comms is sketched below).
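One common way to realise the “scalar comms” part, sketched here on my own initiative rather than quoted from anything above, is to share only seeds and fitnesses: each worker regenerates its perturbation from a known seed, and the coordinator rebuilds every \(\varepsilon\) the same way, so nothing bigger than a scalar ever crosses the network.

```python
def worker_evaluate(theta, sigma, seed):
    """On a worker: regenerate this worker's eps from its seed, return only a scalar."""
    eps = np.random.default_rng(seed).normal(size=theta.size)
    return fitness(theta + sigma * eps)

def coordinator_update(theta, sigma, eta, seeds, scores):
    """Centrally: rebuild each eps from its seed and combine the scalars into an update."""
    g = np.zeros_like(theta)
    for seed, F in zip(seeds, scores):
        eps = np.random.default_rng(seed).normal(size=theta.size)
        g += F * eps
    return theta + eta * g / (len(seeds) * sigma)

# In a real deployment the comprehension below would be a scatter/gather across
# devices; sequential here just to show that only scalars move around.
seeds = list(range(100))
scores = [worker_evaluate(theta, 0.1, s) for s in seeds]
theta = coordinator_update(theta, 0.1, 0.05, seeds, scores)
```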

So ES tends to make sense when

  • forward passes are cheap or massively parallelizable,
  • backward passes are awkward/expensive/impossible,
  • the objective is noisy or outcome-only,
  • we want robustness to nasty landscapes (e.g., long-horizon RL, recurrent instability).

And ES tends to be a bad deal when

  • gradients are available and stable,
  • we’re compute-limited and can’t parallelize lots of forwards,
  • we need the sample efficiency of gradient methods.

5 Practical ES details

An LLM advises me that I’ll need to stabilize variance using some (standard) tricks:

  1. Fitness normalization / shaping. Replace raw \(F_i\) with normalized scores (subtract mean and divide by std), or rank-transform them. This reduces sensitivity to outliers and scale; see the sketch after this list.
  2. Common random numbers. Use the same minibatch/data for evaluating the population in one generation so that differences are due to parameters, not data noise. This sounds like a spanner in the works for me — see my argument with an LLM under large data.
  3. Control variates / baselines. The antithetic form is one; subtracting the mean fitness is another.
  4. Step size and \(\sigma\) scheduling. \(\sigma\) is both a smoothing radius and a probe scale, so it affects both the bias and the variance of the update.
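A sketch of the first two tricks (rank-based shaping and common random numbers), reusing the toy fitness above. With rank shaping the update is no longer an unbiased estimate of \(\nabla J\); that is the usual trade-off.

```python
def centered_ranks(F):
    """Map fitnesses to evenly spaced values in [-0.5, 0.5] by rank."""
    ranks = np.empty(len(F))
    ranks[np.argsort(F)] = np.arange(len(F))
    return ranks / (len(F) - 1) - 0.5

def es_gradient_shaped(theta, sigma=0.1, n=50, batch=64, rng=rng):
    # Common random numbers: one shared minibatch per generation, so fitness
    # differences reflect the perturbations rather than data noise.
    idx = rng.choice(len(X), size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    eps = rng.normal(size=(n, theta.size))
    F = np.array([fitness(theta + sigma * e, Xb, yb) for e in eps])
    w = centered_ranks(F)      # shaping: insensitive to the scale of F and to outliers
    return (eps * w[:, None]).mean(axis=0) / sigma
```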

So this \(\sigma\) is a free parameter. What does it do?

  • ES is optimizing \(J(\theta)=\mathbb{E}[F(\theta+\sigma\varepsilon)]\).
  • If \(\sigma\) is large, we are not optimizing the original objective tightly; we are optimizing a blurred version of it.
  • If \(\sigma\) is tiny, the estimator variance can blow up unless \(N\) is large (see the toy demonstration below).
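A quick 1-D check of the variance half of that trade-off. For the quadratic \(F(t)=-t^2\) the smoothed gradient equals the true gradient (\(-2\) at \(t=1\)) for every \(\sigma\), so any spread below is pure estimator noise; watch it blow up as \(\sigma\) shrinks with \(N\) fixed.

```python
import numpy as np

def toy_grad(sigma, n, rng):
    """Plain (un-baselined) ES gradient estimate for F(t) = -t**2 at t = 1."""
    eps = rng.normal(size=n)
    F = -(1.0 + sigma * eps) ** 2
    return (F * eps).mean() / sigma

for sigma in (0.01, 0.1, 1.0):
    ests = [toy_grad(sigma, 100, np.random.default_rng(s)) for s in range(200)]
    print(f"sigma={sigma:5.2f}  mean={np.mean(ests):+6.2f}  sd={np.std(ests):6.2f}")
```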

6 ES meets Bayes

ES looks “Bayesian” in the sense that it maintains and updates a probability distribution. And it reuses many classic Bayesian sampling tricks such as quasi-Langevin dynamics. Let us make the correspondence more precise, though.

  1. Bayesian optimization (BO) puts a Bayesian posterior on the unknown objective function \(f(\cdot)\) (often a GP), then chooses evaluation points via an acquisition rule balancing exploration/exploitation (Brochu, Cora, and de Freitas 2010).
  2. Evolution Strategies (ES) (and especially NES/CMA-ES-style methods) put a distribution on candidate solutions \(\theta\) (or search steps), and update that distribution using fitness evaluations. There is typically no posterior over the function \(f\) itself.

What follows are some ways the “Bayes connection” has been made precise in the literature.

6.1 What is the “smoothing / search-device” distribution?

In vanilla ES (and in “variational optimization”), we introduce an auxiliary distribution over parameters and optimize the expected objective under that distribution. This distribution is not a Bayesian posterior but rather is something we choose as part of the optimizer.

Let \(F(\theta)\) be the fitness we want to maximize (e.g. negative loss). Pick a parametric search distribution \[ q_\phi(\theta) \] and define the ES/VO objective \[ J(\phi) \;=\; \mathbb{E}_{\theta\sim q_\phi}\big[F(\theta)\big]. \]

This distribution plays two roles:

  • Search device: sampling \(\theta\sim q_\phi\) generates the “population” of candidate networks we evaluate.
  • Smoothing device: \(J(\phi)\) is a smoothed version of \(F\), because it averages \(F\) locally according to \(q_\phi\).

The common “vanilla ES” choice is an isotropic Gaussian centred on a mean parameter vector \(\mu\): \[ q_\phi(\theta)=\mathcal{N}(\theta;\mu,\sigma^2 I), \quad \phi=(\mu,\sigma). \] Then the smoothed objective, as a function of \(\mu\), is: \[ \tilde F(\mu) \;=\; \mathbb{E}_{\varepsilon\sim\mathcal{N}(0,I)}\big[F(\mu+\sigma\varepsilon)\big] \;=\; (F * \mathcal{N}(0,\sigma^2 I))(\mu), \] i.e., a convolution with a Gaussian kernel: bigger \(\sigma\) means heavier smoothing.

\(q_\phi\) is an algorithmic knob (where we probe, how widely we average), not a belief distribution.

6.1.1 Contrast with Bayesian posterior inference

In Bayesian inference, we specify a generative model and get a posterior \[ p(\theta\mid D) \propto p(D\mid \theta)\,p(\theta), \] which represents uncertainty about \(\theta\) given data.

In ES/VO we do not posit a likelihood \(p(D\mid\theta)\) (or a prior) and we do not try to approximate \(p(\theta\mid D)\). We just choose \(q_\phi\) so that:

  • sampling from it gives us candidate parameters to test, and
  • adjusting \(\phi\) increases \(\mathbb{E}_{\theta\sim q_\phi}[F(\theta)]\).

So \(q_\phi\) is a proposal/search distribution (and a smoothing kernel), not a posterior approximation.

If we later repurpose ES/NES to optimize an ELBO for variational Bayes, then we really do have a posterior target—but that’s a different use case than “ES as an optimizer.”

6.2 ES as KL optimization in distribution space

A clean formalism is to treat ES as optimizing over a parametric search distribution \(p_\phi(\theta)\) rather than over a single point \(\theta\). We define the expected fitness \[ J(\phi) = \mathbb{E}_{\theta \sim p_\phi}[F(\theta)] \] and then update \(\phi\) to increase \(J(\phi)\). Natural Evolution Strategies (NES) use the natural gradient for this update, which makes the step invariant to reparametrizations of \(\phi\) (Wierstra et al. 2014).

The Information-Geometric Optimization (IGO) framework generalizes this: it defines a canonical natural-gradient flow for any smooth parametric family \(p_\phi\), and recovers NES and variants of CMA-ES as special cases (Ollivier et al. 2017).

The “Bayesian flavour” here is geometric: we can view each update as nudging the distribution a small amount in KL/Fisher distance while improving expected (quantile-shaped) fitness. That is mathematically close to the trust-region and natural-gradient methods used in variational inference, even though we’re aiming for “good solutions” rather than a data posterior.
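A minimal sketch of the mean-only case under the isotropic Gaussian family, keeping \(\sigma\) fixed: the Fisher information for \(\mu\) is \(I/\sigma^2\), so the natural gradient is \(\sigma^2\) times the vanilla gradient, i.e. \(\sigma\) times the utility-weighted mean of the perturbations. This is my own simplification for illustration; full NES and CMA-ES also adapt the covariance (Wierstra et al. 2014; Hansen and Ostermeier 2001). It reuses centered_ranks and fitness from the earlier sketches.

```python
def nes_mean_update(mu, sigma=0.1, n=50, eta=0.1, rng=rng):
    """One mean-only NES step under N(mu, sigma^2 I) with sigma held fixed.
    Fisher information for mu is I / sigma^2, so the natural gradient is
    sigma^2 * (vanilla gradient), estimated here as sigma * mean(u_k * eps_k)."""
    eps = rng.normal(size=(n, mu.size))
    F = np.array([fitness(mu + sigma * e) for e in eps])
    u = centered_ranks(F)                               # rank-based utilities
    natural_grad = sigma * (eps * u[:, None]).mean(axis=0)
    return mu + eta * natural_grad
```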

6.3 ES as “variational optimization”

Another explicit bridge is variational optimization (VO): we replace a hard (possibly non-differentiable) objective \(F(\theta)\) with its expectation under a noise distribution, \[ \tilde F(\mu) = \mathbb{E}_{\varepsilon \sim q} [F(\mu + \sigma \varepsilon)], \] and then optimize \(\tilde F\). This is exactly the Gaussian smoothing view of ES, and we can write the gradient using a score-function identity (no backprop through \(F\)) (Staines and Barber 2012).
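Under the Gaussian choice \(q=\mathcal{N}(\mu,\sigma^2 I)\), both gradients have score-function estimators: the \(\mu\) one is the familiar identity above, and the \(\sigma\) one follows from \(\partial_\sigma \log q = (\lVert\varepsilon\rVert^2 - d)/\sigma\). A sketch with a mean-fitness baseline, reusing fitness from above; parametrizing by \(\log\sigma\) keeps \(\sigma\) positive.

```python
def vo_gradients(mu, log_sigma, n=50, rng=rng):
    """Score-function gradients of J(mu, sigma) = E[F(mu + sigma*eps)]
    w.r.t. mu and log(sigma), for an isotropic Gaussian search distribution."""
    sigma = np.exp(log_sigma)
    d = mu.size
    eps = rng.normal(size=(n, d))
    F = np.array([fitness(mu + sigma * e) for e in eps])
    Fc = F - F.mean()                           # baseline: subtract mean fitness to cut variance
    grad_mu = (eps * Fc[:, None]).mean(axis=0) / sigma
    # d/d(sigma) log q = (||eps||^2 - d) / sigma, and d/d(log sigma) = sigma * d/d(sigma)
    grad_log_sigma = (Fc * ((eps ** 2).sum(axis=1) - d)).mean()
    return grad_mu, grad_log_sigma
```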

Why it looks like Bayes: VO and variational inference both optimize an objective defined by an expectation under a parameterized distribution, and both naturally invoke score-function and natural-gradient machinery. The difference is what we’re approximating: in VO/ES we’re not approximating a posterior \(p(\theta\mid \text{data})\); we’re using a distribution as a smoothing / search device.

6.4 Using NES/ES as a tool for Bayesian posterior inference

The most literal connection is using ES/NES as the gradient estimator inside variational Bayesian inference. Here the objective is explicitly Bayesian (e.g. the ELBO), but gradients may be hard to get (non-reparameterizable families, awkward simulators, etc.). Amin (2023) proposes an NES-based black-box estimator for stochastic variational inference (including sequential settings), positioning NES as the workhorse for optimizing a variational posterior when standard low-variance gradients aren’t available.

7 Incoming

8 References

Amin. 2023. “Natural Evolution Strategies as a Black Box Estimator for Stochastic Variational Inference.”
Benhamou, Saltiel, Vérel, et al. 2019. “BCMA-ES: A Bayesian Approach to CMA-ES.”
Beyer. 1995. “Toward a Theory of Evolution Strategies: Self-Adaptation.” Evolutionary Computation.
Beyer, and Schwefel. 2002. “Evolution Strategies – A Comprehensive Introduction.” Natural Computing.
Brochu, Cora, and de Freitas. 2010. “A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning.”
Floreano, and Mattiussi. 2008. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies (Intelligent Robotics and Autonomous Agents).
Gallagher, Wood, Keith, et al. 2007. “Bayesian Inference in Estimation of Distribution Algorithms.” In Proceedings of the IEEE Congress on Evolutionary Computation (CEC).
Hansen. 2023. “The CMA Evolution Strategy: A Tutorial.”
Hansen, and Ostermeier. 2001. “Completely Derandomized Self-Adaptation in Evolution Strategies.” Evolutionary Computation.
Lenc, Elsen, Schaul, et al. 2019. “Non-Differentiable Supervised Learning with Evolution Strategies and Hybrid Methods.”
Ollivier, Arnold, Auger, et al. 2017. “Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles.” Journal of Machine Learning Research.
Rechenberg. 1978. “Evolutionsstrategien.” In Simulationsmethoden in Der Medizin Und Biologie. Medizinische Informatik Und Statistik.
Staines, and Barber. 2012. “Variational Optimization.”
Vanchurin, Wolf, Katsnelson, et al. 2021. “Towards a Theory of Evolution as Multilevel Learning.”
Whitley, Starkweather, and Bogart. 1990. “Genetic Algorithms and Neural Networks: Optimizing Connections and Connectivity.” Parallel Computing.
Wierstra, Schaul, Glasmachers, et al. 2014. “Natural Evolution Strategies.” Journal of Machine Learning Research.