Adaptive design of experiments

I am not going to call it ‘Bayesian optimization’, but that is what everyone else does

2017-04-11 — 2025-02-27

Closely connected to AutoML because surrogate optimisation is quite popular for this, and likewise Bayesian model calibration.

tl;dr

Unless you are improving BO algorithms themselves, or working with a large number (100+) of dimensions, I usually recommend using off-the-shelf Ax and not worrying about the fine details. It has a good API, and it’s powerful. Documentation is improving, and the project is active. It can be deployed on real labs, virtual experiments, and various weird clusters. By default, it usually just works.

1 Problem statement

Depending on your allegiance to hipness, you might credit the original statement of the problem to either Chernoff (1959) or Močkus (1975). Let’s go with the friendly modern version from Gilles Louppe and Manoj Kumar:

We are interested in solving

$x^{*} = \arg\min_{x} f(x)$

under the constraints that

$f$ is a black box for which no closed form is known (nor its gradients);

$f$ is expensive to evaluate;

evaluations of $y = f (x)$ may be noisy.

We might sometimes have access to gradients. In such cases, we will additionally say that, rather than observing $\nabla f$ and $\nabla^{2} f$ directly, we observe random variables $G(x)$ and $H(x)$ with $\mathbb{E}[G(x)] = \nabla f(x)$ and $\mathbb{E}[H(x)] = \nabla^{2} f(x)$, as in stochastic optimization.
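To make the observation model concrete, here is a minimal sketch of a noisy value oracle and a noisy gradient oracle. The quadratic `f`, the Gaussian noise, and the names `noisy_value` and `noisy_gradient` are all illustrative assumptions, not part of the problem statement above.

```python
import random


def f(x: float) -> float:
    """Toy objective (assumed); in practice f is an expensive black box."""
    return (x - 2.0) ** 2


def grad_f(x: float) -> float:
    """True gradient of the toy objective, hidden from the optimiser."""
    return 2.0 * (x - 2.0)


def noisy_value(x: float, rng: random.Random, sigma: float = 0.5) -> float:
    """Observe y = f(x) plus zero-mean Gaussian noise."""
    return f(x) + rng.gauss(0.0, sigma)


def noisy_gradient(x: float, rng: random.Random, sigma: float = 1.0) -> float:
    """Observe G(x), an unbiased estimate of the gradient of f at x."""
    return grad_f(x) + rng.gauss(0.0, sigma)


rng = random.Random(0)
x = 0.0
samples = [noisy_gradient(x, rng) for _ in range(10_000)]
mean_gradient = sum(samples) / len(samples)
print(f"average of G({x}) = {mean_gradient:.3f}; true gradient = {grad_f(x):.3f}")
```

Averaging many draws of `noisy_gradient` recovers the true gradient, which is all the unbiasedness condition asks for.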

This setup resembles reinforcement learning, which faces a comparable explore/exploit trade-off, though I don’t know the exact disciplinary boundaries.

The typical setup here is: We use a surrogate model of the loss surface and optimise that, aiming for a computationally cheaper alternative than evaluating the whole loss surface. An artfully chosen surrogate model can estimate where to sample next, predict unseen loss values, and possibly even give uncertainty estimates.
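The surrogate loop can be sketched in a few lines. What follows is a toy under stated assumptions, not a recommended implementation: a one-dimensional stand-in `objective`, a zero-mean Gaussian process with a fixed RBF kernel and hand-picked lengthscale, a lower-confidence-bound rule for choosing the next point, and a fixed grid of candidates. Real libraries fit the kernel hyperparameters and optimise the acquisition properly.

```python
import numpy as np


def objective(x: float) -> float:
    """Stand-in for the expensive black box (an assumption for illustration)."""
    return float(np.sin(3.0 * x) + 0.5 * x**2)


def rbf_kernel(a: np.ndarray, b: np.ndarray, lengthscale: float = 0.5) -> np.ndarray:
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    chol = np.linalg.cholesky(k_oo)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs))
    mean = k_on.T @ alpha
    v = np.linalg.solve(chol, k_on)
    var = 1.0 - np.sum(v**2, axis=0)  # the RBF prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


rng = np.random.default_rng(0)
candidates = np.linspace(-2.0, 2.0, 201)
x_obs = rng.uniform(-2.0, 2.0, size=3)  # small random initial design
y_obs = np.array([objective(x) for x in x_obs])

for _ in range(12):
    y_mean = y_obs.mean()  # centre the data for the zero-mean GP
    mean, std = gp_posterior(x_obs, y_obs - y_mean, candidates)
    lcb = mean + y_mean - 2.0 * std  # optimistic guess of the loss
    x_next = candidates[np.argmin(lcb)]  # cheap: optimise the surrogate
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))  # expensive: query the black box

best_x = x_obs[np.argmin(y_obs)]
best_y = y_obs.min()
print(f"best x = {best_x:.3f}, best y = {best_y:.3f}")
```

The only expensive call per iteration is `objective(x_next)`. Everything else happens on the cheap surrogate, which is the whole point.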

When the surrogate model is a Bayesian posterior over parameter values we want to learn, it’s often called “Bayesian optimisation.” Gaussian process regression is often used to approximate the loss surface. This isn’t crazy. Early work on GP regression (Krige 1951) was already somewhat optimisation-adjacent.

However, GP regressions aren’t the only possible surrogate models, not even the only possible Bayesian ones, and there’s nothing innately Bayesian about estimating unknown functions. So, there are several ways we can depart from the default. Setting that issue aside, see Exploring Bayesian Optimization by Apoorv Agnihotri and Nipun Batra for a well-illustrated journey into this field.

Fashionable use: hyperparameter and model selection, e.g., regularising complex models, often called AutoML.

We could also use adaptive experiments outside simulations, such as in industrial process control, real labs, mine shafts, and more. I first noticed this idea in sequential ANOVA design. Even though it’s not nearly so hip now, it’s still an incredible idea years after its inception.

Further info in Roman Garnett’s Bayesian Optimization Book (Garnett 2023).

2 Adaptive stopping only

See sequential hypothesis testing.

3 With side information

e.g. SEBO (Chan, Paulson, and Mesbah 2023; S. Liu et al. 2023). To be continued.

4 BORE

Bayesian optimization by density ratio estimation (Oliveira, Tiao, and Ramos 2022; Louis C. Tiao et al. 2021).

Bayesian optimization (BO) is among the most effective and widely-used black-box optimization methods. BO proposes solutions according to an explore-exploit trade-off criterion encoded in an acquisition function, many of which are computed from the posterior predictive of a probabilistic surrogate model. Prevalent among these is the expected improvement (EI). The need to ensure analytical tractability of the predictive often poses limitations that can hinder the efficiency and applicability of BO. In this paper, we cast the computation of EI as a binary classification problem, building on the link between class-probability estimation and density-ratio estimation, and the lesser-known link between density ratios and EI. By circumventing the tractability constraints, this reformulation provides numerous advantages, not least in terms of expressiveness, versatility, and scalability.
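The classification view in that abstract can be sketched as follows. This is not the authors’ implementation. It assumes a toy 1-D `objective`, a logistic regression on quadratic features as the classifier, a "good" fraction `gamma = 0.25`, and random candidate search, all chosen for brevity. The idea, as I read it, is to label the best `gamma` fraction of observations as positives, train a probabilistic classifier, and use its predicted probability of "good" as the acquisition function.

```python
import numpy as np


def objective(x):
    """Toy black box (an assumption for illustration); minimum at x = 0.3."""
    return (x - 0.3) ** 2


def features(x: np.ndarray) -> np.ndarray:
    """Quadratic features [1, x, x^2] for a 1-D input array."""
    return np.stack([np.ones_like(x), x, x**2], axis=1)


def sigmoid(t: np.ndarray) -> np.ndarray:
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def fit_logistic(phi, z, steps=2000, lr=0.5, l2=1e-3):
    """L2-regularised logistic regression by plain gradient descent."""
    w = np.zeros(phi.shape[1])
    for _ in range(steps):
        grad = phi.T @ (sigmoid(phi @ w) - z) / len(z) + l2 * w
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
gamma = 0.25  # fraction of observations labelled "good"
x_obs = rng.uniform(-1.0, 1.0, size=8)
y_obs = objective(x_obs)

for _ in range(15):
    tau = np.quantile(y_obs, gamma)  # threshold separating good from bad
    z = (y_obs <= tau).astype(float)  # class labels: 1 means "good"
    w = fit_logistic(features(x_obs), z)
    candidates = rng.uniform(-1.0, 1.0, size=500)
    scores = sigmoid(features(candidates) @ w)  # estimated p(good | x)
    x_next = candidates[np.argmax(scores)]  # the classifier is the acquisition
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))

best_x = x_obs[np.argmin(y_obs)]
print(f"best x = {best_x:.3f}, best y = {y_obs.min():.5f}")
```

No posterior predictive appears anywhere, so any classifier that outputs calibrated-ish probabilities could be swapped in for the logistic regression.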

5 Lab bandits

Sequential experiment design in the lab.

6 Acquisition functions

More useful terminology: Active learning, acquisition functions. To be continued.

For now, see BoTorch custom acquisition for an explanation by example.
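As a library-free illustration of what an acquisition function computes, here is a sketch of closed-form expected improvement for minimisation. It assumes the surrogate’s prediction at a point is Gaussian with a given mean and standard deviation. The function name and the optional exploration offset `xi` are illustrative choices, not BoTorch’s API.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean: float, std: float, best: float, xi: float = 0.0) -> float:
    """E[max(best - xi - Y, 0)] for Y ~ N(mean, std^2), i.e. EI when minimising."""
    improvement = best - mean - xi
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    return improvement * normal_cdf(z) + std * normal_pdf(z)


best_so_far = 1.0
# (predictive mean, predictive std) at three hypothetical candidate points
for mean, std in [(0.8, 0.05), (1.2, 0.5), (1.0, 1.0)]:
    ei = expected_improvement(mean, std, best_so_far)
    print(f"mean={mean:.2f} std={std:.2f} EI={ei:.4f}")
```

The first term rewards candidates whose predicted mean already beats the incumbent (exploitation). The second rewards predictive uncertainty (exploration).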

7 Connection to RL

To be determined.

8 Wacky

Adaptive design methods I don’t understand because they look not so much black box as out of the box. Quasi-oppositional Differential Evolution (Rahnamayan, Tizhoosh, and Salama 2007) is old and comes from a zany field that cites compass points and Yin-Yang as inspiration (Mahdavi, Rahnamayan, and Deb 2018). Supposedly, it’s powerful and robust (“Dagstuhloid Benchmarking” 2023). What’s going on here?

9 Over large discrete sequences

Challenging for many BO methods but vital in, e.g. biological ML. I’ve seen some interesting ones in this space (González-Duque et al. 2024; Stanton et al. 2024).

Benchmarking HDBO summarises SOTA for life sciences. See poli.

10 Implementations

10.1 BoTorch/Ax

BoTorch is the PyTorch-based Bayesian optimization toolbox used by Ax, an experiment designer wrapped up in a nice API.

Ax is a platform for optimising any kind of experiment, including machine learning experiments, A/B tests, and simulations. Ax can optimise discrete configurations (e.g., variants of an A/B test) using multi-armed bandit optimization and continuous (e.g., integer or floating point)-valued configurations using Bayesian optimization. This makes it suitable for many applications.

Ax has been used for various product, infrastructure, ML, and research applications at Facebook.
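The bandit half of that description, for discrete configurations such as A/B test variants, can be sketched with Beta-Bernoulli Thompson sampling. This is not Ax code and makes no claim about which bandit algorithm Ax uses internally. The conversion rates, the function name `thompson_sampling`, and the uniform Beta(1, 1) priors are assumptions for illustration.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Allocate trials among variants with unknown Bernoulli success rates."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible success rate for each arm from its Beta posterior.
        draws = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        arm = max(range(n_arms), key=draws.__getitem__)
        # Run the "experiment": a simulated Bernoulli outcome for the chosen arm.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


successes, failures = thompson_sampling([0.05, 0.10, 0.30], n_rounds=2000)
pulls = [s + f for s, f in zip(successes, failures)]
print("trials per variant:", pulls)
```

Most of the trial budget should end up on the variant with the highest rate, while the weaker variants still receive occasional exploratory trials.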

I wrote a script to run this on a slurm cluster: Ax + SLURM via submitit and asyncio.

10.2 Nevergrad

Nevergrad - A gradient-free optimization platform

It looks similar to Ax, but I haven’t used it, so I can’t say how it compares.

10.3 Poli

MachineLearningLifeScience/poli-baselines: A collection of objective functions and black box optimization algorithms related to proteins and small molecules
MachineLearningLifeScience/poli: A library of discrete objectives

This is probably what you want if the problem involves optimising long sequences, like DNA strands or sentences.

poli-baselines has many algorithms:

Name	Reference
Random Mutations	N/A
Random hill-climbing	N/A
CMA-ES	pycma
(Fixed-length) Genetic Algorithm	pymoo’s implementation
Hvarfner’s Vanilla BO	Hvarfner et al. 2024
Bounce	Papenmeier et al. 2023
BAxUS	Papenmeier et al. 2022
Probabilistic Reparametrization	Daulton et al. 2022
SAASBO	Eriksson and Jankowiak 2021
ALEBO	Letham et al. 2020
LaMBO2	Gruver and Stanton et al. 2020

[…] This library works well with the discrete objective functions in poli. One example is the ALOHA problem, involving searching 5-letter sequences to spell “ALOHA”. Here’s how to use the RandomMutation solver inside poli-baselines:

from poli.objective_repository import AlohaProblemFactory
from poli_baselines.solvers import RandomMutation

# Create an instance of the problem
problem = AlohaProblemFactory().create()
f, x0 = problem.black_box, problem.x0
y0 = f(x0)

# Create an instance of the solver
solver = RandomMutation(
    black_box=f,
    x0=x0,
    y0=y0,
)

# Run the optimisation for 1000 steps,
# breaking if we find a performance above 5.0.
solver.solve(max_iter=1000, break_at_performance=5.0)

# Check if we got the solution we wanted
print(solver.get_best_solution())  # Should be [["A", "L", "O", "H", "A"]]

10.4 skopt

skopt (aka scikit-optimize)

[…] is a simple and efficient library to minimise (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimisation.

This belongs to the sklearn family, meaning it works well, reliably, predictably, and has amazing tooling, but it’s not fast, and lacks recent enhancements.

10.5 Dragonfly

Dragonfly

…is an open source Python library for scalable Bayesian optimisation.

Bayesian optimization optimises expensive black-box functions. Dragonfly offers tools to scale up Bayesian optimization for large problems, with features for high-dimensional optimization, parallel evaluations, multi-fidelity optimization, and multi-objective optimisation.

It’s open source, written in Python and Fortran.

10.6 PySOT

PySOT

The Surrogate Optimization Toolbox (pySOT) for global deterministic optimisation problems. pySOT is hosted on GitHub.

The main purpose is to optimise expensive black-box objective functions with continuous and/or integer variables, where all variables have bound constraints. The tighter the bounds, the more efficient the algorithms. This toolbox is less efficient for tasks with cheap evaluations.

It has many surrogate options, a long history, and cool features like automatic concurrency, but it hasn’t been updated for years. Perhaps that’s why it’s fallen from favour (Krityakierne, Akhtar, and Shoemaker 2016; Regis and Shoemaker 2013, 2009, 2007). It doesn’t lean much on a Bayesian interpretation of the optimisation.

10.7 GPyOpt

GPyOpt

Gaussian process Optimization using GPy. Performs global optimization with different acquisition functions. You can use GPyOpt to optimise physical experiments (sequentially or in batches) and tune ML algorithms. It’s excellent at handling large datasets through sparse Gaussian process models.

Created by the same lab at Sheffield that brought us GPy.

10.8 Sigopt

sigopt is a commercial product that likely delivers impressive results. Given no pricing information on their website, one suspects it’s quite pricey.

10.9 spearmint

spearmint/spearmint2:

Spearmint is a package for Bayesian optimisation based on (Snoek, Larochelle, and Adams 2012).

The code consists of several parts and is modular, allowing for various ‘driver’ and ‘chooser’ modules. The ‘choosers’ are implementations of acquisition functions like expected improvement or random. The drivers manage experiment distribution and execution on the system. Designed for running parallel experiments (launching new experiments as soon as results come in), it requires some engineering know-how.

Spearmint2 is similar but fancier and more recently updated; however, it has a restrictive licence that prohibits wide redistribution without paying fees. You may or may not want to trust the development and support implied by four Harvard Professors, depending on your application.

Both of the Spearmint options (especially the latter) have opinionated choices of technology stack for their optimizations. This means they can do more for you but require more setup than something simple like skopt. Depending on your computing environment, this might be an overall plus or minus.

10.10 SMAC

SMAC (AGPLv3)/Python SMAC3.

SMAC (sequential model-based algorithm configuration) is a versatile tool for optimising algorithm parameters (or the parameters of some other process we can run automatically or a function we can evaluate, such as a simulation).

SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.

We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.

11 Incoming

Acquisition functions in Bayesian Optimization | Let’s talk about science!

12 References

Allen-Zhu, Li, Singh, et al. 2017. “Near-Optimal Design of Experiments via Regret Minimization.” In PMLR.

———, et al. 2021. “Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach.” Mathematical Programming.

Ament, Daulton, Eriksson, et al. 2024. “Unexpected Improvements to Expected Improvement for Bayesian Optimization.”

Chaloner, and Verdinelli. 1995. “Bayesian Experimental Design: A Review.” Statistical Science.

Chan, Paulson, and Mesbah. 2023. “Safe Explorative Bayesian Optimization - Towards Personalized Treatments in Plasma Medicine.” In 2023 62nd IEEE Conference on Decision and Control (CDC).

Chernoff. 1959. “Sequential Design of Experiments.” The Annals of Mathematical Statistics.

Chowdhury, and Gopalan. 2017. “On Kernelized Multi-Armed Bandits.” In Proceedings of the 34th International Conference on Machine Learning.

“Dagstuhloid Benchmarking.” 2023.

Dani, Hayes, and Kakade. 2007. “The Price of Bandit Information for Online Optimization.” In Proceedings of the 20th International Conference on Neural Information Processing Systems. NIPS’07.

———. 2008. “Stochastic Linear Optimization Under Bandit Feedback.” 21st Annual Conference on Learning Theory.

Feurer, Klein, Eggensperger, et al. 2015. “Efficient and Robust Automated Machine Learning.” In Advances in Neural Information Processing Systems 28.

Foster, Dylan J., Han, Qian, et al. 2024. “Online Estimation via Offline Estimation: An Information-Theoretic Framework.”

Foster, Adam, Jankowiak, Bingham, et al. 2020. “Variational Bayesian Optimal Experimental Design.” arXiv:1903.05480 [Cs, Stat].

Franceschi, Donini, Frasconi, et al. 2017. “On Hyperparameter Optimization in Learning Systems.” In.

Frazier. 2018. “A Tutorial on Bayesian Optimization.”

Garnett. 2023. Bayesian Optimization.

Gelbart, Snoek, and Adams. 2014. “Bayesian Optimization with Unknown Constraints.” In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence. UAI’14.

González-Duque, Michael, Bartels, et al. 2024. “A Survey and Benchmark of High-Dimensional Bayesian Optimization of Discrete Sequences.” In.

Grumitt, Karamanis, and Seljak. 2023. “Flow Annealed Kalman Inversion for Gradient-Free Inference in Bayesian Inverse Problems.”

Grünewälder, Audibert, Opper, et al. 2010. “Regret Bounds for Gaussian Process Bandit Problems.” In.

Hennig, Osborne, and Kersting. 2022. Probabilistic Numerics: Computation as Machine Learning.

Higdon, Gattiker, Williams, et al. 2008. “Computer Model Calibration Using High-Dimensional Output.” Journal of the American Statistical Association.

Hooten, Leeds, Fiechter, et al. 2011. “Assessing First-Order Emulator Inference for Physical Parameters in Nonlinear Mechanistic Models.” Journal of Agricultural, Biological, and Environmental Statistics.

Hutter, Hoos, and Leyton-Brown. 2011. “Sequential Model-Based Optimization for General Algorithm Configuration.” In Learning and Intelligent Optimization. Lecture Notes in Computer Science.

Hutter, Hoos, and Leyton-Brown. 2013. “An Evaluation of Sequential Model-Based Optimization for Expensive Blackbox Functions.” In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation. GECCO ’13 Companion.

Jagalur-Mohan, and Marzouk. 2021. “Batch Greedy Maximization of Non-Submodular Functions: Guarantees and Applications to Experimental Design.” Journal of Machine Learning Research.

Krige. 1951. “A Statistical Approach to Some Basic Mine Valuation Problems on the Witwatersrand.” Journal of the Southern African Institute of Mining and Metallurgy.

Krityakierne, Akhtar, and Shoemaker. 2016. “SOP: Parallel Surrogate Global Optimization with Pareto Center Selection for Computationally Expensive Single Objective Problems.” Journal of Global Optimization.

Kushner. 1964. “A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise.” Journal of Basic Engineering.

Li, Jamieson, DeSalvo, et al. 2017. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” The Journal of Machine Learning Research.

Liu, Sulin, Feng, Eriksson, et al. 2023. “Sparse Bayesian Optimization.” In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics.

Liu, Yi, and Ročková. 2023. “Variable Selection Via Thompson Sampling.” Journal of the American Statistical Association.

Mahdavi, Rahnamayan, and Deb. 2018. “Opposition Based Learning: A Literature Review.” Swarm and Evolutionary Computation.

Matheron. 1963a. Traité de Géostatistique Appliquée. 2. Le Krigeage.

———. 1963b. “Principles of Geostatistics.” Economic Geology.

Močkus. 1975. “On Bayesian Methods for Seeking the Extremum.” In Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1–7, 1974. Lecture Notes in Computer Science.

Müller, Feurer, Hollmann, et al. 2023. “PFNs4BO: In-Context Learning for Bayesian Optimization.” arXiv Preprint arXiv:2305.17535.

O’Hagan. 1978. “Curve Fitting and Optimal Design for Prediction.” Journal of the Royal Statistical Society: Series B (Methodological).

Oliveira, Ott, and Ramos. 2021. “No-Regret Approximate Inference via Bayesian Optimisation.” In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence.

Oliveira, Tiao, and Ramos. 2022. “Batch Bayesian Optimisation via Density-Ratio Estimation with Guarantees.” In Advances in Neural Information Processing Systems.

Rahnamayan, Tizhoosh, and Salama. 2007. “Quasi-Oppositional Differential Evolution.” In 2007 IEEE Congress on Evolutionary Computation.

Regis, and Shoemaker. 2007. “A Stochastic Radial Basis Function Method for the Global Optimization of Expensive Functions.” INFORMS Journal on Computing.

———. 2009. “Parallel Stochastic Global Optimization Using Radial Basis Functions.” INFORMS Journal on Computing.

———. 2013. “Combining Radial Basis Function Surrogates and Dynamic Coordinate Search in High-Dimensional Expensive Black-Box Optimization.” Engineering Optimization.

Ryan, Drovandi, McGree, et al. 2016. “A Review of Modern Computational Algorithms for Bayesian Optimal Design.” International Statistical Review / Revue Internationale de Statistique.

Sacks, Schiller, and Welch. 1989. “Designs for Computer Experiments.” Technometrics.

Sacks, Welch, Mitchell, et al. 1989. “Design and Analysis of Computer Experiments.” Statistical Science.

Shahriari, Swersky, Wang, et al. 2016. “Taking the Human Out of the Loop: A Review of Bayesian Optimization.” Proceedings of the IEEE.

Snoek, Larochelle, and Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems.

Snoek, Swersky, Zemel, et al. 2014. “Input Warping for Bayesian Optimization of Non-Stationary Functions.” In Proceedings of the 31st International Conference on Machine Learning (ICML-14).

Srinivas, Krause, Kakade, et al. 2010. “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design.” In Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML’10.

Staines, and Barber. 2012. “Variational Optimization.”

———. 2013. “Optimization by Variational Bounding.” Computational Intelligence.

Stanton, Alberstein, Frey, et al. 2024. “Closed-Form Test Functions for Biophysical Sequence Optimization Algorithms.”

Steinberg, Oliveira, Ong, et al. 2024. “Variational Search Distributions.”

Swersky, Rubanova, Dohan, et al. 2020. “Amortized Bayesian Optimization over Discrete Spaces.” In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI).

Swersky, Snoek, and Adams. 2013. “Multi-Task Bayesian Optimization.” In Advances in Neural Information Processing Systems 26.

Tiao, Louis C, Klein, Archambeau, et al. 2020. “Bayesian Optimization by Density Ratio Estimation.” In.

Tiao, Louis C., Klein, Seeger, et al. 2021. “BORE: Bayesian Optimization by Density-Ratio Estimation.” In Proceedings of the 38th International Conference on Machine Learning.

Wang, Dahl, Swersky, et al. 2022. “Pre-Training Helps Bayesian Optimization Too.”

Wang, Kim, and Kaelbling. 2018. “Regret Bounds for Meta Bayesian Optimization with an Unknown Gaussian Process Prior.”

Wilson, Hutter, and Deisenroth. 2018. “Maximizing Acquisition Functions for Bayesian Optimization.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS’18.

Wilson, Moriconi, Hutter, et al. 2017. “The Reparameterization Trick for Acquisition Functions.”

Zaballa, and Hui. 2025. “Optimizing Likelihoods via Mutual Information: Bridging Simulation-Based Inference and Bayesian Optimal Experimental Design.”