Closely related to AutoML, for which surrogate optimisation is a popular tool, and likewise to Bayesian model calibration.

## Problem statement

According to Gilles Louppe and Manoj Kumar:

We are interested in solving

\[x^* = \arg \min_x f(x)\]

under the constraints that

- \(f\) is a black box for which no closed form is known (nor its gradients);
- \(f\) is expensive to evaluate;
- evaluations of \(y=f(x)\) may be noisy.

Sometimes we might even have access to gradients, in which case we additionally suppose that, rather than observing \(\nabla f, \nabla^2 f\) directly, we observe random variables \(G(x),H(x)\) with \(\mathbb{E}[G(x)]=\nabla f(x)\) and \(\mathbb{E}[H(x)]=\nabla^2 f(x),\) as in stochastic optimisation.
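A minimal sketch of what such a noisy gradient oracle looks like, assuming a toy objective \(f(x)=(x-1)^2\) and additive Gaussian noise. Both are illustrative choices of mine, not part of the problem statement above, and the function names are hypothetical.

```python
import random


def grad_f(x):
    """Exact gradient of the toy objective f(x) = (x - 1)^2 (hypothetical)."""
    return 2.0 * (x - 1.0)


def noisy_grad(x, rng, noise_sd=1.0):
    """One draw of G(x): the true gradient plus zero-mean Gaussian noise."""
    return grad_f(x) + rng.gauss(0.0, noise_sd)


def mean_noisy_grad(x, n_draws, seed=0):
    """Average n_draws independent copies of G(x); this approaches grad_f(x)."""
    rng = random.Random(seed)
    return sum(noisy_grad(x, rng) for _ in range(n_draws)) / n_draws


if __name__ == "__main__":
    # A single draw is unreliable, but the average is close to the true gradient.
    print(grad_f(3.0), mean_noisy_grad(3.0, 20000))
```

Unbiasedness, \(\mathbb{E}G=\nabla f\), is the only property used here; nothing requires the noise to be Gaussian.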

This black-box setting is similar to the typical framing of reinforcement learning problems, which have the same explore/exploit trade-off, although I do not know the precise disciplinary boundaries that may transect these areas.

The typical setup here is: we fit a surrogate model of the loss surface and optimise that, in the hope that it is computationally cheaper than evaluating the true loss everywhere. An artfully chosen surrogate model can estimate unseen loss values, suggest where to sample next, and possibly even give uncertainty estimates.
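The loop can be sketched in a few lines. This is a deliberately crude illustration, assuming a one-dimensional problem, a quadratic least-squares surrogate and a hypothetical stand-in `expensive_f`. It has no notion of uncertainty, so it only exploits and never explores.

```python
import numpy as np


def expensive_f(x):
    """Stand-in for a costly black box (hypothetical test function)."""
    return (x - 0.3) ** 2 + 0.1 * np.sin(5.0 * x)


def surrogate_minimise(f, lo, hi, n_init=4, n_iter=6):
    """Alternate between fitting a cheap surrogate and sampling at its minimiser."""
    # Initial design: a few evenly spaced evaluations of the expensive function.
    xs = list(np.linspace(lo, hi, n_init))
    ys = [f(x) for x in xs]
    # Dense grid on which only the cheap surrogate is ever evaluated.
    grid = np.linspace(lo, hi, 1001)
    for _ in range(n_iter):
        # Fit a quadratic surrogate to everything observed so far.
        coeffs = np.polyfit(xs, ys, deg=2)
        # Optimise the surrogate, not f, to choose the next sample location.
        x_next = float(grid[np.argmin(np.polyval(coeffs, grid))])
        xs.append(x_next)
        ys.append(f(x_next))
    best = int(np.argmin(ys))
    return xs[best], ys[best], len(ys)


if __name__ == "__main__":
    print(surrogate_minimise(expensive_f, -1.0, 1.0))
```

The expensive function is called only `n_init + n_iter` times; all the searching happens on the surrogate.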

When the surrogate model is in particular a Bayesian posterior over parameter values that we wish to learn, a common name is “Bayesian optimisation”. Gaussian process regression is an obvious method to approximate the loss surface in this case, and this seems to be assumed typically.

This is not crazy. Some of the early work on GP regression (Krige 1951) already includes a surrogate modelling application: how much ore remains in my mine, given my observations? However, GP regressions are not the only possible surrogate models, not even the only possible Bayesian ones, and there is nothing intrinsically Bayesian about estimating the unknown function.

Setting that quibble aside, see Apoorv Agnihotri, Nipun Batra, Exploring Bayesian Optimization for a well-illustrated journey into this field.
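For concreteness, here is a minimal sketch of the Gaussian process surrogate itself, assuming a zero-mean prior, a squared-exponential kernel with fixed hyperparameters and one-dimensional inputs. These are all simplifying assumptions of mine; a real library would fit the hyperparameters.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3, variance=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_test, noise_var=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    # A Cholesky factorisation is the standard numerically stable solve.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(x_test, x_test).diagonal() - np.sum(v ** 2, axis=0)
    # Clip tiny negative variances caused by floating-point round-off.
    return mean, np.sqrt(np.maximum(var, 0.0))


if __name__ == "__main__":
    x_obs = np.array([0.0, 0.5, 1.0])
    mean, sd = gp_posterior(x_obs, np.sin(3.0 * x_obs), np.linspace(0.0, 1.0, 5))
    print(mean, sd)
```

The posterior standard deviation shrinks near observed points and reverts to the prior far from them, which is what lets an optimiser trade off exploration against exploitation.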

Fashionable use: hyperparameter/model selection, e.g. in regularising complex models, which is compactly referred to these days as AutoML.

You could also obviously use adaptive experiments outside of simulations, e.g. in industrial process control, which is where I originally saw this kind of thing, in the form of sequential ANOVA design. That is an incredible idea itself, although it is now years old and so not nearly so hip. This page is full of things that we might describe as, effectively, nonlinear, heteroskedastic sequential ANOVA.

## Lab bandits

Sequential experiment design in the lab.

## Acquisition functions

Active learning, acquisition functions. TBD.
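Two acquisition functions that recur in the tools listed on this page are expected improvement and the confidence-bound family. A minimal sketch for the minimisation convention used above, assuming the surrogate's prediction at a candidate point is Gaussian with mean `mu` and standard deviation `sigma`. The function names are mine, not any library's API.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, y_best):
    """EI for minimisation: E[max(y_best - Y, 0)] with Y ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(y_best - mu, 0.0)
    z = (y_best - mu) / sigma
    return (y_best - mu) * normal_cdf(z) + sigma * normal_pdf(z)


def lower_confidence_bound(mu, sigma, kappa=2.0):
    """LCB for minimisation; smaller is more promising (the mirror of UCB)."""
    return mu - kappa * sigma


if __name__ == "__main__":
    print(expected_improvement(0.0, 1.0, 0.0), lower_confidence_bound(1.0, 0.5))
```

We would evaluate the true function next at the candidate that maximises expected improvement, or at the one that minimises the lower confidence bound; `kappa` sets how much uncertainty is rewarded.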

## Connection to RL

TBD.

## Implementation

### `skopt`

skopt (aka `scikit-optimize`)

[…] is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.

This is a member of the `sklearn` club, which is to say it works well, reliably, predictably, universally, and has amazing tooling, but is not that fast and has few modern fancy fripperies.

### Dragonfly

…is an open source python library for scalable Bayesian optimisation.

Bayesian optimisation is used for optimising black-box functions whose evaluations are usually expensive. Beyond vanilla optimisation techniques, Dragonfly provides an array of tools to scale up Bayesian optimisation to expensive large scale problems. These include features/functionality that are especially suited for high dimensional optimisation (optimising for a large number of variables), parallel evaluations in synchronous or asynchronous settings (conducting multiple evaluations in parallel), multi-fidelity optimisation (using cheap approximations to speed up the optimisation process), and multi-objective optimisation (optimising multiple functions simultaneously).

Python and Fortran, open-source.

### PySOT

Surrogate Optimization Toolbox (pySOT) for global deterministic optimization problems. pySOT is hosted on GitHub.

The main purpose of the toolbox is for optimization of computationally expensive black-box objective functions with continuous and/or integer variables. All variables are assumed to have bound constraints in some form where none of the bounds are infinity. The tighter the bounds, the more efficient are the algorithms since it reduces the search region and increases the quality of the constructed surrogate. This toolbox may not be very efficient for problems with computationally cheap function evaluations. Surrogate models are intended to be used when function evaluations take from several minutes to several hours or more.

This has a huge variety of different surrogate options, a long history, and promises parallel asynchronous evaluation, but is not especially famous for some reason (quality?). Based on (Krityakierne, Akhtar, and Shoemaker 2016; Regis and Shoemaker 2013, 2009, 2007). It is one of the ones that does not particularly emphasise Bayesian methods.

### GPyOpt

Gaussian process optimization using GPy. Performs global optimization with different acquisition functions. Among other functionalities, it is possible to use GPyOpt to optimize physical experiments (sequentially or in batches) and tune the parameters of Machine Learning algorithms. It is able to handle large data sets via sparse Gaussian process models.

By the same lab at Sheffield that brought us GPy.

### Sigopt

sigopt is a commercial product that presumably does a good job. The fact their website does not give even a hint of the price leads me to suspect they are extremely expensive.

### BoTorch/Ax

BoTorch is the PyTorch-based Bayesian optimization toolbox used by Ax, which is an experiment designer.

Ax is a platform for optimizing any kind of experiment, including machine learning experiments, A/B tests, and simulations. Ax can optimize discrete configurations (e.g., variants of an A/B test) using multi-armed bandit optimization, and continuous (e.g., integer or floating point)-valued configurations using Bayesian optimization. This makes it suitable for a wide range of applications.

Ax has been successfully applied to a variety of product, infrastructure, ML, and research applications at Facebook.
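The multi-armed bandit side of this can be illustrated with Thompson sampling over a handful of discrete variants. This is a generic sketch with simulated Bernoulli rewards and uniform Beta(1, 1) priors, all of which are my assumptions; it is not Ax's API, nor necessarily the algorithm Ax uses.

```python
import random


def thompson_sampling(true_rates, n_rounds, seed=0):
    """Bernoulli Thompson sampling with Beta(1, 1) priors on each arm."""
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(n_rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        draws = [rng.betavariate(1 + w, 1 + l) for w, l in zip(wins, losses)]
        # Play the arm whose sampled rate is highest.
        arm = max(range(len(draws)), key=draws.__getitem__)
        # Simulated outcome; in a real experiment this is the observed reward.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    pulls = [w + l for w, l in zip(wins, losses)]
    return pulls, wins


if __name__ == "__main__":
    # Hypothetical conversion rates for three variants of an A/B/C test.
    print(thompson_sampling([0.1, 0.2, 0.6], 2000))
```

Arms that look bad are sampled less and less often, so most of the experimental budget ends up on the best variant without a separate exploration schedule.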

### spearmint

Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper (Snoek, Larochelle, and Adams 2012).

The code consists of several parts. It is designed to be modular to allow swapping out various “driver” and “chooser” modules. The “chooser” modules are implementations of acquisition functions such as expected improvement, UCB or random. The drivers determine how experiments are distributed and run on the system. As the code is designed to run experiments in parallel (spawning a new experiment as soon as a result comes in), this requires some engineering.

Spearmint2 is similar, but more recently updated and fancier; however it has a restrictive license prohibiting wide redistribution without the payment of fees. You may or may not wish to trust the implied level of development and support of 4 Harvard Professors, depending on your application.

Both of the Spearmint options (especially the latter) have opinionated choices of technology stack in order to do their optimizations, which means they can do more work for you, but require more setup, than a simple little thing like `skopt`. Depending on your computing environment this might be an overall plus or a minus.

### SMAC

SMAC (sequential model-based algorithm configuration) is a versatile tool for optimizing algorithm parameters (or the parameters of some other process we can run automatically, or a function we can evaluate, such as a simulation).

SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.

We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.

## Incoming

## References

*PMLR*, 126–35.

*Mathematical Programming* 186 (1): 439–78.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2962–70. Curran Associates, Inc.

*arXiv:1903.05480 [Cs, Stat]*, January.

*Bayesian Optimization*. 1st edition. Cambridge University Press.

*Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence*, 250–59. UAI’14. Arlington, Virginia, United States: AUAI Press.

*Journal of the American Statistical Association* 103 (482): 570–83.

*Journal of Agricultural, Biological, and Environmental Statistics* 16 (4): 475–94.

*Learning and Intelligent Optimization*, 6683:507–23. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

*Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation*, 1209–16. GECCO ’13 Companion. New York, NY, USA: ACM.

*Journal of Machine Learning Research* 22: 62.

*Journal of the Southern African Institute of Mining and Metallurgy* 52 (6): 119–39.

*Journal of Global Optimization* 66 (3): 417–37.

*The Journal of Machine Learning Research* 18 (1): 6765–6816.

*Traité de Géostatistique Appliquée. 2. Le Krigeage*. Editions Technip.

*Economic Geology* 58 (8): 1246–66.

*Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1–7, 1974*, edited by G. I. Marchuk, 400–404. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

*Journal of the Royal Statistical Society: Series B (Methodological)* 40 (1): 1–24.

*INFORMS Journal on Computing* 19 (4): 497–509.

*INFORMS Journal on Computing* 21 (3): 411–26.

*Engineering Optimization* 45 (5): 529–55.

*Technometrics* 31 (1): 41–47.

*Statistical Science* 4 (4): 409–23.

*Advances in Neural Information Processing Systems*, 2951–59. Curran Associates, Inc.

*Proceedings of the 31st International Conference on Machine Learning (ICML-14)*, 1674–82.

*IEEE Transactions on Information Theory* 58 (5): 3250–65.

*arXiv:1212.4507 [Cs, Stat]*, December.

*Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 2004–12. Curran Associates, Inc.
