This article was originally split off from autoML, although neither topic is a strict subset of the other.

The art of choosing the best hyperparameters for an ML model's algorithms, of which there may be many.

Should one bother getting fancy about this? Ben Recht argues that random search is often competitive with highly tuned Bayesian methods in hyperparameter tuning. Kevin Jamieson argues you can be cleverer than that, though. Let's inhale some hype.

## Tracking and choosing hyperparameters

In practice, hyperparameter tuning is bound up with the problems of configuring ML and of tracking progress; see also those pages for practical implementation notes.

## Bayesian/surrogate optimisation

Loosely, we think of interpolating between observations of a loss surface and guessing where the optimal point is. See Bayesian optimisation. This is generic. It is not as popular in practice as I might have assumed, because it turns out to be fairly greedy with data and does not exploit problem-specific ideas such as early stopping, which saves time and is in any case a useful type of neural net regularisation.
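To make the "interpolate, then guess" loop concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the one-dimensional `toy_loss`, the fixed RBF length scale, the zero-mean Gaussian process, the candidate grid and the expected-improvement acquisition rule. Real libraries do all of this with far more care.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with a fixed, assumed length scale."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(xs, ys, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement over `best` when minimising."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf


def toy_loss(x):
    """Hypothetical validation loss as a function of one hyperparameter."""
    return (x - 0.3) ** 2 + 0.1 * math.sin(15 * x)


def bayes_opt(loss, n_init=3, n_iter=12, seed=0):
    """Alternate between fitting the surrogate and evaluating its best guess."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]
    ys = [loss(x) for x in xs]
    grid = [i / 100 for i in range(101)]  # candidate hyperparameter values
    for _ in range(n_iter):
        best = min(ys)
        x_next = max(
            grid,
            key=lambda x: expected_improvement(*gp_posterior(xs, ys, x), best),
        )
        xs.append(x_next)
        ys.append(loss(x_next))
    i = min(range(len(ys)), key=ys.__getitem__)
    return xs[i], ys[i]
```

Calling `bayes_opt(toy_loss)` returns the best hyperparameter value seen and its loss. Note that every evaluation is a full training run in the real setting, which is where the data-greediness complaint comes from.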

HT Cheng Soon Ong for pointing out Why machine learning algorithms are hard to tune (and the fix). His summary cuts to the core:

Machine learning hyperparameters are hard to tune. One way to think of why it is hard, is because it is a Pareto front of multiple objectives. One way to solve that problem is to look at Lagrange multipliers, as proposed by a paper in 1988. (Platt and Barr 1987)
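A toy sketch of the Lagrange-multiplier idea, in the spirit of the differential multiplier method: rather than hand-tuning the weight that trades off two objectives, we state a target for one of them as a constraint and let the multiplier adapt during optimisation. The objective, the constraint, the step size and the damping term here are all assumptions chosen so the answer can be checked by hand; this is not the exact formulation from the post or the paper.

```python
def constrained_descent(steps=2000, lr=0.05, damping=1.0):
    """Minimise f(x, y) = x^2 + y^2 subject to g(x, y) = x + y - 1 = 0.

    Gradient descent on (x, y), gradient ascent on the multiplier `lam`.
    The damping term adds a quadratic penalty that stabilises the dynamics.
    """
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        g = x + y - 1.0  # current constraint violation
        # Gradient of f + lam * g + 0.5 * damping * g^2 with respect to x and y.
        dx = 2.0 * x + lam + damping * g
        dy = 2.0 * y + lam + damping * g
        x -= lr * dx
        y -= lr * dy
        lam += lr * g  # the multiplier grows while the constraint is violated
    return x, y, lam
```

For this toy problem the constrained optimum is `x = y = 0.5` with multiplier `-1`, which is what the loop settles on. The trade-off weight is found by the optimiser rather than by a hyperparameter sweep.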

A synoptic overview of the trendiest strategies can be found in Peter Cotton's microprediction/humpday: Elo ratings for global black box derivative-free optimizers:

Behold! Fifty strategies assigned Elo ratings depending on dimension of the problem and number of function evaluations allowed.

Hello and welcome to HumpDay, a package that helps you choose a Python global optimizer package, and strategy therein, from Ax-Platform, bayesian-optimization, DLib, HyperOpt, NeverGrad, Optuna, Platypus, PyMoo, PySOT, Scipy classic and shgo, Skopt, nlopt, Py-Bobyqa, UltraOpt and maybe others by the time you read this. It also presents some of their functionality in a common calling syntax.

The introductory blog posts are enlightening:

## Differentiable hyperparameter optimisation

## Completely random search

Just what you would think.
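For completeness, a minimal sketch of what that looks like. The search space (`learning_rate`, `n_layers`, `dropout`) and the stand-in `toy_validation_loss` are hypothetical; in practice the loss would be a full train-and-validate run.

```python
import math
import random


def sample_config(rng):
    """Draw one configuration from an assumed search space."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform over [1e-5, 1e-1]
        "n_layers": rng.randint(1, 5),               # integer in 1..5 inclusive
        "dropout": rng.uniform(0.0, 0.5),
    }


def toy_validation_loss(config):
    """Hypothetical stand-in for training a model and scoring it."""
    return (
        (math.log10(config["learning_rate"]) + 3) ** 2
        + 0.1 * abs(config["n_layers"] - 3)
        + config["dropout"]
    )


def random_search(loss, n_trials=50, seed=0):
    """Sample configurations independently and keep the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=loss)
```

Because the trials are independent, they parallelise trivially, which is a large part of the appeal.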

## Adaptive random search

Random search now comes in an adaptive flavour that exploits the iterative (SGD-style) fitting procedure by stopping unpromising runs early: Hyperband (Lisha Li et al. 2017) and its asynchronous variant ASHA (Liam Li et al. 2020).
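The building block of these methods is successive halving: train many random configurations for a small budget, keep the best fraction, and repeat with a larger budget. Below is a minimal synchronous sketch; it is not Hyperband itself (which runs several such brackets with different starting budgets) nor ASHA (which promotes configurations asynchronously). The `toy_partial_loss` function and its `final_loss` and `slowness` fields are hypothetical stand-ins for a partially trained model.

```python
import random


def toy_partial_loss(config, budget):
    """Hypothetical loss after `budget` epochs, approaching `final_loss`."""
    return config["final_loss"] + config["slowness"] / budget


def successive_halving(configs, loss_at, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at each rung.

    Returns the surviving configuration and the largest budget used.
    """
    if not configs:
        raise ValueError("need at least one configuration")
    survivors = list(configs)
    budget = min_budget
    while True:
        ranked = sorted(survivors, key=lambda c: loss_at(c, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        if len(survivors) == 1:
            return survivors[0], budget
        budget *= eta  # survivors earn a larger training budget


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [
        {"final_loss": rng.random(), "slowness": rng.random()}
        for _ in range(27)
    ]
    print(successive_halving(candidates, toy_partial_loss))
```

With 27 candidates and `eta=3`, the rungs have 27, 9 and 3 configurations at budgets 1, 3 and 9. The known weakness is visible in the toy loss: a configuration with a good `final_loss` but high `slowness` can be pruned early.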

## Implementations

Most of the implementations use, internally, a surrogate model for parameter tuning, but wrap it with some tools to control and launch experiments in parallel, early termination etc.

Arranged so that the top few are hyped and popular, and those after are less renowned hipster options.

Not yet filed:

- Keras Tuner
- Tune: Scalable Hyperparameter Tuning (Ray v2.0.0.dev0)
- Welcome To Neural Network Intelligence !!! An open source AutoML toolkit for neural architecture search, model compression and hyper-parameter tuning (NNI v2.0)
- AutoGluon: AutoML Toolkit for Deep Learning (AutoGluon Documentation 0.0.14)

### Determined

determined includes hyperparameter tuning which is not in fact based on a surrogate surface, but on early-stopping pruning of crappy models in a random search, i.e. fancy random search.

### Ray

Tune is a Python library for experiment execution and hyperparameter tuning at any scale. Core features:

- Launch a multi-node distributed hyperparameter sweep in less than 10 lines of code.
- Supports any machine learning framework, including PyTorch, XGBoost, MXNet, and Keras.
- Automatically manages checkpoints and logging to TensorBoard.
- Choose among state of the art algorithms such as Population Based Training (PBT), BayesOptSearch, HyperBand/ASHA. (Liam Li et al. 2020)

### Optuna

optuna (Akiba et al. 2019) supports fancy neural net training; similar to hyperopt AFAICT, except that it supports Covariance Matrix Adaptation, whatever that is (see Hansen (2016)).

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to our define-by-run API, the code written with Optuna enjoys high modularity, and the user of Optuna can dynamically construct the search spaces for the hyperparameters.

### hyperopt.py

`hyperopt` (J. Bergstra, Yamins, and Cox 2013) is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.

Currently two algorithms are implemented in hyperopt:

- Random Search
- Tree of Parzen Estimators (TPE)

Hyperopt has been designed to accommodate Bayesian optimization algorithms based on Gaussian processes and regression trees, but these are not currently implemented.

All algorithms can be run either serially, or in parallel by communicating via MongoDB or Apache Spark.

### Hyperopt.jl

### auto-sklearn

auto-sklearn has recently been upgraded. Details TBD (Feurer et al. 2020).

### skopt

`skopt` (aka `scikit-optimize`)

[…] is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.

### spearmint

Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper (Snoek, Larochelle, and Adams 2012).

The code consists of several parts. It is designed to be modular to allow swapping out various "driver" and "chooser" modules. The "chooser" modules are implementations of acquisition functions such as expected improvement, UCB or random. The drivers determine how experiments are distributed and run on the system. As the code is designed to run experiments in parallel (spawning a new experiment as soon a result comes in), this requires some engineering.

`Spearmint2` is similar, but more recently updated and fancier; however, it has a restrictive license prohibiting wide redistribution without the payment of fees. You may or may not wish to trust the implied level of development and support of 4 Harvard professors, depending on your application.

Both of the Spearmint options (especially the latter) make opinionated choices of technology stack in order to do their optimizations, which means they can do more work for you, but require more setup, than a simple little thing like `skopt`. Depending on your computing environment this might be an overall plus or a minus.

### SMAC

`SMAC` (sequential model-based algorithm configuration; AGPLv3) is a versatile tool for optimizing algorithm parameters (or the parameters of some other process we can run automatically, or a function we can evaluate, such as a simulation).

SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.

We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.

Python interface through pysmac.

### AutoML

Won the land-grab for the name `automl` but is now unmaintained.

A quick overview of buzzwords, this project automates:

- Analytics (pass in data, and auto_ml will tell you the relationship of each variable to what it is you're trying to predict).
- Feature Engineering (particularly around dates, and soon, NLP).
- Robust Scaling (turning all values into their scaled versions between the range of 0 and 1, in a way that is robust to outliers, and works with sparse matrices).
- Feature Selection (picking only the features that actually prove useful).
- Data formatting (turning a list of dictionaries into a sparse matrix, one-hot encoding categorical variables, taking the natural log of y for regression problems).
- Model Selection (which model works best for your problem).
- Hyperparameter Optimization (what hyperparameters work best for that model).
- Ensembling Subpredictors (automatically training up models to predict smaller problems within the meta problem).
- Ensembling Weak Estimators (automatically training up weak models on the larger problem itself, to inform the meta-estimator's decision).

## References

*Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*.

*Neural Computation* 12 (8): 1889–1900.

*Advances in Neural Information Processing Systems*, 2546–54. Curran Associates, Inc.

*Journal of Machine Learning Research* 13: 281–305.

*ICML*, 9.

*International Conference on Artificial Intelligence and Statistics*, 318–26.

*Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468)*, 87–94.

*arXiv:2007.04074 [Cs, Stat]*, July.

*Advances in Neural Information Processing Systems 28*, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2962–70. Curran Associates, Inc.

*Advances in Neural Information Processing Systems 20*, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 377–84. Curran Associates, Inc.

*International Conference on Machine Learning*, 1165–73. PMLR.

*Proceedings of IJCAI, 2016*.

*Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence*, 250–59. UAI'14. Arlington, Virginia, United States: AUAI Press.

*arXiv:1604.00772 [Cs, Stat]*, April.

*Learning and Intelligent Optimization*, 6683:507–23. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

*Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation*, 1209–16. GECCO '13 Companion. New York, NY, USA: ACM.

*arXiv:1502.07943 [Cs, Stat]*, February.

*International Conference on Artificial Intelligence and Statistics*, 133–42. PMLR.

*arXiv:1810.05934 [Cs, Stat]*, March.

*arXiv:1603.06560 [Cs, Stat]*, March.

*The Journal of Machine Learning Research* 18 (1): 6765–6816.

*arXiv:1806.09055 [Cs, Stat]*, April.

*arXiv:1802.09419 [Cs]*, February.

*International Conference on Artificial Intelligence and Statistics*, 1540–52. PMLR.

*Neural Computation* 11 (5): 1035–68.

*Proceedings of the 32nd International Conference on Machine Learning*, 2113–22. PMLR.

*Optimization Techniques IFIP Technical Conference: Novosibirsk, July 1–7, 1974*, edited by G. I. Marchuk, 400–404. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer.

*Journal of the Royal Statistical Society: Series B (Methodological)* 40 (1): 1–24.

*Proceedings of the 1987 International Conference on Neural Information Processing Systems*, 612–21. NIPS'87. Cambridge, MA, USA: MIT Press.

*Proceedings of the 32nd International Conference on Machine Learning (ICML-15)*, 1218–26. ICML'15. Lille, France: JMLR.org.

*Advances in Neural Information Processing Systems*, 2951–59. Curran Associates, Inc.

*Proceedings of the 31st International Conference on Machine Learning (ICML-14)*, 1674–82.

*IEEE Transactions on Information Theory* 58 (5): 3250–65.

*Advances in Neural Information Processing Systems 26*, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 2004–12. Curran Associates, Inc.

*Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 847–55. KDD '13. New York, NY, USA: ACM.

*arXiv:2104.10201 [Cs, Stat]*, April.
