Hyperparameter optimization in ML

Replacing a hyperparameter problem with a hyperhyperparameter problem, which feels like progress, I guess

Split off from autoML.

The art of choosing the best hyperparameters for your ML model’s algorithms, of which there may be many.

Should you bother getting fancy about this? Ben Recht argues no, that random search is competitive with highly tuned Bayesian methods in hyperparameter tuning. Kevin Jamieson argues you can be cleverer than that though. Let’s inhale some hype.
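For reference, the random search baseline is tiny. The following is a minimal sketch, not anyone's published implementation: `validation_loss` is a hypothetical stand-in for "train the model and report validation loss", and the two hyperparameters and their ranges are invented for illustration.

```python
import math
import random


def validation_loss(log10_lr, weight_decay):
    """Hypothetical stand-in for 'train a model, return validation loss'."""
    return (log10_lr + 2.0) ** 2 + 10.0 * (weight_decay - 0.1) ** 2


def random_search(n_trials, seed=0):
    """Sample configurations independently and keep the best one seen."""
    rng = random.Random(seed)
    best_loss, best_config = math.inf, None
    for _ in range(n_trials):
        config = {
            "log10_lr": rng.uniform(-5.0, 0.0),  # learning rate on a log scale
            "weight_decay": rng.uniform(0.0, 0.5),
        }
        loss = validation_loss(**config)
        if loss < best_loss:
            best_loss, best_config = loss, config
    return best_loss, best_config


best_loss, best_config = random_search(n_trials=200)
print(best_loss, best_config)
```

Every trial is independent of every other, so this parallelises trivially; the fancier methods have to earn their extra machinery against that.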

Bayesian/surrogate optimisation

Loosely, we interpolate between observations of a loss surface and guess where the optimal point is. See Bayesian optimisation. This is generic. It is not as popular in practice as I might have assumed, because it turns out to be fairly greedy with data and does not exploit problem-specific ideas such as early stopping, which saves time and is in any case a useful type of neural net regularisation.

Differentiable hyperparameter optimisation

Maclaurin, Duvenaud, and Adams (2015):

Hyperparameter optimization by gradient descent

Each meta-iteration runs an entire training run of stochastic gradient descent to optimize elementary parameters (weights 1 and 2). Gradients of the validation loss with respect to hyperparameters are then computed by propagating gradients back through the elementary training iterations. Hyperparameters (in this case, learning rate and momentum schedules) are then updated in the direction of this hypergradient. … The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and meta-learning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the meta-learning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector.

Their implementation, hypergrad, is no longer maintained. drmad by Fu et al. (2016), possibly the same idea, is also not maintained.

This is a really neat trick, but it has limited applicability, since it generally requires an estimate of the overfitting penalty in the style of a degrees-of-freedom penalty. There are also various assumptions on the optimisation and model process that I forget right now.
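To make the hypergradient idea concrete, here is a toy sketch under heavy assumptions: one scalar weight, a quadratic training loss, a quadratic validation loss with a slightly different target, and a single constant learning rate as the only hyperparameter. It carries the derivative of the weight with respect to the learning rate forward through the training loop, which is simpler than, and not the same as, the reverse-mode pass through reversible SGD that Maclaurin, Duvenaud, and Adams (2015) describe. All names and constants are hypothetical.

```python
def train(lr, steps=20, w0=0.0, curvature=1.0, train_target=1.0):
    """Gradient descent on the training loss 0.5 * curvature * (w - target)^2.

    Alongside w we carry dw_dlr, the derivative of the current weight with
    respect to the learning rate, updated by the chain rule at every step.
    """
    w, dw_dlr = w0, 0.0
    for _ in range(steps):
        grad = curvature * (w - train_target)
        # w_next = w - lr * grad, hence
        # d(w_next)/d(lr) = dw_dlr - grad - lr * curvature * dw_dlr
        dw_dlr = dw_dlr - grad - lr * curvature * dw_dlr
        w = w - lr * grad
    return w, dw_dlr


def val_loss_and_hypergrad(lr, val_target=0.8):
    """Validation loss after training, and its derivative w.r.t. lr."""
    w, dw_dlr = train(lr)
    val_loss = 0.5 * (w - val_target) ** 2
    return val_loss, (w - val_target) * dw_dlr


# Meta-optimisation: gradient descent on the learning rate itself.
lr = 0.01
meta_step = 0.005  # the hyperhyperparameter
for _ in range(100):
    _, hypergrad = val_loss_and_hypergrad(lr)
    lr -= meta_step * hypergrad

print(lr)
```

Note that the meta-loop needs its own step size, which is the hyperhyperparameter problem from the top of the page.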


Implementations

Most of the implementations here use, internally, a surrogate model for parameter tuning, but wrap it with tools to launch and control experiments in parallel, terminate them early, etc.

Arranged so that the top few are hyped and popular; after that come the less fancy hipster options.


determined includes hyperparameter tuning which does not in fact use a surrogate surface, but rather early-stopping pruning of crappy models in a random search, i.e. fancy random search.
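The "fancy random search" idea can be sketched as plain synchronous successive halving: draw random configurations, train them all a little, discard the worst, and give the survivors more budget. This is an illustration of the general technique only, not determined's actual scheduler; `partial_loss` is a hypothetical learning curve standing in for real training, and all names and defaults are made up.

```python
import random


def partial_loss(config, budget):
    """Hypothetical learning curve: loss after `budget` epochs of training."""
    return config["final_loss"] + config["gap"] / budget


def successive_halving(n_configs=27, min_budget=1, eta=3, seed=0):
    rng = random.Random(seed)
    # Random search part: draw every candidate up front.
    survivors = [
        {"id": i, "final_loss": rng.uniform(0.1, 1.0), "gap": rng.uniform(0.0, 1.0)}
        for i in range(n_configs)
    ]
    budget = min_budget
    rounds = []  # (number of configurations trained, budget each) per round
    while len(survivors) > 1:
        rounds.append((len(survivors), budget))
        # Train each survivor up to the current budget and rank by loss.
        survivors.sort(key=lambda c: partial_loss(c, budget))
        # Early-stop all but the best 1/eta, then give the rest more budget.
        survivors = survivors[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], rounds


winner, rounds = successive_halving()
print(winner, rounds)
```

With the defaults this trains 27 configurations for 1 epoch, 9 for 3 epochs and 3 for 9 epochs. The pruning is greedy, so a configuration that learns slowly but ends up best can be discarded early.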


Ray includes Ray.Tune

Tune is a Python library for experiment execution and hyperparameter tuning at any scale.


optuna (Akiba et al. 2019) supports fancy neural net training; very similar to hyperopt AFAICT, except that it supports Covariance Matrix Adaptation, whatever that is (see Hansen (2016)).

Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Thanks to our define-by-run API, the code written with Optuna enjoys high modularity, and the user of Optuna can dynamically construct the search spaces for the hyperparameters.


hyperopt (Bergstra, Yamins, and Cox 2013)

is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.

Currently two algorithms are implemented in hyperopt:

  • Random Search
  • Tree of Parzen Estimators (TPE)

Hyperopt has been designed to accommodate Bayesian optimization algorithms based on Gaussian processes and regression trees, but these are not currently implemented.

All algorithms can be run either serially, or in parallel by communicating via MongoDB or Apache Spark.
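What makes such a search space "awkward" is that some dimensions only exist conditionally on the value of others. A minimal sketch in plain Python (deliberately not hyperopt's API; the model names, parameters and ranges are invented for illustration):

```python
import random


def sample_config(rng):
    """Draw one configuration; which dimensions exist depends on `model`."""
    model = rng.choice(["svm", "random_forest"])  # discrete dimension
    if model == "svm":
        return {
            "model": model,
            "log10_C": rng.uniform(-3.0, 3.0),  # real-valued dimension
            "kernel": rng.choice(["linear", "rbf"]),
        }
    return {
        "model": model,
        "n_trees": rng.randrange(10, 501),  # integer-valued dimension
        "max_depth": rng.choice([None, 4, 8, 16]),
    }


rng = random.Random(0)
samples = [sample_config(rng) for _ in range(100)]
print(samples[0])
```

Sampling from such a tree is easy; the hard part, which the libraries handle, is fitting a model over it that can propose good next points.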


auto-sklearn has recently been upgraded. Details TBD (Feurer et al. 2020).


skopt (aka scikit-optimize)

[…] is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization.



Spearmint is a package to perform Bayesian optimization according to the algorithms outlined in the paper (Snoek, Larochelle, and Adams 2012).

The code consists of several parts. It is designed to be modular to allow swapping out various ‘driver’ and ‘chooser’ modules. The ‘chooser’ modules are implementations of acquisition functions such as expected improvement, UCB or random. The drivers determine how experiments are distributed and run on the system. As the code is designed to run experiments in parallel (spawning a new experiment as soon as a result comes in), this requires some engineering.
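For orientation, the two non-trivial acquisition functions named above have simple closed forms when the surrogate's prediction at a candidate point is Gaussian. The sketch below uses the standard textbook formulas for a minimisation problem; it is not Spearmint's code, and the function names and the `kappa` default are assumptions.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Expected amount by which a candidate beats the best loss seen so far."""
    if std <= 0.0:
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    return (best_so_far - mean) * normal_cdf(z) + std * normal_pdf(z)


def lower_confidence_bound(mean, std, kappa=2.0):
    """Minimisation analogue of UCB: smaller values are more promising."""
    return mean - kappa * std


# Two candidates with the same predicted loss but different uncertainty:
print(expected_improvement(mean=1.0, std=0.1, best_so_far=0.9))
print(expected_improvement(mean=1.0, std=1.0, best_so_far=0.9))
```

The more uncertain candidate scores higher under both criteria, which is how these choosers trade off exploring unknown regions against exploiting known good ones.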

Spearmint2 is similar, but more recently updated and fancier; however, it has a restrictive license prohibiting wide redistribution without the payment of fees. You may or may not wish to trust the implied level of development and support of four Harvard professors, depending on your application.

Both of the Spearmint options (especially the latter) have opinionated choices of technology stack in order to do their optimizations, which means they can do more work for you, but require more setup, than a simple little thing like skopt. Depending on your computing environment this might be an overall plus or a minus.



SMAC (sequential model-based algorithm configuration) is a versatile tool for optimizing algorithm parameters (or the parameters of some other process we can run automatically, or a function we can evaluate, such as a simulation).

SMAC has helped us speed up both local search and tree search algorithms by orders of magnitude on certain instance distributions. Recently, we have also found it to be very effective for the hyperparameter optimization of machine learning algorithms, scaling better to high dimensions and discrete input dimensions than other algorithms. Finally, the predictive models SMAC is based on can also capture and exploit important information about the model domain, such as which input variables are most important.

We hope you find SMAC similarly useful. Ultimately, we hope that it helps algorithm designers focus on tasks that are more scientifically valuable than parameter tuning.

Python interface through pysmac.



auto_ml won the land-grab for the name automl but is now unmaintained.

A quick overview of buzzwords, this project automates:

  • Analytics (pass in data, and auto_ml will tell you the relationship of each variable to what it is you’re trying to predict).
  • Feature Engineering (particularly around dates, and soon, NLP).
  • Robust Scaling (turning all values into their scaled versions between the range of 0 and 1, in a way that is robust to outliers, and works with sparse matrices).
  • Feature Selection (picking only the features that actually prove useful).
  • Data formatting (turning a list of dictionaries into a sparse matrix, one-hot encoding categorical variables, taking the natural log of y for regression problems).
  • Model Selection (which model works best for your problem).
  • Hyperparameter Optimization (what hyperparameters work best for that model).
  • Ensembling Subpredictors (automatically training up models to predict smaller problems within the meta problem).
  • Ensembling Weak Estimators (automatically training up weak models on the larger problem itself, to inform the meta-estimator’s decision).

Abdel-Gawad, Ahmed, and Simon Ratner. 2007. “Adaptive Optimization of Hyperparameters in L2-Regularised Logistic Regression.” http://cs229.stanford.edu/proj2007/AbdelGawadRatner-AdaptiveHyperparameterOptimization.pdf.

Akiba, Takuya, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. “Optuna: A Next-Generation Hyperparameter Optimization Framework.” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Bengio, Yoshua. 2000. “Gradient-Based Optimization of Hyperparameters.” Neural Computation 12 (8): 1889–1900. https://doi.org/10.1162/089976600300015187.

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13: 281–305. http://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf.

Bergstra, James S., Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. “Algorithms for Hyper-Parameter Optimization.” In Advances in Neural Information Processing Systems, 2546–54. Curran Associates, Inc. http://papers.nips.cc/paper/4443-algorithms-for-hyper-parameter-optimization.

Bergstra, J, D Yamins, and D D Cox. 2013. “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures.” In ICML, 9.

Domke, Justin. 2012. “Generic Methods for Optimization-Based Modeling.” In International Conference on Artificial Intelligence and Statistics, 318–26. http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2012_Domke12.pdf.

Eggensperger, Katharina, Matthias Feurer, Frank Hutter, James Bergstra, Jasper Snoek, Holger H. Hoos, and Kevin Leyton-Brown. n.d. “Towards an Empirical Foundation for Assessing Bayesian Optimization of Hyperparameters.” Accessed August 21, 2017. http://www.automl.org/papers/13-BayesOpt_EmpiricalFoundation.pdf.

Eigenmann, R., and J. A. Nossek. 1999. “Gradient Based Adaptive Regularization.” In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), 87–94. https://doi.org/10.1109/NNSP.1999.788126.

Feurer, Matthias, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. “Auto-Sklearn 2.0: The Next Generation.” July 8, 2020. http://arxiv.org/abs/2007.04074.

Feurer, Matthias, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. “Efficient and Robust Automated Machine Learning.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 2962–70. Curran Associates, Inc. http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf.

Foo, Chuan-sheng, Chuong B. Do, and Andrew Y. Ng. 2008. “Efficient Multiple Hyperparameter Learning for Log-Linear Models.” In Advances in Neural Information Processing Systems 20, edited by J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, 377–84. Curran Associates, Inc. http://papers.nips.cc/paper/3286-efficient-multiple-hyperparameter-learning-for-log-linear-models.pdf.

Franceschi, Luca, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. 2017. “On Hyperparameter Optimization in Learning Systems.” https://arxiv.org/abs/1703.01785.

Fu, Jie, Hongyin Luo, Jiashi Feng, Kian Hsiang Low, and Tat-Seng Chua. 2016. “DrMAD: Distilling Reverse-Mode Automatic Differentiation for Optimizing Hyperparameters of Deep Neural Networks.” In Proceedings of IJCAI, 2016. http://arxiv.org/abs/1601.00917.

Gelbart, Michael A., Jasper Snoek, and Ryan P. Adams. 2014. “Bayesian Optimization with Unknown Constraints.” In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, 250–59. UAI’14. Arlington, Virginia, United States: AUAI Press. http://hips.seas.harvard.edu/files/gelbart-constrained-uai-2014.pdf.

Grünewälder, Steffen, Jean-Yves Audibert, Manfred Opper, and John Shawe-Taylor. 2010. “Regret Bounds for Gaussian Process Bandit Problems.” In, 9:273–80. https://hal-enpc.archives-ouvertes.fr/hal-00654517/document.

Hansen, Nikolaus. 2016. “The CMA Evolution Strategy: A Tutorial.” April 4, 2016. http://arxiv.org/abs/1604.00772.

Hutter, Frank, Holger H. Hoos, and Kevin Leyton-Brown. 2011. “Sequential Model-Based Optimization for General Algorithm Configuration.” In Learning and Intelligent Optimization, 6683:507–23. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-25566-3_40.

Hutter, Frank, Holger Hoos, and Kevin Leyton-Brown. 2013. “An Evaluation of Sequential Model-Based Optimization for Expensive Blackbox Functions.” In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation, 1209–16. GECCO ’13 Companion. New York, NY, USA: ACM. https://doi.org/10.1145/2464576.2501592.

Jamieson, Kevin, and Ameet Talwalkar. 2015. “Non-Stochastic Best Arm Identification and Hyperparameter Optimization.” February 27, 2015. http://arxiv.org/abs/1502.07943.

Kandasamy, Kirthevasan, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. 2018. “Parallelised Bayesian Optimisation via Thompson Sampling.” In International Conference on Artificial Intelligence and Statistics, 133–42. PMLR. http://proceedings.mlr.press/v84/kandasamy18a.html.

Li, Liam, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. 2020. “A System for Massively Parallel Hyperparameter Tuning.” March 15, 2020. http://arxiv.org/abs/1810.05934.

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016. “Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits.” March 21, 2016. http://arxiv.org/abs/1603.06560.

———. 2017. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” The Journal of Machine Learning Research 18 (1): 6765–6816. https://jmlr.csail.mit.edu/papers/v18/16-558.html.

Liu, Hanxiao, Karen Simonyan, and Yiming Yang. 2019. “DARTS: Differentiable Architecture Search.” April 23, 2019. http://arxiv.org/abs/1806.09055.

Lorraine, Jonathan, and David Duvenaud. 2018. “Stochastic Hyperparameter Optimization Through Hypernetworks.” February 26, 2018. http://arxiv.org/abs/1802.09419.

Lorraine, Jonathan, Paul Vicol, and David Duvenaud. 2020. “Optimizing Millions of Hyperparameters by Implicit Differentiation.” In International Conference on Artificial Intelligence and Statistics, 1540–52. PMLR. http://proceedings.mlr.press/v108/lorraine20a.html.

MacKay, David JC. 1999. “Comparison of Approximate Methods for Handling Hyperparameters.” Neural Computation 11 (5): 1035–68. https://doi.org/10.1162/089976699300016331.

Maclaurin, Dougal, David K. Duvenaud, and Ryan P. Adams. 2015. “Gradient-Based Hyperparameter Optimization Through Reversible Learning.” In ICML, 2113–22. http://www.jmlr.org/proceedings/papers/v37/maclaurin15.pdf.

Močkus, J. 1975. “On Bayesian Methods for Seeking the Extremum.” In Optimization Techniques IFIP Technical Conference, edited by Prof Dr G. I. Marchuk, 400–404. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-662-38527-2_55.

Real, Esteban, Chen Liang, David R. So, and Quoc V. Le. 2020. “AutoML-Zero: Evolving Machine Learning Algorithms from Scratch,” March. https://arxiv.org/abs/2003.03384v1.

Salimans, Tim, Diederik Kingma, and Max Welling. 2015. “Markov Chain Monte Carlo and Variational Inference: Bridging the Gap.” In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 1218–26. ICML’15. Lille, France: JMLR.org. http://proceedings.mlr.press/v37/salimans15.html.

Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems, 2951–9. Curran Associates, Inc. http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.

Snoek, Jasper, Kevin Swersky, Rich Zemel, and Ryan Adams. 2014. “Input Warping for Bayesian Optimization of Non-Stationary Functions.” In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1674–82. http://www.jmlr.org/proceedings/papers/v32/snoek14.pdf.

Srinivas, Niranjan, Andreas Krause, Sham M. Kakade, and Matthias Seeger. 2012. “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design.” IEEE Transactions on Information Theory 58 (5): 3250–65. https://doi.org/10.1109/TIT.2011.2182033.

Swersky, Kevin, Jasper Snoek, and Ryan P Adams. 2013. “Multi-Task Bayesian Optimization.” In Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, 2004–12. Curran Associates, Inc. http://papers.nips.cc/paper/5086-multi-task-bayesian-optimization.pdf.

Thornton, Chris, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2013. “Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms.” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 847–55. KDD ’13. New York, NY, USA: ACM. https://doi.org/10.1145/2487575.2487629.

Touchette, Hugo. 2011. “A Basic Introduction to Large Deviations: Theory, Applications, Simulations,” June. https://arxiv.org/abs/1106.4146v3.