Random-forest-like methods

A selection of randomly stopped clocks is never far from wrong.



Doubling down on ensemble methods; mixing predictions from many weak learners (in this case decision trees) to get strong learners. Boosting, bagging and other weak-learner ensembles.

There are many flavours of random-forest-like learning systems. The rule of thumb seems to be β€œFast to train, fast to use. Gets you results. May not get you answers.” In that regard they resemble neural networks, but from the previous hype cycle.

Reasons for popularity:

  1. Decision trees can easily be applied to just about any tabular data so input preprocessing can be minimal
  2. These methods are in a certain sense self-regularising, so you can skip (certain) hyperparameter tuning
  3. There is some kind of tractable asymptotic performance analysis available for some apparently?

Related: model averaging, neural ensembles, dropout, bootstrap.

Random trees, forests, jungles

Self-regularising properties

Jeremy Kun: Why Boosting Doesn’t Overfit:

Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional β€œweak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t over-fit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty. […] this phenomenon is a fact about voting schemes, not boosting in particular.

πŸ—

Gradient boosting

The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function Breiman (1997). Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman, (J. H. Friedman 2001, 2002) simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean (Mason et al. 1999). The later two papers introduced the view of boosting algorithms as iterative functional gradient descent algorithms. That is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.

Bayes

The Bayesian Additive Regression Trees Chipman, George, and McCulloch (2010), are wildly popular and successful in machine learning competitions. Kenneth Tay introduces them well.

Implementations.

LightGBM

LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

xgboost

xgboost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.

See also * chengsoonong/xgboost-tuner: A library for automatically tuning XGBoost parameters.

catboost

catboost.

surfin

Surfin by shftan

This R package computes uncertainty for random forest predictions using a fast implementation of random forests in C++. This is an exciting time for research into the theoretical properties of random forests. This R package aims to provide all state-of-the-art variance estimates in one place, to expedite research in this area and make it easier for practitioners to compare estimates.

Two variance estimates are provided: U-statistics based (Mentch & Hooker, 2016) and infinitesimal jackknife on bootstrap samples (Wager, Hastie, Efron, 2014), the latter as a wrapper to the authors’ R code randomForestCI.

More variance estimates coming soon: (1) Bootstrap-of-little-bags (Sexton and Laake 2009) (2) Infinitesimal jackknife on subsamples (Wager & Athey, 2017; Athey, Tibshirani, Wager, 2016) as a wrapper to the authors’ R package grf.

Vignette: How Uncertain Are Your Random Forest Predictions?

bartmachine

bartmachine

We present a new package in R implementing Bayesian additive regression trees (BART). The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction It is significantly faster than the current R implementation, parallelized, and capable of handling both large sample sizes and high-dimensional data.

References

Alon, Noga, Alon Gonen, Elad Hazan, and Shay Moran. 2020. β€œBoosting Simple Learners.” arXiv:2001.11704 [Cs, Stat], May.
Alquier, Pierre. 2021. β€œUser-Friendly Introduction to PAC-Bayes Bounds.” arXiv:2110.11216 [Cs, Math, Stat], October.
Athey, Susan, Julie Tibshirani, and Stefan Wager. 2019. β€œGeneralized Random Forests.” Annals of Statistics 47 (2): 1148–78.
Athey, Susan, and Stefan Wager. 2019. β€œEstimating Treatment Effects with Causal Forests: An Application.” arXiv:1902.07409 [Stat], February.
Balog, Matej, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, and Yee Whye Teh. 2016. β€œThe Mondrian Kernel.” arXiv:1606.05241 [Stat], June.
Balog, Matej, and Yee Whye Teh. 2015. β€œThe Mondrian Process for Machine Learning.” arXiv:1507.05181 [Cs, Stat], July.
Bickel, Peter J., Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, TeΓ³filo ValdΓ©s, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. 2006. β€œRegularization in Statistics.” Test 15 (2): 271–344.
Breiman, Leo. 1996. β€œBagging Predictors.” Machine Learning 24 (2): 123–40.
β€”β€”β€”. 1997. β€œArcing the Edge.” Statistics Department, University of California, Berkeley.
BΓΌhlmann, Peter, and Sara van de Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. 2011 edition. Heidelberg ; New York: Springer.
Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. 2010. β€œBART: Bayesian Additive Regression Trees.” The Annals of Applied Statistics 4 (1): 266–98.
Clarke, Bertrand. 2003. β€œComparing Bayes Model Averaging and Stacking When Model Approximation Error Cannot Be Ignored.” The Journal of Machine Learning Research 4 (null): 683–712.
Criminisi, Antonio, Jamie Shotton, and Ender Konukoglu. 2012. Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning. Vol. 7.
Davis, Richard A., and Mikkel S. Nielsen. 2020. β€œModeling of Time Series Using Random Forests: Theoretical Developments.” Electronic Journal of Statistics 14 (2): 3644–71.
DΓ­az-Avalos, Carlos, P. Juan, and J. Mateu. 2012. β€œSimilarity Measures of Conditional Intensity Functions to Test Separability in Multidimensional Point Processes.” Stochastic Environmental Research and Risk Assessment 27 (5): 1193–1205.
FernΓ‘ndez-Delgado, Manuel, Eva Cernadas, SenΓ©n Barro, and Dinani Amorim. 2014. β€œDo We Need Hundreds of Classifiers to Solve Real World Classification Problems?” Journal of Machine Learning Research 15 (1): 3133–81.
Friedman, Jerome H. 2001. β€œGreedy Function Approximation: A Gradient Boosting Machine.” The Annals of Statistics 29 (5): 1189–1232.
β€”β€”β€”. 2002. β€œStochastic Gradient Boosting.” Computational Statistics & Data Analysis, Nonlinear Methods and Data Mining, 38 (4): 367–78.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2000. β€œAdditive Logistic Regression: A Statistical View of Boosting (With Discussion and a Rejoinder by the Authors).” The Annals of Statistics 28 (2): 337–407.
Gall, J., and V. Lempitsky. 2013. β€œClass-Specific Hough Forests for Object Detection.” In Decision Forests for Computer Vision and Medical Image Analysis, edited by A. Criminisi and J. Shotton, 143–57. Advances in Computer Vision and Pattern Recognition. Springer London.
Gammerman, Alexander, and Vladimir Vovk. 2007. β€œHedging Predictions in Machine Learning.” The Computer Journal 50 (2): 151–63.
Hazimeh, Hussein, Natalia Ponomareva, Petros Mol, Zhenyu Tan, and Rahul Mazumder. 2020. β€œThe Tree Ensemble Layer: Differentiability Meets Conditional Computation,” February.
Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. β€œDistilling the Knowledge in a Neural Network.” arXiv:1503.02531 [Cs, Stat], March.
Iranzad, Reza, Xiao Liu, W. Art Chaovalitwongse, Daniel S. Hippe, Shouyi Wang, Jie Han, Phawis Thammasorn, Chunyan Duan, Jing Zeng, and Stephen R. Bowen. 2021. β€œBoost-S: Gradient Boosted Trees for Spatial Data and Its Application to FDG-PET Imaging Data,” January.
Johnson, R., and Tong Zhang. 2014. β€œLearning Nonlinear Functions Using Regularized Greedy Forest.” IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (5): 942–54.
Kapelner, Adam, and Justin Bleich. 2016. β€œbartMachine: Machine Learning with Bayesian Additive Regression Trees.” Journal of Statistical Software 70 (4).
Lakshminarayanan, Balaji, Daniel M Roy, and Yee Whye Teh. 2014. β€œMondrian Forests: Efficient Online Random Forests.” In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 3140–48. Curran Associates, Inc.
Le, Tri M., and Bertrand S. Clarke. 2022. β€œModel Averaging Is Asymptotically Better Than Model Selection For Prediction.” Journal of Machine Learning Research 23 (33): 1–53.
Mason, Llew, Jonathan Baxter, Peter Bartlett, and Marcus Frean. 1999. β€œBoosting Algorithms as Gradient Descent.” In Proceedings of the 12th International Conference on Neural Information Processing Systems, 512–18. NIPS’99. Denver, CO: MIT Press.
Meinshausen, Nicolai. 2006. β€œQuantile Regression Forests.” Journal of Machine Learning Research 7 (35): 983–99.
Mentch, Lucas, and Siyu Zhou. 2020. β€œRandomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success.” Journal of Machine Learning Research 21 (171): 1–36.
Minka, Thomas P. 2002. β€œBayesian Model Averaging Is Not Model Combination.”
Nock, Richard, and Wilko Henecka. 2020. β€œBoosted and Differentially Private Ensembles of Decision Trees.” arXiv:2001.09384 [Cs, Stat], February.
Rahimi, Ali, and Benjamin Recht. 2009. β€œWeighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems, 1313–20. Curran Associates, Inc.
Ramchandran, Maya, and Rajarshi Mukherjee. 2021. β€œOn Ensembling Vs Merging: Least Squares and Random Forests Under Covariate Shift.” arXiv:2106.02589 [Math, Stat], June.
Schapire, Robert E., Yoav Freund, Peter Bartlett, and Wee Sun Lee. 1998. β€œBoosting the Margin: A New Explanation for the Effectiveness of Voting Methods.” The Annals of Statistics 26 (5): 1651–86.
Scornet, Erwan. 2014. β€œOn the Asymptotics of Random Forests.” arXiv:1409.2090 [Math, Stat], September.
Scornet, Erwan, GΓ©rard Biau, and Jean-Philippe Vert. 2014. β€œConsistency of Random Forests.” arXiv:1405.2881 [Math, Stat], May.
Shotton, Jamie, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. 2013. β€œDecision Jungles: Compact and Rich Models for Classification.” In NIPS.
Shwartz-Ziv, Ravid, and Amitai Armon. 2021. β€œTabular Data: Deep Learning Is Not All You Need.” arXiv:2106.03253 [Cs], June.
Zhang, Huan, Si Si, and Cho-Jui Hsieh. 2017. β€œGPU-Acceleration for Large-Scale Tree Boosting.” arXiv:1706.08359 [Cs, Stat], June.

No comments yet. Why not leave one?

GitHub-flavored Markdown & a sane subset of HTML is supported.