Doubling down on ensemble methods: mixing predictions from many weak learners (here, decision trees) to get strong learners. Boosting, bagging, and other weak-learner ensembles.

There are many flavours of random-forest-like learning systems. The rule of thumb seems to be “Fast to train, fast to use. Gets you results. May not get you answers.” In that regard they resemble neural networks, but from the previous hype cycle.
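Bagging, the simplest of these ensembles, is easy to sketch: resample the data with replacement, fit a weak learner to each resample, and average the predictions. A minimal pure-Python illustration on hypothetical toy data, using depth-1 regression trees ("stumps") as the weak learners:

```python
# Minimal bagging sketch: average depth-1 regression trees ("stumps"),
# each fit to a bootstrap resample of the data. Toy data, illustrative only.
import random

def fit_stump(xs, ys):
    """Least-squares stump: constant prediction on each side of a threshold."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # degenerate resample (all xs equal): constant model
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bag(xs, ys, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]  # bootstrap resample
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    # the bagged predictor averages its members' votes
    return lambda x: sum(tree(x) for tree in trees) / len(trees)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.9, 4.1, 4.8]  # roughly linear toy data
model = bag(xs, ys)
```

A random forest adds one more ingredient on top of this: each tree also sees only a random subset of the input features at each split.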

Reasons for popularity:

- Decision trees can easily be applied to just about any tabular data, so input preprocessing can be minimal
- These methods are in a certain sense self-regularising, so you can skip (certain) hyperparameter tuning
- Tractable asymptotic performance analysis is apparently available for some variants

Related: model averaging, neural ensembles, dropout, bootstrap.

## Random trees, forests, jungles

- Awesome Random Forests
- How to do machine vision using random forests, brought to you by the folks behind Kinect.
- Generalized random forests (Athey, Tibshirani, and Wager 2019) (implementation) are a mild generalisation.

## Self-regularising properties

Jeremy Kun: Why Boosting Doesn’t Overfit:

Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t over-fit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty. […] this phenomenon is a fact about voting schemes, not boosting in particular.
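The voting-scheme story can be poked at concretely. Below is a pure-Python AdaBoost sketch on hypothetical toy data; each round adds one weighted weak-learner vote, exactly as in the quote. It also tracks the classical exponential bound on training error, which shrinks in every round where a stump beats coin-flipping:

```python
# AdaBoost with decision stumps on a 1-D toy problem (illustrative sketch).
# Tracks the classical exponential bound on training error:
#   train_err <= prod_t 2 * sqrt(eps_t * (1 - eps_t)).
import math

X = [0, 1, 2, 3, 4, 5]
Y = [1, 1, -1, -1, 1, 1]  # not separable by any single stump

def stumps():
    """All threshold classifiers h(x) = s if x <= t else -s."""
    for t in [-0.5, 0.5, 1.5, 2.5, 3.5, 4.5]:
        for s in (1, -1):
            yield lambda x, t=t, s=s: s if x <= t else -s

def adaboost(rounds=20):
    w = [1.0 / len(X)] * len(X)  # example weights
    ensemble, bound = [], 1.0
    for _ in range(rounds):
        # pick the stump with the smallest weighted training error
        h, eps = min(
            ((h, sum(wi for wi, x, y in zip(w, X, Y) if h(x) != y)) for h in stumps()),
            key=lambda pair: pair[1],
        )
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # keep the log well-defined
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h))
        bound *= 2 * math.sqrt(eps * (1 - eps))
        # upweight mistakes, downweight correct examples, renormalize
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, X, Y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble, bound

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

ensemble, bound = adaboost()
train_err = sum(predict(ensemble, x) != y for x, y in zip(X, Y)) / len(X)
```

What a six-point sketch cannot show is the interesting part of the quote: the margin-growth behaviour of the vote that keeps generalization error falling after training error hits zero.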


## Gradient boosting

The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function (Breiman 1997). Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman (J. H. Friedman 2001, 2002), simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean (Mason et al. 1999). The latter two papers introduced the view of boosting algorithms as iterative *functional gradient descent* algorithms: that is, algorithms that optimize a cost function over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
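A minimal sketch of this functional gradient descent, for squared-error loss on hypothetical toy data: each round fits a depth-1 tree to the residuals, which are exactly the negative gradient of the loss with respect to the current predictions.

```python
# Gradient boosting sketch: each round fits a depth-1 regression tree
# ("stump") to the current residuals -- the negative gradient of the
# squared-error loss -- and adds it with a shrinkage factor. Illustrative only.

def fit_stump(xs, rs):
    """Least-squares stump fit to residuals rs."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, rounds=50, lr=0.1):
    f0 = sum(ys) / len(ys)  # start from the best constant model
    stumps = []
    for _ in range(rounds):
        pred = [f0 + lr * sum(s(x) for s in stumps) for x in xs]
        resid = [y - p for y, p in zip(ys, pred)]  # negative gradient of 0.5*(y-F)^2
        stumps.append(fit_stump(xs, resid))
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 2 for x in xs]  # toy target
model = gradient_boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Swapping the loss swaps the gradient: for absolute error the stumps would be fit to sign residuals, which is what makes the functional view generalize beyond regression.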

## Bayes

Bayesian Additive Regression Trees (Chipman, George, and McCulloch 2010) are wildly popular and successful in machine learning competitions. Kenneth Tay introduces them well.

## Implementations

### LightGBM

**LightGBM** is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:

- Faster training speed and higher efficiency.
- Lower memory usage.
- Better accuracy.
- Support of parallel, distributed, and GPU learning.
- Capable of handling large-scale data.

### xgboost

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs in major distributed environments (Hadoop, SGE, MPI) and can scale beyond billions of examples.
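The usual workflow is to hand XGBoost a parameter dictionary. A hypothetical starting configuration — the parameter names are real XGBoost parameters, but the values are illustrative, not recommendations:

```python
# Illustrative XGBoost parameter sketch, e.g. for xgboost.train(params, dtrain).
# Values here are hypothetical starting points, not tuned recommendations.
params = {
    "objective": "reg:squarederror",  # squared-error regression
    "max_depth": 6,                   # caps per-tree complexity
    "eta": 0.1,                       # learning rate (shrinkage per boosting round)
    "subsample": 0.8,                 # stochastic boosting: row subsampling per round
    "colsample_bytree": 0.8,          # feature subsampling per tree
    "lambda": 1.0,                    # L2 regularisation on leaf weights
}
```

Note the explicit regularisers (`eta`, `subsample`, `lambda`): in practice these implementations are less "self-regularising" than plain random forests and reward tuning.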

See also chengsoonong/xgboost-tuner: a library for automatically tuning XGBoost parameters.

### catboost

### surfin

This R package computes uncertainty for random forest predictions using a fast implementation of random forests in C++. This is an exciting time for research into the theoretical properties of random forests. This R package aims to provide all state-of-the-art variance estimates in one place, to expedite research in this area and make it easier for practitioners to compare estimates.

Two variance estimates are provided: U-statistics based (Mentch & Hooker, 2016) and infinitesimal jackknife on bootstrap samples (Wager, Hastie, Efron, 2014), the latter as a wrapper to the authors’ R code randomForestCI.

More variance estimates coming soon: (1) bootstrap-of-little-bags (Sexton and Laake 2009); (2) infinitesimal jackknife on subsamples (Wager & Athey, 2017; Athey, Tibshirani, Wager, 2016) as a wrapper to the authors’ R package grf.

### bartmachine

We present a new package in R implementing Bayesian additive regression trees (BART). The package introduces many new features for data analysis using BART, such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data, and the ability to save trees for future prediction. It is significantly faster than the current R implementation, parallelized, and capable of handling both large sample sizes and high-dimensional data.

## References

*arXiv:2001.11704 [Cs, Stat]*, May.

*arXiv:2110.11216 [Cs, Math, Stat]*, October.

*Annals of Statistics* 47 (2): 1148–78.

*arXiv:1902.07409 [Stat]*, February.

*arXiv:1606.05241 [Stat]*, June.

*arXiv:1507.05181 [Cs, Stat]*, July.

*Test* 15 (2): 271–344.

*Machine Learning* 24 (2): 123–40.

*Statistics for High-Dimensional Data: Methods, Theory and Applications*. 2011 edition. Heidelberg; New York: Springer.

*The Annals of Applied Statistics* 4 (1): 266–98.

*The Journal of Machine Learning Research* 4: 683–712.

*Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning*. Vol. 7.

*Electronic Journal of Statistics* 14 (2): 3644–71.

*Stochastic Environmental Research and Risk Assessment* 27 (5): 1193–1205.

*Stats* 4 (4): 1091–1115.

*Journal of Machine Learning Research* 15 (1): 3133–81.

*The Annals of Statistics* 29 (5): 1189–1232.

*Computational Statistics & Data Analysis*, Nonlinear Methods and Data Mining, 38 (4): 367–78.

*The Annals of Statistics* 28 (2): 337–407.

*Decision Forests for Computer Vision and Medical Image Analysis*, edited by A. Criminisi and J. Shotton, 143–57. Advances in Computer Vision and Pattern Recognition. Springer London.

*The Computer Journal* 50 (2): 151–63.

*arXiv:1503.02531 [Cs, Stat]*, March.

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 36 (5): 942–54.

*Journal of Statistical Software* 70 (4).

*Advances in Neural Information Processing Systems 27*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 3140–48. Curran Associates, Inc.

*Journal of Machine Learning Research* 23 (33): 1–53.

*Proceedings of the 12th International Conference on Neural Information Processing Systems*, 512–18. NIPS’99. Denver, CO: MIT Press.

*Journal of Machine Learning Research* 7 (35): 983–99.

*Journal of Machine Learning Research* 21 (171): 1–36.

*arXiv:2001.09384 [Cs, Stat]*, February.

*Advances in Neural Information Processing Systems*, 1313–20. Curran Associates, Inc.

*arXiv:2106.02589 [Math, Stat]*, June.

*The Annals of Statistics* 26 (5): 1651–86.

*arXiv:1409.2090 [Math, Stat]*, September.

*arXiv:1405.2881 [Math, Stat]*, May.

*NIPS*.

*arXiv:2106.03253 [Cs]*, June.

*arXiv:1706.08359 [Cs, Stat]*, June.
