Ensemble methods: mixing predictions from many weak learners to get strong learners.
The rule of thumb seems to be “Fast to train, fast to use. Gets you results. May not get you answers.” So, like neural networks but from the previous hype cycle.
Questions
In a different context, I’ve run into the general ensemble method of model averaging. How does that relate to boosting/bagging algorithms? How do random tree methods relate to dropout? To the bootstrap?
Random trees, forests, jungles
- Awesome Random Forests
- How to do machine vision using random forests, brought to you by the folks behind Kinect.
- generalized random forests (Athey, Tibshirani, and Wager 2019) (implementations) are a mild generalisation.
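For reference, the vanilla version of the idea is a few lines in scikit-learn; this is my own toy sketch, not taken from any of the implementations linked above:

```python
# A minimal random forest: bagging over trees that also subsample
# features at each split. Out-of-bag samples give a free estimate
# of generalisation error, closely related to the bootstrap.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of bootstrap-resampled trees
    max_features="sqrt",  # random feature subset per split
    oob_score=True,       # score each point with trees that never saw it
    random_state=0,
).fit(X, y)

oob = forest.oob_score_   # out-of-bag accuracy estimate
```

The `oob_score_` attribute is the part that connects back to the bootstrap question above: each tree's out-of-bag points act as a held-out set without an explicit train/test split.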
Self-regularising properties
Jeremy Kun: Why Boosting Doesn’t Overfit:
Boosting, which we covered in gruesome detail previously, has a natural measure of complexity represented by the number of rounds you run the algorithm for. Each round adds one additional “weak learner” weighted vote. So running for a thousand rounds gives a vote of a thousand weak learners. Despite this, boosting doesn’t over-fit on many datasets. In fact, and this is a shocking fact, researchers observed that Boosting would hit zero training error, they kept running it for more rounds, and the generalization error kept going down! It seemed like the complexity could grow arbitrarily without penalty. […] this phenomenon is a fact about voting schemes, not boosting in particular.
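The phenomenon is easy to poke at empirically. A hedged sketch using scikit-learn's AdaBoost (whose default weak learner is a depth-1 decision stump), comparing a few rounds against many; the dataset and round counts are arbitrary illustrations:

```python
# Train error should keep shrinking as rounds are added, while test
# error does not blow up the way a naive complexity argument predicts.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, random_state=0)
Xtr, ytr, Xte, yte = X[:300], y[:300], X[300:], y[300:]

train_err, test_err = {}, {}
for rounds in (10, 200):
    clf = AdaBoostClassifier(n_estimators=rounds, random_state=0)
    clf.fit(Xtr, ytr)
    train_err[rounds] = 1 - clf.score(Xtr, ytr)
    test_err[rounds] = 1 - clf.score(Xte, yte)
```

On this toy problem the 200-round vote fits the training set at least as well as the 10-round one, and the test error stays in a sane range rather than exploding with the 20× larger hypothesis.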
🏗
Gradient boosting
The idea of gradient boosting originated in Leo Breiman’s observation that boosting can be interpreted as an optimization algorithm on a suitable cost function (Breiman 1997). Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman (J. H. Friedman 2001, 2002), simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean (Mason et al. 1999). The latter two papers introduced the view of boosting algorithms as iterative functional gradient descent algorithms: that is, algorithms that optimize a cost function over function space by iteratively choosing a function (a weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
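The functional-gradient view is easiest to see for squared loss, where the negative gradient at each data point is just the current residual, so each round fits a weak learner to the residuals. A minimal sketch of my own (scikit-learn regression trees as the weak learners; depth, learning rate and round count are arbitrary choices):

```python
# Functional gradient descent for L(F) = 1/2 * sum((y - F(x))^2):
# the pointwise negative gradient is the residual y - F(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

def fit_gradient_boost(X, y, n_rounds=50, lr=0.1):
    f0 = y.mean()                      # constant initial model
    F = np.full_like(y, f0)
    trees = []
    for _ in range(n_rounds):
        residual = y - F               # negative gradient at each point
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
        F += lr * tree.predict(X)      # small step in that direction
        trees.append(tree)
    return f0, trees

def predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)

f0, trees = fit_gradient_boost(X, y)
mse = float(np.mean((predict(f0, trees, X) - y) ** 2))
```

Swapping in a different loss changes only the residual line: you fit each weak learner to whatever the pointwise negative gradient of that loss happens to be.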
Bayes
Bayesian Additive Regression Trees (Chipman, George, and McCulloch 2010) are wildly popular and successful in machine learning competitions. Kenneth Tay does a good intro.
Implementations
xgboost
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs in major distributed environments (Hadoop, SGE, MPI) and can scale beyond billions of examples.
catboost, a gradient boosting library from Yandex, notable for its built-in handling of categorical features.
bartmachine
We present a new package in R implementing Bayesian additive regression trees (BART). The package introduces many new features for data analysis using BART such as variable selection, interaction detection, model diagnostic plots, incorporation of missing data and the ability to save trees for future prediction. It is significantly faster than the current R implementation, parallelized, and capable of handling both large sample sizes and high-dimensional data.