ML LG EM · Aug 17, 2020

To Bag is to Prune

arXiv:2008.07063v5 · 10 citations

AI Analysis

This paper addresses a fundamental puzzle in machine learning that matters to practitioners who use ensemble methods, and it offers a new theoretical explanation with practical implications for automatic tuning.

The paper tackles the paradox of Random Forests overfitting in-sample without harming out-of-sample performance, proposing that bootstrap aggregation and model perturbation automatically prune a latent 'true' tree, and empirically shows that overfitting ensembles perform similarly or better than tuned ones.

It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF blatantly overfits in-sample without any apparent consequence out-of-sample. Standard arguments, like the classic bias-variance trade-off or double descent, cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a latent "true" tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, novel variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles perform similarly to their tuned counterparts -- or better.
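The in-sample versus out-of-sample contrast described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's code or experimental setup: the sine-plus-noise data, the sample sizes, and every function name are assumptions chosen for illustration. It relies on the fact that, with a single feature and midpoint splits, a fully grown regression tree (one leaf per distinct point) predicts the response of the nearest training point, so no tree library is needed. Only bagging is shown; the feature-subsampling part of RF's model perturbation has no role with one feature.

```python
import bisect
import math
import random


def fit_full_tree(xs, ys):
    """Fully grown 1-D regression tree, stored as its sorted training points.

    With one leaf per distinct x and splits at midpoints, the tree's
    prediction equals the y of the nearest training x.
    """
    pts = sorted(set(zip(xs, ys)))  # bootstrap duplicates collapse to one leaf
    return [p[0] for p in pts], [p[1] for p in pts]


def predict_tree(tree, x):
    tx, ty = tree
    i = bisect.bisect_left(tx, x)
    if i == 0:
        return ty[0]
    if i == len(tx):
        return ty[-1]
    # Choose whichever neighbouring leaf is closer to x.
    return ty[i] if tx[i] - x < x - tx[i - 1] else ty[i - 1]


def fit_bagged(xs, ys, n_trees, rng):
    """Bootstrap aggregation: one fully grown tree per resample."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_full_tree([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_bagged(trees, x):
    return sum(predict_tree(t, x) for t in trees) / len(trees)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, noise_sd, rng):
    # Hypothetical data-generating process: sine signal plus Gaussian noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def run_demo(seed=0, n_train=200, n_test=500, noise_sd=0.5, n_trees=100):
    rng = random.Random(seed)
    x_tr, y_tr = make_data(n_train, noise_sd, rng)
    x_te, y_te = make_data(n_test, noise_sd, rng)
    tree = fit_full_tree(x_tr, y_tr)
    forest = fit_bagged(x_tr, y_tr, n_trees, rng)
    return {
        "tree_train": mse(lambda x: predict_tree(tree, x), x_tr, y_tr),
        "tree_test": mse(lambda x: predict_tree(tree, x), x_te, y_te),
        "forest_train": mse(lambda x: predict_bagged(forest, x), x_tr, y_tr),
        "forest_test": mse(lambda x: predict_bagged(forest, x), x_te, y_te),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this setup the single tree interpolates the training data exactly, and the bagged ensemble also fits the training data more closely than the noise level would justify, yet its test error is lower than the single tree's. No stopping rule or depth limit is tuned anywhere, which is the behaviour the abstract attributes to implicit pruning.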
