ML, LG, ST | Oct 2, 2019

Models under which random forests perform badly; consequences for applications

arXiv:1910.00943v7
Originality: Synthesis-oriented
AI Analysis

The paper addresses failure modes of random forests that matter to practitioners and offers a method for making predictions more reliable in applications.

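The paper identifies data-generating models under which Breiman's random forest may converge extremely slowly to the optimal predictor or even fail to be consistent. It then proposes using statistics of 'variable use' and 'variable importance' to build a better predictor: a 'many-armed' random forest that forces initial splits on variables the default algorithm tends to ignore.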

We give examples of data-generating models under which Breiman's random forest may be extremely slow to converge to the optimal predictor or even fail to be consistent. The evidence provided for these properties is based on mostly intuitive arguments, similar to those used earlier with simpler examples, and on numerical experiments. Although one can always choose models under which random forests perform very badly, we show that simple methods based on statistics of 'variable use' and 'variable importance' can often be used to construct a much better predictor based on a 'many-armed' random forest obtained by forcing initial splits on variables which the default version of the algorithm tends to ignore.
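
The idea of forcing an initial split can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy model y = sign(x1 - 0.5) * (x2 - 0.5), the grid design, and the helper names `sse` and `best_split_gain`. It uses the standard CART variance-reduction criterion for a single split rather than a full forest. Under this toy model neither variable reduces the variance at the root, so a greedy split has no reason to prefer x1; once a split on x1 is forced, x2 becomes highly informative within each arm.

```python
# Toy illustration (hypothetical model, not one of the paper's examples):
# a greedy variance-reduction split sees no signal at the root, but a
# forced initial split on x1 exposes a strong signal in x2.

def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split_gain(xs, ys):
    """Largest SSE reduction achievable by one threshold split on xs."""
    pairs = sorted(zip(xs, ys))
    total = sse(ys)
    best = 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # thresholds only fall between distinct x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        best = max(best, total - sse(left) - sse(right))
    return best

# Regular 10 x 10 grid on the unit square (assumed design).
grid = [(i + 0.5) / 10 for i in range(10)]
data = [(x1, x2) for x1 in grid for x2 in grid]
y = [(1.0 if x1 >= 0.5 else -1.0) * (x2 - 0.5) for x1, x2 in data]

x1_all = [p[0] for p in data]
x2_all = [p[1] for p in data]

# Gains available to a default (unforced) root split.
gain_root_x1 = best_split_gain(x1_all, y)
gain_root_x2 = best_split_gain(x2_all, y)

# One arm of a forced initial split on x1 at 0.5.
arm = [(p, t) for p, t in zip(data, y) if p[0] >= 0.5]
gain_arm_x2 = best_split_gain([p[1] for p, _ in arm], [t for _, t in arm])

print("root gain on x1:", gain_root_x1)
print("root gain on x2:", gain_root_x2)
print("gain on x2 inside the arm x1 >= 0.5:", gain_arm_x2)
```

In this sketch both root gains are zero up to rounding error, while the gain on x2 inside a single arm is large. A 'many-armed' predictor in this spirit would fit a separate forest within each arm of the forced split; how the paper selects the variables to force from the 'variable use' and 'variable importance' statistics is not reproduced here.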

