MLLGApr 17, 2025

When do Random Forests work?

arXiv:2504.12860v11 citationsh-index: 29
Originality Incremental advance
AI Analysis

This work provides insights into when random forests perform well, which is important for practitioners in machine learning and statistics, though it is incremental in refining existing theoretical understanding.

The paper investigates the effectiveness of randomization in random forests, finding that it reduces variance in high signal-to-noise ratio (SNR) scenarios and outperforms bagging in low SNR cases where variance dominates bias, while also showing that randomization can increase bias with fat-tailed covariates and be ineffective with irrelevant covariates.

We study the effectiveness of randomizing split-directions in random forests. Prior literature has shown that, on the one hand, randomization can reduce variance through decorrelation, and, on the other hand, randomization regularizes and works in low signal-to-noise ratio (SNR) environments. First, we bring together and revisit decorrelation and regularization by presenting a systematic analysis of out-of-sample mean-squared error (MSE) for different SNR scenarios based on commonly-used data-generating processes. We find that variance reduction tends to increase with the SNR and forests outperform bagging when the SNR is low because, in low SNR cases, variance dominates bias for both methods. Second, we show that the effectiveness of randomization is a question that goes beyond the SNR. We present a simulation study with fixed and moderate SNR, in which we examine the effectiveness of randomization for other data characteristics. In particular, we find that (i) randomization can increase bias in the presence of fat tails in the distribution of covariates; (ii) in the presence of irrelevant covariates randomization is ineffective because bias dominates variance; and (iii) when covariates are mutually correlated randomization tends to be effective because variance dominates bias. Beyond randomization, we find that, for both bagging and random forests, bias can be significantly reduced in the presence of correlated covariates. This last finding goes beyond the prevailing view that averaging mostly works by variance reduction. Given that in practice covariates are often correlated, our findings on correlated covariates could open the way for a better understanding of why random forests work well in many applications.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes