LG ME Oct 10, 2014

Approximate False Positive Rate Control in Selection Frequency for Random Forest

arXiv:1410.2838v1 3 citations
Originality Highly original
AI Analysis

The paper offers a principled alternative for researchers in fields like neuroimaging and bioinformatics who rely on Random Forest for feature selection but currently set thresholds heuristically.

The paper tackles the lack of false positive rate control in Random Forest feature selection. It develops an approximate probabilistic model that estimates the false positive rate for a given threshold on selection frequency, so thresholds can be chosen in a principled way at no extra computational cost. Experiments on synthetic data show that the approach keeps false positive rates on the order of the desired levels while maintaining low false negative rates, even when features have a complex correlation structure.

Random Forest has become one of the most popular tools for feature selection. Its ability to deal with high-dimensional data makes this algorithm especially useful for studies in neuroimaging and bioinformatics. Despite its popularity and wide use, feature selection in Random Forest still lacks a crucial ingredient: false positive rate control. To date there is no efficient, principled and computationally lightweight solution to this shortcoming. As a result, researchers using Random Forest for feature selection have to resort to heuristically set thresholds on feature rankings. This article builds an approximate probabilistic model for the feature selection process in Random Forest training, which allows us to compute an estimated false positive rate for a given threshold on selection frequency. Hence, it presents a principled way to determine thresholds for the selection of relevant features without any additional computational load. Experimental analysis with synthetic data demonstrates that the proposed approach can keep false positive rates on the order of the desired values and keep false negative rates low. Results show that this holds even in the presence of a complex correlation structure between features. Its good statistical properties and lightweight computational needs make this approach applicable to feature selection in a wide range of applications.
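The abstract does not spell out the probabilistic model, so the following is only a minimal sketch of the general idea (turning a target false positive rate into a threshold on selection frequency) under a much simpler null model that is an illustrative assumption, not the paper's. It takes selection frequency to mean the number of split nodes in the forest that use a given feature, and assumes that when no feature is informative, each node picks one of the features uniformly at random and independently of other nodes, so the count for an irrelevant feature is Binomial(n_nodes, 1/n_features). The function names and example numbers are hypothetical.

```python
import math


def null_tail_probabilities(n_nodes, q):
    """Return tail[t] = P(X >= t) for X ~ Binomial(n_nodes, q), t = 0..n_nodes + 1.

    Assumes 0 < q < 1. The pmf is evaluated in log space to avoid overflow.
    """
    log_q, log_1q = math.log(q), math.log1p(-q)
    pmf = []
    for k in range(n_nodes + 1):
        log_choose = (math.lgamma(n_nodes + 1) - math.lgamma(k + 1)
                      - math.lgamma(n_nodes - k + 1))
        pmf.append(math.exp(log_choose + k * log_q + (n_nodes - k) * log_1q))
    # Accumulate from the top so that tail[t] sums pmf[t], ..., pmf[n_nodes].
    tail = [0.0] * (n_nodes + 2)
    for k in range(n_nodes, -1, -1):
        tail[k] = tail[k + 1] + pmf[k]
    return tail


def selection_threshold(n_nodes, n_features, alpha):
    """Smallest count t such that an irrelevant feature reaches t with
    probability at most alpha under the simplified uniform null model."""
    # ASSUMPTION (not from the paper): uniform, independent choice per node.
    q = 1.0 / n_features
    tail = null_tail_probabilities(n_nodes, q)
    return next(t for t in range(len(tail)) if tail[t] <= alpha)


# Hypothetical numbers: a forest with 1000 split nodes over 100 features.
t = selection_threshold(n_nodes=1000, n_features=100, alpha=0.05)
print("select features used in at least", t, "split nodes")
```

Features whose observed selection count reaches the returned threshold would be kept. Because the threshold comes from a closed-form tail probability, it needs no permutation runs or retraining, which mirrors the "no additional computational load" property claimed in the abstract; the paper's own model should be consulted for how correlated features and tree structure are handled.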

Code Implementations: 1 repo