ME, AI, LG, ML · Jul 4, 2023

Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability

Berkeley
arXiv:2307.01932v2 · 11 citations · h-index: 11
Originality: Incremental advance
AI Analysis

This work addresses interpretability and accuracy issues in machine learning for domains such as drug response prediction and breast cancer subtyping. It is rated incremental because it builds on existing methods.

The paper addresses two limitations of random forests: as black-box models they rely on feature importance measures that are unstable, and they can perform poorly on smooth or additive structure. It develops RF+, a framework that combines random forests with generalized linear models and improves prediction accuracy over RFs, along with a feature importance method (MDI+) that often improves on its closest competitor by more than 10% in identifying signal features.

Random forests (RFs) are among the most popular supervised learning algorithms due to their nonlinear flexibility and ease-of-use. However, as black box models, they can only be interpreted via algorithmically-defined feature importance methods, such as Mean Decrease in Impurity (MDI), which have been observed to be highly unstable and have ambiguous scientific meaning. Furthermore, they can perform poorly in the presence of smooth or additive structure. To address this, we reinterpret decision trees and MDI as linear regression and $R^2$ values, respectively, with respect to engineered features associated with the tree's decision splits. This allows us to combine the respective strengths of RFs and generalized linear models in a framework called RF+, which also yields an improved feature importance method we call MDI+. Through extensive data-inspired simulations and real-world datasets, we show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features, often yielding more than a 10% improvement over its closest competitor. In case studies on drug response prediction and breast cancer subtyping, we further show that MDI+ extracts well-established genes with significantly greater stability compared to existing feature importance measures.
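The reinterpretation described in the abstract can be checked numerically on a single tree. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn's `DecisionTreeRegressor`, synthetic data, a hypothetical helper `stump_features`, and one particular normalization of the per-split engineered features (the abstract does not specify one). It shows that ordinary least squares on an intercept plus these features reproduces the tree's training predictions, and that per-feature sums of $R^2$ contributions, once normalized, match scikit-learn's MDI.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def stump_features(model, X):
    """Hypothetical helper: one engineered feature per internal node (split).

    Assumed normalization: samples sent left get n_right / sqrt(n_left * n_right),
    samples sent right get -n_left / sqrt(n_left * n_right), all others get 0.
    """
    t = model.tree_
    # path[i, j] is True when sample i passes through node j.
    path = model.decision_path(X).toarray().astype(bool)
    cols, split_vars = [], []
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf: no split, so no stump feature
            continue
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        psi = np.zeros(X.shape[0])
        psi[path[:, left]] = n_r
        psi[path[:, right]] = -n_l
        cols.append(psi / np.sqrt(n_l * n_r))
        split_vars.append(t.feature[node])
    return np.column_stack(cols), np.array(split_vars)


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 2.0 * (X[:, 1] > 0) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
Psi, split_vars = stump_features(tree, X)

# 1) Tree as linear regression: OLS on [1, Psi] reproduces the tree's fit.
A = np.column_stack([np.ones(len(y)), Psi])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
max_gap = np.max(np.abs(A @ beta - tree.predict(X)))

# 2) MDI as R^2: the stump features are mutually orthogonal on the training
#    data, so each one explains a separate share of the variance of y.
n = len(y)
r2_per_stump = (Psi.T @ y) ** 2 / ((Psi ** 2).sum(axis=0) * n * y.var())
r2_per_feature = np.bincount(split_vars, weights=r2_per_stump, minlength=X.shape[1])
mdi_from_r2 = r2_per_feature / r2_per_feature.sum()  # sklearn normalizes MDI to sum to 1

print("max |OLS fit - tree fit|:", max_gap)
print("MDI from R^2:", np.round(mdi_from_r2, 4))
print("sklearn MDI: ", np.round(tree.feature_importances_, 4))
```

Because the engineered features are ordinary regression covariates, the least-squares step can in principle be swapped for a regularized or generalized linear model, which is the direction the abstract describes for RF+. The sketch stops at the single-tree identity and does not attempt to reproduce RF+ or MDI+.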

Code Implementations: 2 repos
