Making Sense of Random Forest Probabilities: a Kernel Perspective
This paper addresses the problem of unreliable probability estimates in random forests, a concern for machine learning practitioners, though its contribution is incremental because it builds on existing kernel methods.
The paper tackles the unprincipled way in which random forests estimate probabilities by connecting random forests to kernel regression. This places probability estimation on more sound statistical footing, and the paper also provides tuning recommendations.
A random forest is a popular tool for estimating probabilities in machine learning classification tasks. However, the means by which this is accomplished is unprincipled: one simply counts the fraction of trees in a forest that vote for a certain class. In this paper, we forge a connection between random forests and kernel regression. This places random forest probability estimation on more sound statistical footing. As part of our investigation, we develop a model for the proximity kernel and relate it to the geometry and sparsity of the estimation problem. We also provide intuition and recommendations for tuning a random forest to improve its probability estimates.
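To make the two estimators mentioned above concrete, the following is a minimal sketch, not the paper's model. It assumes toy one-dimensional "trees" that are just sorted random split points on [0, 1), binary labels, and a proximity kernel defined as the fraction of trees in which two points share a leaf. All function names (`build_tree`, `leaf_index`, `vote_fraction`, `proximity`, `kernel_estimate`) are hypothetical and chosen for illustration only.

```python
import bisect
import random


def build_tree(rng, n_splits):
    """Toy 'tree' (an assumption): sorted random split points cutting [0, 1) into leaves."""
    return sorted(rng.random() for _ in range(n_splits))


def leaf_index(tree, x):
    """Index of the leaf (cell between split points) that contains x."""
    return bisect.bisect_right(tree, x)


def vote_fraction(forest, X, y, x):
    """Fraction of trees whose leaf-majority vote at x is class 1.

    Trees whose leaf at x holds no training points abstain; ties vote for class 0.
    """
    votes = 0
    counted = 0
    for tree in forest:
        leaf = leaf_index(tree, x)
        labels = [yi for xi, yi in zip(X, y) if leaf_index(tree, xi) == leaf]
        if not labels:
            continue
        counted += 1
        if 2 * sum(labels) > len(labels):
            votes += 1
    return votes / counted if counted else 0.5


def proximity(forest, a, b):
    """Proximity kernel: fraction of trees in which a and b fall in the same leaf."""
    same = sum(leaf_index(tree, a) == leaf_index(tree, b) for tree in forest)
    return same / len(forest)


def kernel_estimate(forest, X, y, x):
    """Kernel-regression (Nadaraya-Watson style) estimate of P(y = 1 | x)."""
    weights = [proximity(forest, x, xi) for xi in X]
    total = sum(weights)
    if total == 0:
        return 0.5  # no training point ever shares a leaf with x
    return sum(w * yi for w, yi in zip(weights, y)) / total


# Usage on made-up data: class 0 on the left, class 1 on the right.
rng = random.Random(0)
forest = [build_tree(rng, 3) for _ in range(200)]
X = [0.10, 0.20, 0.30, 0.70, 0.80, 0.90]
y = [0, 0, 0, 1, 1, 1]
for query in (0.15, 0.50, 0.85):
    print(query, vote_fraction(forest, X, y, query), kernel_estimate(forest, X, y, query))
```

In this sketch the vote fraction depends only on hard per-tree decisions, while the kernel estimate weights each training label by how often its point shares a leaf with the query. That weighting is the sense in which a forest can be read as a kernel regressor.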