ML, LG | May 16, 2017

To tune or not to tune the number of trees in random forest?

arXiv:1705.05654v1 | 14.5 | 447 citations | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a practical tuning problem for users of random forests in supervised learning, but the contribution is incremental: it refines the existing understanding of bagging principles.

The paper investigates whether the number of trees in random forests should be tuned or set to a large value, finding that the error rate can be non-monotonic in some cases, based on theoretical analysis and application to 306 datasets from OpenML.

The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that "more trees are better", in practice the classification error rate sometimes reaches a minimum before increasing again for increasing number of trees. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally arguing in favor of setting it to a computationally feasible large number, depending on convergence properties of the desired performance measure.
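Points (i) and (ii) can be illustrated with a small numerical sketch. It is not the paper's code, and it rests on a simplifying assumption: for a fixed observation, each tree votes for the correct class independently with the same probability, so the number of correct votes among T trees is binomial. The function names, the two-group mixture, and the probabilities 0.7 and 0.45 are hypothetical values chosen only for illustration. Under these assumptions, the majority-vote error averaged over "easy" and "hard" observations first falls and then rises with T, while the expected Brier-type score (squared distance between the vote fraction and the true class indicator) decreases steadily.

```python
from math import comb

# Hypothetical mixture: (weight, probability that a single tree is correct).
# Half the observations are "easy" (p = 0.7), half are "hard" (p = 0.45).
OBSERVATIONS = [(0.5, 0.7), (0.5, 0.45)]


def vote_error(p_correct, n_trees):
    """Probability that the majority vote of n_trees independent trees is
    wrong for one observation; ties are broken by a fair coin."""
    err = 0.0
    for k in range(n_trees + 1):  # k = number of trees voting correctly
        prob = comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        if 2 * k < n_trees:
            err += prob
        elif 2 * k == n_trees:
            err += 0.5 * prob
    return err


def brier(p_correct, n_trees):
    """Expected squared error of the vote fraction k/T as an estimate of the
    true class indicator: E[(1 - k/T)^2] = (1 - p)^2 + p(1 - p)/T."""
    return (1 - p_correct) ** 2 + p_correct * (1 - p_correct) / n_trees


def mean_error(n_trees):
    """Majority-vote error rate averaged over the mixture of observations."""
    return sum(w * vote_error(p, n_trees) for w, p in OBSERVATIONS)


def mean_brier(n_trees):
    """Expected Brier-type score averaged over the mixture of observations."""
    return sum(w * brier(p, n_trees) for w, p in OBSERVATIONS)


if __name__ == "__main__":
    # Odd T only, so that ties never occur.
    for t in (1, 3, 11, 51, 501):
        print(f"T={t:4d}  error={mean_error(t):.4f}  brier={mean_brier(t):.4f}")
```

In this toy setting the error of the easy observations drops quickly toward 0, while the error of the hard observations (where a single tree is wrong more often than right) creeps toward 1, so their average has an interior minimum. The Brier-type score has no such minimum because its only T-dependent term, p(1 - p)/T, shrinks for every observation.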
