MLFeb 3
Principled Federated Random Forests for Heterogeneous DataRémi Khellaf, Erwan Scornet, Aurélien Bellet et al.
Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
MLSep 30, 2022
Sparse tree-based initialization for neural networksPatrick Lutz, Ludovic Arnould, Claire Boyer et al.
Dedicated neural network (NN) architectures have been designed to handle specific data types (such as CNN for images or RNN for text), which ranks them among state-of-the-art methods for dealing with these data. Unfortunately, no architecture has been found for dealing with tabular data yet, for which tree ensemble methods (tree boosting, random forests) usually show the best predictive performances. In this work, we propose a new sparse initialization technique for (potentially deep) multilayer perceptrons (MLP): we first train a tree-based procedure to detect feature interactions and use the resulting information to initialize the network, which is subsequently trained via standard stochastic gradient strategies. Numerical experiments on several tabular data sets show that this new, simple and easy-to-use method is a solid concurrent, both in terms of generalization capacity and computation time, to default MLP initialization and even to existing complex deep learning solutions. In fact, this wise MLP initialization raises the resulting NN methods to the level of a valid competitor to gradient boosting when dealing with tabular data. Besides, such initializations are able to preserve the sparsity of weights introduced in the first layers of the network through training. This fact suggests that this new initializer operates an implicit regularization during the NN training, and emphasizes that the first layers act as a sparse feature extractor (as for convolutional layers in CNN).
MLFeb 2
Privacy Amplification by Missing DataSimon Roburin, Rafaël Pinot, Erwan Scornet
Privacy preservation is a fundamental requirement in many high-stakes domains such as medicine and finance, where sensitive personal data must be analyzed without compromising individual confidentiality. At the same time, these applications often involve datasets with missing values due to non-response, data corruption, or deliberate anonymization. Missing data is traditionally viewed as a limitation because it reduces the information available to analysts and can degrade model performance. In this work, we take an alternative perspective and study missing data from a privacy preservation standpoint. Intuitively, when features are missing, less information is revealed about individuals, suggesting that missingness could inherently enhance privacy. We formalize this intuition by analyzing missing data as a privacy amplification mechanism within the framework of differential privacy. We show, for the first time, that incomplete data can yield privacy amplification for differentially private algorithms.
37.8MLMay 19
Increasing Missingness to Reduce Bias: Richardson-SGD with Missing DataFerdinand Genans, Erwan Scornet
Stochastic gradient methods are central to modern large-scale learning, but their use with incomplete covariates remains delicate since imputation schemes generally introduce systematic gradient biases, as shown for linear models. In this work, we prove that all parametric models exhibit similar gradient bias for various imputation procedures and characterize exactly the dependence on the missingness ratio vector $p$, with $O(\|p\|)$ as the leading term. We exploit this analysis to propose a simple debiasing procedure for stochastic gradient descent (SGD) with missing values based on Richardson extrapolation, which leverages the exact expression of the gradient bias. The key idea is to \emph{deliberately add missingness}: from an already incomplete observation, we generate a further-thinned version at a higher, controlled missingness level, and combine the two resulting stochastic gradients to cancel the leading bias term. We prove that one Richardson step reduces the gradient bias from $O(\|p\|)$ to $O(\|p\|^2)$ under several missingness scenarios. Our proposed method is computationally efficient, model-agnostic and applies to any parametric loss whose stochastic gradient can be computed after imputation. Furthermore, when missing indicators are independent, the population gradient bias is a multilinear polynomial in $p$ and depends only on population gradient errors induced by declaring a single coordinate missing. In this case, our method generalizes to a multi-step Richardson procedure which recursively cancels higher-order terms. Empirically, Richardson debiasing improves optimization and estimation across several generalized linear models and combines positively with widely used imputation procedures such as MICE. These results suggest that, somewhat counter-intuitively, adding controlled missingness on top of existing missing data can make stochastic learning from incomplete data more accurate.
MLFeb 26, 2021Code
MDA for random forests: inconsistency, and a practical solution via the Sobol-MDAClément Bénard, Sébastien da Veiga, Erwan Scornet
Variable importance measures are the main tools to analyze the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the definition of MDA varies across the main random forest software. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms, and then establish their limits when the sample size increases. This asymptotic analysis reveals that these MDA versions differ as importance measures, since they converge towards different quantities. More importantly, we break down these limits into three components: the first two terms are related to Sobol indices, which are well-defined measures of a covariate contribution to the response variance, widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with dependence within covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity to detect influential covariates in a dependent setting, a fact that has already been noticed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA, and consistently estimates the accuracy decrease of the forest retrained without a given covariate, but with an efficient computational cost. The Sobol-MDA empirically outperforms its competitors on both simulated and real data for variable selection. An open source implementation in R and C++ is available online.
MLFeb 6, 2024
Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variantsAbdoulaye Sakho, Emmanuel Malherbe, Erwan Scornet
Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bound on SMOTE density. From these results, we prove that SMOTE (with default parameter) tends to copy the original minority samples asymptotically. We confirm and illustrate empirically this first theoretical behavior on a real-world data-set.bFurthermore, we prove that SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE based on our theoretical findings to introduce two new variants. These strategies are compared on 13 tabular data sets with 10 state-of-the-art rebalancing procedures, including deep generative and diffusion models. One of our key findings is that, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performances, would it be with LightGBM, tuned random forests or logistic regression. However, when the imbalance ratio is artificially augmented, one of our two modifications of SMOTE leads to promising predictive performances compared to SMOTE and other state-of-the-art strategies.
LGMar 26, 2025
Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers ScoringAbdoulaye Sakho, Emmanuel Malherbe, Carl-Erik Gauthier et al.
This study investigates rare event detection on tabular data within binary classification. Standard techniques to handle class imbalance include SMOTE, which generates synthetic samples from the minority class. However, SMOTE is intrinsically designed for continuous input variables. In fact, despite SMOTE-NC-its default extension to handle mixed features (continuous and categorical variables)-very few works propose procedures to synthesize mixed features. On the other hand, many real-world classification tasks, such as in banking sector, deal with mixed features, which have a significant impact on predictive performances. To this purpose, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest. Empirically, contrary to SMOTE-NC, we show that MGS-GRF exhibits two important properties: (i) the coherence i.e. the ability to only generate combinations of categorical features that are already present in the original dataset and (ii) association, i.e. the ability to preserve the dependence between continuous and categorical features. We also evaluate the predictive performances of LightGBM classifiers trained on data sets, augmented with synthetic samples from various strategies. Our comparison is performed on simulated and public real-world data sets, as well as on a private data set from a leading financial institution. We observe that synthetic procedures that have the properties of coherence and association display better predictive performances in terms of various predictive metrics (PR and ROC AUC...), with MGS-GRF being the best one. Furthermore, our method exhibits promising results for the private banking application, with development pipeline being compliant with regulatory constraints.
MLJul 17, 2025
When Pattern-by-Pattern Works: Theoretical and Empirical Insights for Logistic Models with Missing ValuesChristophe Muller, Erwan Scornet, Julie Josse
Predicting a response with partially missing inputs remains a challenging task even in parametric models, since parameter estimation in itself is not sufficient to predict on partially observed inputs. Several works study prediction in linear models. In this paper, we focus on logistic models, which present their own difficulties. From a theoretical perspective, we prove that a Pattern-by-Pattern strategy (PbP), which learns one logistic model per missingness pattern, accurately approximates Bayes probabilities in various missing data scenarios (MCAR, MAR and MNAR). Empirically, we thoroughly compare various methods (constant and iterative imputations, complete case analysis, PbP, and an EM algorithm) across classification, probability estimation, calibration, and parameter inference. Our analysis provides a comprehensive view on the logistic regression with missing values. It reveals that mean imputation can be used as baseline for low sample sizes, and improved performance is obtained via nonlinear multiple iterative imputation techniques with the labels (MICE.RF.Y). For large sample sizes, PbP is the best method for Gaussian mixtures, and we recommend MICE.RF.Y in presence of nonlinear features.
MLJun 10, 2025
Asymptotic Normality of Infinite Centered Random Forests -Application to Imbalanced ClassificationMoria Mayala, Erwan Scornet, Charles Tillier et al.
Many classification tasks involve imbalanced data, in which a class is largely underrepresented. Several techniques consists in creating a rebalanced dataset on which a classifier is trained. In this paper, we study theoretically such a procedure, when the classifier is a Centered Random Forests (CRF). We establish a Central Limit Theorem (CLT) on the infinite CRF with explicit rates and exact constant. We then prove that the CRF trained on the rebalanced dataset exhibits a bias, which can be removed with appropriate techniques. Based on an importance sampling (IS) approach, the resulting debiased estimator, called IS-ICRF, satisfies a CLT centered at the prediction function value. For high imbalance settings, we prove that the IS-ICRF estimator enjoys a variance reduction compared to the ICRF trained on the original data. Therefore, our theoretical analysis highlights the benefits of training random forests on a rebalanced dataset (followed by a debiasing procedure) compared to using the original data. Our theoretical results, especially the variance rates and the variance reduction, appear to be valid for Breiman's random forests in our experiments.
MLFeb 3, 2022
Minimax rate of consistency for linear models with missing valuesAlexis Ayme, Claire Boyer, Aymeric Dieuleveut et al.
Missing values arise in most real-world data sets due to the aggregation of multiple sources and intrinsically missing information (sensor failure, unanswered questions in surveys...). In fact, the very nature of missing values usually prevents us from running standard learning algorithms. In this paper, we focus on the extensively-studied linear models, but in presence of missing values, which turns out to be quite a challenging task. Indeed, the Bayes rule can be decomposed as a sum of predictors corresponding to each missing pattern. This eventually requires to solve a number of learning tasks, exponential in the number of input features, which makes predictions impossible for current real-world datasets. First, we propose a rigorous setting to analyze a least-square type estimator and establish a bound on the excess risk which increases exponentially in the dimension. Consequently, we leverage the missing data distribution to propose a new algorithm, andderive associated adaptive risk bounds that turn out to be minimax optimal. Numerical experiments highlight the benefits of our method compared to state-of-the-art algorithms used for predictions with missing values.
MLJun 1, 2021
What's a good imputation to predict with missing values?Marine Le Morvan, Julie Josse, Erwan Scornet et al.
How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network capturing the conditional links across observed and unobserved variables whatever the missing-value pattern. Experiments confirm that joint imputation and regression through NeuMiss is better than various two step procedures in our experiments with finite number of samples.
MLMay 25, 2021
SHAFF: Fast and consistent SHApley eFfect estimates via random ForestsClément Bénard, Gérard Biau, Sébastien da Veiga et al.
Interpretability of learning algorithms is crucial for applications involving critical decisions, and variable importance is one of the main interpretation tools. Shapley effects are now widely used to interpret both tree ensembles and neural networks, as they can efficiently handle dependence and interactions in the data, as opposed to most other variable importance measures. However, estimating Shapley effects is a challenging task, because of the computational complexity and the conditional expectation estimates. Accordingly, existing Shapley algorithms have flaws: a costly running time, or a bias when input variables are dependent. Therefore, we introduce SHAFF, SHApley eFfects via random Forests, a fast and accurate Shapley effect estimate, even when input variables are dependent. We show SHAFF efficiency through both a theoretical analysis of its consistency, and the practical performance improvements over competitors with extensive experiments. An implementation of SHAFF in C++ and R is available online.
LGOct 29, 2020
Analyzing the tree-layer structure of Deep ForestsLudovic Arnould, Claire Boyer, Erwan Scornet et al.
Random forests on the one hand, and neural networks on the other hand, have met great success in the machine learning community for their predictive performance. Combinations of both have been proposed in the literature, notably leading to the so-called deep forests (DF) (Zhou \& Feng,2019). In this paper, our aim is not to benchmark DF performances but to investigate instead their underlying mechanisms. Additionally, DF architecture can be generally simplified into more simple and computationally efficient shallow forest networks. Despite some instability, the latter may outperform standard predictive tree-based methods. We exhibit a theoretical framework in which a shallow tree network is shown to enhance the performance of classical decision trees. In such a setting, we provide tight theoretical lower and upper bounds on its excess risk. These theoretical results show the interest of tree-network architectures for well-structured data provided that the first layer, acting as a data encoder, is rich enough.
LGJul 3, 2020
NeuMiss networks: differentiable programming for supervised learning with missing valuesMarine Le Morvan, Julie Josse, Thomas Moreau et al.
The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann-series approximation of the optimal predictor, we propose a new principled architecture, named NeuMiss networks. Their originality and strength come from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of NeuMiss networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.
MLApr 29, 2020
Interpretable Random Forests via Rule ExtractionClément Bénard, Gérard Biau, Sébastien da Veiga et al.
We introduce SIRUS (Stable and Interpretable RUle Set) for regression, a stable rule learning algorithm which takes the form of a short and simple list of rules. State-of-the-art learning algorithms are often referred to as "black boxes" because of the high number of operations involved in their prediction process. Despite their powerful predictivity, this lack of interpretability may be highly restrictive for applications with critical decisions at stake. On the other hand, algorithms with a simple structure-typically decision trees, rule algorithms, or sparse linear models-are well known for their instability. This undesirable feature makes the conclusions of the data analysis unreliable and turns out to be a strong operational limitation. This motivates the design of SIRUS, which combines a simple structure with a remarkable stable behavior when data is perturbed. The algorithm is based on random forests, the predictive accuracy of which is preserved. We demonstrate the efficiency of the method both empirically (through experiments) and theoretically (with the proof of its asymptotic stability). Our R/C++ software implementation sirus is available from CRAN.
LGFeb 3, 2020
Linear predictor on linearly-generated data with missing values: non consistency and solutionsMarine Le Morvan, Nicolas Prost, Julie Josse et al.
We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and the various missing-value indicators. Due to its intrinsic complexity, we study a simple approximation and prove generalization bounds with finite samples, highlighting regimes for which each method performs best. We then show that multilayer perceptrons with ReLU activation functions can be consistent, and can explore good trade-offs between the true model and approximations. Our study highlights the interesting family of models that are beneficial to fit with missing values depending on the amount of data available.
STJan 13, 2020
Trees, forests, and impurity-based variable importanceErwan Scornet
Tree ensemble methods such as random forests [Breiman, 2001] are very popular to handle high-dimensional tabular data sets, notably because of their good predictive accuracy. However, when machine learning is used for decision-making problems, settling for the best predictive procedures may not be reasonable since enlightened decisions require an in-depth comprehension of the algorithm prediction process. Unfortunately, random forests are not intrinsically interpretable since their prediction results from averaging several hundreds of decision trees. A classic approach to gain knowledge on this so-called black-box algorithm is to compute variable importances, that are employed to assess the predictive impact of each input variable. Variable importances are then used to rank or select variables and thus play a great role in data analysis. Nevertheless, there is no justification to use random forest variable importances in such way: we do not even know what these quantities estimate. In this paper, we analyze one of the two well-known random forest variable importances, the Mean Decrease Impurity (MDI). We prove that if input variables are independent and in absence of interactions, MDI provides a variance decomposition of the output, where the contribution of each variable is clearly identified. We also study models exhibiting dependence between input variables or interaction, for which the variable importance is intrinsically ill-defined. Our analysis shows that there may exist some benefits to use a forest compared to a single tree.
MLAug 19, 2019
SIRUS: Stable and Interpretable RUle Set for ClassificationClément Bénard, Gérard Biau, Sébastien da Veiga et al.
State-of-the-art learning algorithms, such as random forests or neural networks, are often qualified as "black-boxes" because of the high number and complexity of operations involved in their prediction mechanism. This lack of interpretability is a strong limitation for applications involving critical decisions, typically the analysis of production processes in the manufacturing industry. In such critical contexts, models have to be interpretable, i.e., simple, stable, and predictive. To address this issue, we design SIRUS (Stable and Interpretable RUle Set), a new classification algorithm based on random forests, which takes the form of a short list of rules. While simple models are usually unstable with respect to data perturbation, SIRUS achieves a remarkable stability improvement over cutting-edge methods. Furthermore, SIRUS inherits a predictive accuracy close to random forests, combined with the simplicity of decision trees. These properties are assessed both from a theoretical and empirical point of view, through extensive numerical experiments based on our R/C++ software implementation sirus available from CRAN.
MLJun 25, 2019
AMF: Aggregated Mondrian Forests for Online LearningJaouad Mourtada, Stéphane Gaïffas, Erwan Scornet
Random Forests (RF) is one of the algorithms of choice in many supervised learning applications, be it classification or regression. The appeal of such tree-ensemble methods comes from a combination of several characteristics: a remarkable accuracy in a variety of tasks, a small number of parameters to tune, robustness with respect to features scaling, a reasonable computational cost for training and prediction, and their suitability in high-dimensional settings. The most commonly used RF variants however are "offline" algorithms, which require the availability of the whole dataset at once. In this paper, we introduce AMF, an online random forest algorithm based on Mondrian Forests. Using a variant of the Context Tree Weighting algorithm, we show that it is possible to efficiently perform an exact aggregation over all prunings of the trees; in particular, this enables to obtain a truly online parameter-free algorithm which is competitive with the optimal pruning of the Mondrian tree, and thus adaptive to the unknown regularity of the regression function. Numerical experiments show that AMF is competitive with respect to several strong baselines on a large number of datasets for multi-class classification.
MLFeb 19, 2019
On the consistency of supervised learning with missing valuesJulie Josse, Jacob M. Chen, Nicolas Prost et al.
In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean prior to learning is consistent when missing values are not informative. This contrasts with inferential settings where mean imputation is pointed at for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we analyze further decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" method as it can handle both non-informative and informative missing values.
MLMar 15, 2018
Minimax optimal rates for Mondrian trees and forestsJaouad Mourtada, Stéphane Gaïffas, Erwan Scornet
Introduced by Breiman, Random Forests are widely used classification and regression algorithms. While being initially designed as batch algorithms, several variants have been proposed to handle online learning. One particular instance of such forests is the \emph{Mondrian Forest}, whose trees are built using the so-called Mondrian process, therefore allowing to easily update their construction in a streaming fashion. In this paper, we provide a thorough theoretical study of Mondrian Forests in a batch learning setting, based on new results about Mondrian partitions. Our results include consistency and convergence rates for Mondrian Trees and Forests, that turn out to be minimax optimal on the set of $s$-Hölder function with $s \in (0,1]$ (for trees and forests) and $s \in (1,2]$ (for forests only), assuming a proper tuning of their complexity parameter in both cases. Furthermore, we prove that an adaptive procedure (to the unknown $s \in (0, 2]$) can be constructed by combining Mondrian Forests with a standard model aggregation algorithm. These results are the first demonstrating that some particular random forests achieve minimax rates \textit{in arbitrary dimension}. Owing to their remarkably simple distributional properties, which lead to minimax rates, Mondrian trees are a promising basis for more sophisticated yet theoretically sound random forests variants.
MLNov 8, 2017
Universal consistency and minimax rates for online Mondrian ForestsJaouad Mourtada, Stéphane Gaïffas, Erwan Scornet
We establish the consistency of an algorithm of Mondrian Forests, a randomized classification algorithm that can be implemented online. First, we amend the original Mondrian Forest algorithm, that considers a fixed lifetime parameter. Indeed, the fact that this parameter is fixed hinders the statistical consistency of the original procedure. Our modified Mondrian Forest algorithm grows trees with increasing lifetime parameters $λ_n$, and uses an alternative updating rule, allowing to work also in an online fashion. Second, we provide a theoretical analysis establishing simple conditions for consistency. Our theoretical analysis also exhibits a surprising fact: our algorithm achieves the minimax rate (optimal rate) for the estimation of a Lipschitz regression function, which is a strong extension of previous results to an arbitrary dimension.
MLApr 25, 2016
Neural Random ForestsGérard Biau, Erwan Scornet, Johannes Welbl
Given an ensemble of randomized regression trees, it is possible to restructure them as a collection of multilayered neural networks with particular connection weights. Following this principle, we reformulate the random forest method of Breiman (2001) into a neural network setting, and in turn propose two new hybrid procedures that we call neural random forests. Both predictors exploit prior knowledge of regression trees for their architecture, have less parameters to tune than standard networks, and less restrictions on the geometry of the decision boundaries than trees. Consistency results are proved, and substantial numerical evidence is provided on both synthetic and real data sets to assess the excellent performance of our methods in a large variety of prediction problems.
STNov 18, 2015
A Random Forest Guided TourGérard Biau, Erwan Scornet
The random forest algorithm, proposed by L. Breiman in 2001, has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad-hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.
STMay 12, 2014
Consistency of random forestsErwan Scornet, Gérard Biau, Jean-Philippe Vert
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45 (2001) 5--32] that combines several randomized decision trees and aggregates their predictions by averaging. Despite its wide usage and outstanding practical performance, little is known about the mathematical properties of the procedure. This disparity between theory and practice originates in the difficulty to simultaneously analyze both the randomization process and the highly data-dependent tree structure. In the present paper, we take a step forward in forest exploration by proving a consistency result for Breiman's [Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive regression models. Our analysis also sheds an interesting light on how random forests can nicely adapt to sparsity. 1. Introduction. Random forests are an ensemble learning method for classification and regression that constructs a number of randomized decision trees during the training phase and predicts by averaging the results. Since its publication in the seminal paper of Breiman (2001), the procedure has become a major data analysis tool, that performs well in practice in comparison with many standard methods. What has greatly contributed to the popularity of forests is the fact that they can be applied to a wide range of prediction problems and have few parameters to tune. Aside from being simple to use, the method is generally recognized for its accuracy and its ability to deal with small sample sizes, high-dimensional feature spaces and complex data structures. The random forest methodology has been successfully involved in many practical problems, including air quality prediction (winning code of the EMC data science global hackathon in 2012, see http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al. (2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3D