Giles Hooker

ML
h-index: 6
30 papers
1,028 citations
Novelty: 49%
AI Score: 50

30 Papers

ML | Nov 22, 2022 | Code
A Generic Approach for Reproducible Model Distillation

Yunzhe Zhou, Peiru Xu, Giles Hooker

Model distillation has been a popular method for producing interpretable machine learning. It uses an interpretable "student" model to mimic the predictions made by the black box "teacher" model. However, when the student model is sensitive to the variability of the data sets used for training even when keeping the teacher fixed, the corresponding interpretation is not reliable. Existing strategies stabilize model distillation by checking whether a large enough corpus of pseudo-data is generated to reliably reproduce student models, but methods to do so have so far been developed for a specific student model. In this paper, we develop a generic approach for stable model distillation based on the central limit theorem for the average loss. We start with a collection of candidate student models and search for candidates that reasonably agree with the teacher. Then we construct a multiple testing framework to select a corpus size such that a consistent student model would be selected under different pseudo samples. We demonstrate the application of our proposed approach on three commonly used intelligible models: decision trees, falling rule lists and symbolic regression. Finally, we conduct simulation experiments on Mammographic Mass and Breast Cancer datasets and illustrate the testing procedure through a theoretical analysis with a Markov process. The code is publicly available at https://github.com/yunzhe-zhou/GenericDistillation.

ML | Aug 31, 2022
The Infinitesimal Jackknife and Combinations of Models

Indrayudh Ghosal, Yunzhe Zhou, Giles Hooker

The Infinitesimal Jackknife is a general method for estimating variances of parametric models, and more recently also for some ensemble methods. In this paper we extend the Infinitesimal Jackknife to estimate the covariance between any two models. This can be used to quantify uncertainty for combinations of models, or to construct test statistics for comparing different models or ensembles of models fitted using the same training dataset. Specific examples in this paper use boosted combinations of models like random forests and M-estimators. We also investigate its application on neural networks and ensembles of XGBoost models. We illustrate the efficacy of variance estimates through extensive simulations and its application to the Beijing Housing data, and demonstrate the theoretical consistency of the Infinitesimal Jackknife covariance estimate.

47.3 | ML | May 6
Estimating Implicit Regularization in Deep Learning

Joseph H. Rudoler, Kevin Tan, Giles Hooker et al.

Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.

ML | Oct 11, 2023
Stabilizing Estimates of Shapley Values with Control Variates

Jeremy Goldwasser, Giles Hooker

Shapley values are among the most popular tools for explaining predictions of blackbox machine learning models. However, their high computational cost motivates the use of sampling approximations, inducing a considerable degree of uncertainty. To stabilize these model explanations, we propose ControlSHAP, an approach based on the Monte Carlo technique of control variates. Our methodology is applicable to any machine learning model and requires virtually no extra computation or modeling effort. On several high-dimensional datasets, we find it can produce dramatic reductions in the Monte Carlo variability of Shapley estimates.
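
The control-variate idea named in this abstract can be illustrated on a toy integral. The sketch below is not ControlSHAP itself: the target `exp(X)`, the control `g(X) = X`, and all function names are assumptions chosen for illustration. It only shows how subtracting a correlated quantity with a known mean shrinks Monte Carlo variability at negligible extra cost.

```python
import math
import random
import statistics


def plain_estimate(xs):
    """Plain Monte Carlo estimate of E[exp(X)]."""
    return statistics.fmean([math.exp(x) for x in xs])


def control_variate_estimate(xs, control_mean=0.5):
    """Estimate E[exp(X)] using g(X) = X as a control variate.

    control_mean is the known value of E[g(X)], which is 0.5
    for X ~ Uniform(0, 1).
    """
    f_vals = [math.exp(x) for x in xs]
    g_vals = list(xs)
    f_bar = statistics.fmean(f_vals)
    g_bar = statistics.fmean(g_vals)
    # Estimated optimal coefficient c = Cov(f, g) / Var(g).
    cov_fg = sum((f - f_bar) * (g - g_bar) for f, g in zip(f_vals, g_vals))
    var_g = sum((g - g_bar) ** 2 for g in g_vals)
    c = cov_fg / var_g
    # Subtract the scaled error observed in the control's sample mean.
    return f_bar - c * (g_bar - control_mean)


rng = random.Random(0)
plain, controlled = [], []
for _ in range(200):
    sample = [rng.random() for _ in range(100)]
    plain.append(plain_estimate(sample))
    controlled.append(control_variate_estimate(sample))

var_plain = statistics.variance(plain)
var_controlled = statistics.variance(controlled)
print(f"plain variance:           {var_plain:.2e}")
print(f"control-variate variance: {var_controlled:.2e}")
```

Both estimators reuse the same draws, so the only added work is one covariance and one variance per sample.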

LG | Apr 29, 2024
Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots

Xi Xin, Giles Hooker, Fei Huang

The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models and interpretation tools for decision making. This paper proposes an adversarial framework to uncover the vulnerability of permutation-based interpretation methods for machine learning tasks, with a particular focus on partial dependence (PD) plots. This adversarial framework modifies the original black box model to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that can conceal discriminatory behaviors while preserving most of the original model's predictions. This framework can produce multiple fooled PD plots via a single model. By using real-world datasets including an auto insurance claims dataset and COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools like PD plots while retaining almost all the predictions of the original black-box model. Managerial insights for regulators and practitioners are provided based on the findings.

ML | Jan 28, 2024
Statistical Significance of Feature Importance Rankings

Jeremy Goldwasser, Giles Hooker

Feature importance scores are ubiquitous tools for understanding the predictions of machine learning models. However, many popular attribution methods suffer from high instability due to random sampling. Leveraging novel ideas from hypothesis testing, we devise techniques that ensure the most important features are correct with high-probability guarantees. These assess the set of $K$ top-ranked features, as well as the order of its elements. Given a set of local or global importance scores, we demonstrate how to retrospectively verify the stability of the highest ranks. We then introduce two efficient sampling algorithms that identify the $K$ most important features, perhaps in order, with probability exceeding $1-\alpha$. The theoretical justification for these procedures is validated empirically on SHAP and LIME.

LG | Feb 29, 2024
Longitudinal Counterfactuals: Constraints and Opportunities

Alexander Asemota, Giles Hooker

Counterfactual explanations are a common approach to providing recourse to data subjects. However, current methodology can produce counterfactuals that cannot be achieved by the subject, making the use of counterfactuals for recourse difficult to justify in practice. Though there is agreement that plausibility is an important quality when using counterfactuals for algorithmic recourse, ground truth plausibility continues to be difficult to quantify. In this paper, we propose using longitudinal data to assess and improve plausibility in counterfactuals. In particular, we develop a metric that compares longitudinal differences to counterfactual differences, allowing us to evaluate how similar a counterfactual is to prior observed changes. Furthermore, we use this metric to generate plausible counterfactuals. Finally, we discuss some of the inherent difficulties of using counterfactuals for recourse.

ML | Jan 26
Statistical Inference for Explainable Boosting Machines

Haimo Fang, Kevin Tan, Jonathan Pipping et al.

Explainable boosting machines (EBMs) are popular "glass-box" models that learn a set of univariate functions using boosting trees. These achieve explainability through visualizations of each feature's effect. However, unlike linear model coefficients, uncertainty quantification for the learned univariate functions requires computationally intensive bootstrapping, making it hard to know which features truly matter. We provide an alternative using recent advances in statistical inference for gradient boosting, deriving methods for statistical inference as well as end-to-end theoretical guarantees. Using a moving average instead of a sum of trees (Boulevard regularization) allows the boosting process to converge to a feature-wise kernel ridge regression. This produces asymptotically normal predictions that achieve the minimax-optimal mean squared error for fitting Lipschitz GAMs with $p$ features at rate $O(pn^{-2/3})$, successfully avoiding the curse of dimensionality. We then construct prediction intervals for the response and confidence intervals for each learned univariate function with a runtime independent of the number of datapoints, enabling further explainability within EBMs.

ML | Sep 27, 2025
Statistical Inference for Gradient Boosting Regression

Haimo Fang, Kevin Tan, Giles Hooker

Gradient boosting is widely popular due to its flexibility and predictive accuracy. However, statistical inference and uncertainty quantification for gradient boosting remain challenging and under-explored. We propose a unified framework for statistical inference in gradient boosting regression. Our framework integrates dropout or parallel training with a recently proposed regularization procedure that allows for a central limit theorem (CLT) for boosting. With these enhancements, we surprisingly find that increasing the dropout rate and the number of trees grown in parallel at each iteration substantially enhances signal recovery and overall performance. Our resulting algorithms enjoy similar CLTs, which we use to construct built-in confidence intervals, prediction intervals, and rigorous hypothesis tests for assessing variable importance. Numerical experiments demonstrate that our algorithms perform well, interpolate between regularized boosting and random forests, and confirm the validity of their built-in statistical inference procedures.

LG | Apr 21, 2025
Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks

Jeremy Goldwasser, Giles Hooker

Counterfactuals are a popular framework for interpreting machine learning predictions. These "what if" explanations are notoriously challenging to create for computer vision models: standard gradient-based methods are prone to produce adversarial examples, in which imperceptible modifications to image pixels provoke large changes in predictions. We introduce a new, easy-to-implement framework for counterfactual images that can flexibly adapt to contemporary advances in generative modeling. Our method, Counterfactual Attacks, resembles an adversarial attack on the representation of the image along a low-dimensional manifold. In addition, given an auxiliary dataset of image descriptors, we show how to accompany counterfactuals with feature attributions that quantify the changes between the original and counterfactual images. These importance scores can be aggregated into global counterfactual explanations that highlight the overall features driving model predictions. While this unification is possible for any counterfactual method, it has particular computational efficiency for ours. We demonstrate the efficacy of our approach with the MNIST and CelebA datasets.

LG | Feb 6, 2025
Targeted Learning for Data Fairness

Alexander Asemota, Giles Hooker

Data and algorithms have the potential to produce and perpetuate discrimination and disparate treatment. As such, significant effort has been invested in developing approaches to defining, detecting, and eliminating unfair outcomes in algorithms. In this paper, we focus on performing statistical inference for fairness. Prior work in fairness inference has largely focused on inferring the fairness properties of a given predictive algorithm. Here, we expand fairness inference by evaluating fairness in the data generating process itself, referred to here as data fairness. We perform inference on data fairness using targeted learning, a flexible framework for nonparametric inference. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. Additionally, we find that our estimators for probabilistic metrics exploit double robustness. To validate our approach, we perform several simulations and apply our estimators to real data.

ML | Nov 4, 2024
Targeted Learning for Variable Importance

Xiaohan Wang, Yunzhe Zhou, Giles Hooker

Variable importance is one of the most widely used measures for interpreting machine learning with significant interest from both statistics and machine learning communities. Recently, increasing attention has been directed toward uncertainty quantification in these metrics. Current approaches largely rely on one-step procedures, which, while asymptotically efficient, can present higher sensitivity and instability in finite sample settings. To address these limitations, we propose a novel method by employing the targeted learning (TL) framework, designed to enhance robustness in inference for variable importance metrics. Our approach is particularly suited for conditional permutation variable importance. We show that it (i) retains the asymptotic efficiency of traditional methods, (ii) maintains comparable computational complexity, and (iii) delivers improved accuracy, especially in finite sample contexts. We further support these findings with numerical experiments that illustrate the practical advantages of our method and validate the theoretical results.

ML | Jun 15, 2021
S-LIME: Stabilized-LIME for Model Explanation

Zhengze Zhou, Giles Hooker, Fei Wang

An increasing number of machine learning models have been deployed in domains with high stakes such as finance and healthcare. Despite their superior performance, many models are black boxes in nature and hard to explain. There are growing efforts by researchers to develop methods to interpret these black-box models. Post hoc explanations based on perturbations, such as LIME, are widely used approaches to interpret a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the method itself and harming user trust. In this paper, we propose S-LIME, which utilizes a hypothesis testing framework based on the central limit theorem for determining the number of perturbation points needed to guarantee stability of the resulting explanation. Experiments on both simulated and real world data sets are provided to demonstrate the effectiveness of our method.

ME | Feb 24, 2021
Generalised Boosted Forests

Indrayudh Ghosal, Giles Hooker

This paper extends recent work on boosting random forests to model non-Gaussian responses. Given an exponential family $\mathbb{E}[Y|X] = g^{-1}(f(X))$, our goal is to obtain an estimate for $f$. We start with an MLE-type estimate in the link space and then define generalised residuals from it. We use these residuals and some corresponding weights to fit a base random forest and then repeat the same to obtain a boost random forest. We call the sum of these three estimators a generalised boosted forest. We show with simulated and real data that both random forest steps reduce test-set log-likelihood, which we treat as our primary metric. We also provide a variance estimator, which we can obtain with the same computational cost as the original estimate itself. Empirical experiments on real-world data and simulations demonstrate that the methods can effectively reduce bias, and that confidence interval coverage is conservative in the bulk of the covariate distribution.

OT | Feb 23, 2021
Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

Lucas Mentch, Giles Hooker

In 2001, Leo Breiman wrote of a divide between "data modeling" and "algorithmic modeling" cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the "data modelers" incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman's own Random Forest methods. While this can be simplistically described as "Breiman won", these same developments also expose the limitations of the prediction-first philosophy that he espoused, making careful statistical analysis all the more important. This paper outlines these exciting recent developments in the random forest literature which, in our view, occurred as a result of a necessary blending of the two ways of thinking Breiman originally described. We also ask what areas statistics and statisticians might currently overlook.

ML | Dec 2, 2019
$V$-statistics and Variance Estimation

Zhengze Zhou, Lucas Mentch, Giles Hooker

This paper develops a general framework for analyzing asymptotics of $V$-statistics. Previous literature on limiting distribution mainly focuses on the cases when $n \to \infty$ with fixed kernel size $k$. Under some regularity conditions, we demonstrate asymptotic normality when $k$ grows with $n$ by utilizing existing results for $U$-statistics. The key in our approach lies in a mathematical reduction to $U$-statistics by designing an equivalent kernel for $V$-statistics. We also provide a unified treatment on variance estimation for both $U$- and $V$-statistics by observing connections to existing methods and proposing an empirically more accurate estimator. Ensemble methods such as random forests, where multiple base learners are trained and aggregated for prediction purposes, serve as a running example throughout the paper because they are a natural and flexible application of $V$-statistics.

ML | Nov 12, 2019
Purifying Interaction Effects with the Functional ANOVA: An Efficient Algorithm for Recovering Identifiable Additive Models

Benjamin Lengerich, Sarah Tan, Chun-Hao Chang et al.

Models which estimate main effects of individual variables alongside interaction effects have an identifiability challenge: effects can be freely moved between main effects and interaction effects without changing the model prediction. This is a critical problem for interpretability because it permits "contradictory" models to represent the same function. To solve this problem, we propose pure interaction effects: variance in the outcome which cannot be represented by any smaller subset of features. This definition has an equivalence with the Functional ANOVA decomposition. To compute this decomposition, we present a fast, exact algorithm that transforms any piecewise-constant function (such as a tree-based model) into a purified, canonical representation. We apply this algorithm to Generalized Additive Models with interactions trained on several datasets and show large disparity, including contradictions, between the effects before and after purification. These results underscore the need to specify data distributions and ensure identifiability before interpreting model parameters.
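
A minimal sketch of what purification does to a single pairwise effect, assuming uniform weight on every cell of a piecewise-constant table. Under that assumption one pass of mean-centring is exact; the algorithm the abstract refers to is stated for general data distributions and is not reproduced here. The function name `purify_uniform` and the example table are hypothetical.

```python
def purify_uniform(table):
    """Split a 2-D piecewise-constant effect into an intercept, two main
    effects, and a pure interaction, assuming uniform cell weights."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    # Mass that a single feature can explain moves into its main effect.
    main_1 = [m - grand for m in row_means]
    main_2 = [m - grand for m in col_means]
    # What remains cannot be represented by either feature alone.
    pure = [[table[i][j] - grand - main_1[i] - main_2[j]
             for j in range(n_cols)] for i in range(n_rows)]
    return grand, main_1, main_2, pure


# Hypothetical interaction table f(x1, x2) on a 3 x 2 grid of bins.
raw = [[1.0, 3.0],
       [2.0, 6.0],
       [0.0, 0.0]]
intercept, main_1, main_2, pure = purify_uniform(raw)
print(intercept, main_1, main_2, pure)
```

Every row and column of `pure` averages to zero, and the four pieces sum back to the original table, so the model's predictions are unchanged while the decomposition becomes unique.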

ME | May 1, 2019
Unrestricted Permutation forces Extrapolation: Variable Importance Requires at least One More Model, or There Is No Free Variable Importance

Giles Hooker, Lucas Mentch, Siyu Zhou

This paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationally efficient and widely available in software. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. The purpose of our work here is to (i) review this growing body of literature, (ii) provide further demonstrations of these drawbacks along with a detailed explanation as to why they occur, and (iii) advocate for alternative measures that involve additional modeling. In particular, we describe how breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects across various model setups and find support for previous claims in the literature that PaP metrics can vastly over-emphasize correlated features in both variable importance measures and partial dependence plots. As an alternative, we discuss and recommend more direct approaches that involve measuring the change in model performance after muting the effects of the features under investigation.
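
The extrapolation argument can be made concrete with a small simulation. The sketch below uses assumed synthetic data (two features that never differ by more than 0.05) and measures how many rows land outside the observed region once one feature is permuted. No model is fitted; the point is only where a model would be asked to predict.

```python
import random

rng = random.Random(0)
n = 500
# Two strongly dependent features: x2 stays within 0.05 of x1.
x1 = [rng.random() for _ in range(n)]
x2 = [a + rng.uniform(-0.05, 0.05) for a in x1]

# Largest gap between the two features anywhere in the observed data.
max_observed_gap = max(abs(a - b) for a, b in zip(x1, x2))

# Permute x2 across rows, as permute-and-predict does before re-scoring.
x2_permuted = x2[:]
rng.shuffle(x2_permuted)

# Count permuted rows that lie outside the band where data was observed.
outside = sum(abs(a - b) > max_observed_gap for a, b in zip(x1, x2_permuted))
fraction_outside = outside / n
print(f"{fraction_outside:.0%} of permuted rows fall outside the observed band")
```

Any importance score computed from predictions on the permuted rows therefore depends mostly on how the model behaves where it saw no training data.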

ML | Mar 12, 2019
Unbiased Measurement of Feature Importance in Tree-Based Methods

Zhengze Zhou, Giles Hooker

We propose a modification that corrects for split-improvement variable importance measures in Random Forests and other tree-based methods. These methods have been shown to be biased towards increasing the importance of features with more potential splits. We show that by appropriately incorporating split-improvement as measured on out-of-sample data, this bias can be corrected, yielding better summaries and screening tools.

ML | Aug 22, 2018
Approximation Trees: Statistical Stability in Model Distillation

Yichen Zhou, Zhengze Zhou, Giles Hooker

This paper examines the stability of learned explanations for black-box predictions via model distillation with decision trees. One approach to intelligibility in machine learning is to use an understandable 'student' model to mimic the output of an accurate 'teacher'. Here, we consider the use of regression trees as a student model, in which nodes of the tree can be used as 'explanations' for particular predictions, and the whole structure of the tree can be used as a global representation of the resulting function. However, individual trees are sensitive to the particular data sets used to train them, and an interpretation of a student model may be suspect if small changes in the training data have a large effect on it. In this context, access to outcomes from a teacher helps to stabilize the greedy splitting strategy by generating a much larger corpus of training examples than was originally available. We develop tests to ensure that enough examples are generated at each split so that the same splitting rule would be chosen with high probability were the tree to be retrained. Further, we develop a stopping rule to indicate how deep the tree should be built based on recent results on the variability of Random Forests when these are used as the teacher. We provide concrete examples of these procedures on the CAD-MDD and COMPAS data sets.

ME | Jun 26, 2018
Boulevard: Regularized Stochastic Gradient Boosted Trees and Their Limiting Distribution

Yichen Zhou, Giles Hooker

This paper examines a novel gradient boosting framework for regression. We regularize gradient boosted trees by introducing subsampling and employ a modified shrinkage algorithm so that at every boosting stage the estimate is given by an average of trees. The resulting algorithm, titled Boulevard, is shown to converge as the number of trees grows. We also demonstrate a central limit theorem for this limit, allowing a characterization of uncertainty for predictions. A simulation study and real world examples provide support for both the predictive accuracy of the model and its limiting behavior.
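
One way to write an averaging update of the kind this abstract describes is given below. The notation is assumed here rather than taken from the abstract: $\hat f_b$ is the estimate after $b$ trees, $t_b$ is the tree fitted on a subsample to the current residuals $y - \hat f_{b-1}$, and $\lambda \in (0, 1]$ is a shrinkage factor.

```latex
\hat f_b(x) = \frac{b-1}{b}\,\hat f_{b-1}(x) + \frac{\lambda}{b}\,t_b(x),
\qquad \hat f_0 \equiv 0
\quad\Longrightarrow\quad
\hat f_b(x) = \frac{\lambda}{b}\sum_{i=1}^{b} t_i(x).
```

Unrolling the recursion shows the estimate is a scaled average of the trees rather than a growing sum, which is the property the abstract relies on for convergence as the number of trees grows.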

ML | Mar 21, 2018
Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and its Variance Estimate

Indrayudh Ghosal, Giles Hooker

In this paper we propose using the principle of boosting to reduce the bias of a random forest prediction in the regression setting. From the original random forest fit we extract the residuals and then fit another random forest to these residuals. We call the sum of these two random forests a one-step boosted forest. We show with simulated and real data that the one-step boosted forest has a reduced bias compared to the original random forest. The paper also provides a variance estimate of the one-step boosted forest by an extension of the infinitesimal Jackknife estimator. Using this variance estimate we can construct prediction intervals for the boosted forest and we show that they have good coverage probabilities. Combining the bias reduction and the variance estimate we show that the one-step boosted forest has a significant reduction in predictive mean squared error and thus an improvement in predictive performance. When applied to datasets from the UCI database, one-step boosted forest performs better than random forest and gradient boosting machine algorithms. Theoretically we can also extend such a boosting process to more than one step and the same principles outlined in this paper can be used to find variance estimates for such predictors. Such boosting will reduce bias even further but it risks over-fitting and also increases the computational burden.
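
The fit, take residuals, fit again, and sum recipe can be sketched with any biased smoother. The example below substitutes a Gaussian kernel smoother for the random forest, purely to stay dependency-free, so it illustrates the one-step boosting principle rather than the paper's estimator. The target function, bandwidth, and all names are assumptions.

```python
import math


def kernel_smoother(xs, ys, bandwidth=0.1):
    """Return a Nadaraya-Watson regression function fitted to (xs, ys)."""
    def predict(x0):
        weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
        return sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return predict


def true_function(x):
    return math.sin(2 * math.pi * x)


xs = [i / 49 for i in range(50)]
ys = [true_function(x) for x in xs]  # noise-free, to isolate bias

# Base fit, then the same learner fitted to the base fit's residuals.
base = kernel_smoother(xs, ys)
residuals = [y - base(x) for x, y in zip(xs, ys)]
correction = kernel_smoother(xs, residuals)


def boosted(x):
    """One-step boosted prediction: base fit plus residual fit."""
    return base(x) + correction(x)


def mse(predict):
    """Mean squared error against the true function on the design points."""
    return sum((predict(x) - true_function(x)) ** 2 for x in xs) / len(xs)


mse_base, mse_boosted = mse(base), mse(boosted)
print(f"base MSE:    {mse_base:.4f}")
print(f"boosted MSE: {mse_boosted:.4f}")
```

The data are noise-free on purpose, so the comparison reflects bias only; with noisy responses the second fit also adds variance, which is the trade-off the abstract mentions for further boosting steps.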

ML | Jan 26, 2018
Considerations When Learning Additive Explanations for Black-Box Models

Sarah Tan, Giles Hooker, Paul Koch et al.

Many methods to explain black-box models, whether local or global, are additive. In this paper, we study global additive explanations for non-additive models, focusing on four explanation methods: partial dependence, Shapley explanations adapted to a global setting, distilled additive explanations, and gradient-based explanations. We show that different explanation methods characterize non-additive components in a black-box model's prediction function in different ways. We use the concepts of main and total effects to anchor additive explanations, and quantitatively evaluate additive and non-additive explanations. Even though distilled explanations are generally the most accurate additive explanations, non-additive explanations such as tree explanations that explicitly model non-additive components tend to be even more accurate. Despite this, our user study showed that machine learning practitioners were better able to leverage additive explanations for various tasks. These considerations should be taken into account when considering which explanation to trust and use to explain black-box models.

ML | Nov 19, 2017
A Double Parametric Bootstrap Test for Topic Models

Skyler Seto, Sarah Tan, Giles Hooker et al.

Non-negative matrix factorization (NMF) is a technique for finding latent representations of data. The method has been applied to corpora to construct topic models. However, NMF has likelihood assumptions which are often violated by real document corpora. We present a double parametric bootstrap test for evaluating the fit of an NMF-based topic model based on the duality of the KL divergence and Poisson maximum likelihood estimation. The test correctly identifies whether a topic model based on an NMF approach yields reliable results in simulated and real data.

ML | Oct 17, 2017
Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation

Sarah Tan, Rich Caruana, Giles Hooker et al.

Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, a model distillation and comparison approach to audit such models. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk scores assigned by black-box models. We compare the student model trained with distillation to a second un-distilled transparent model trained on ground-truth outcomes, and use differences between the two models to gain insight into the black-box model. Our approach can be applied in a realistic setting, without probing the black-box model API. We demonstrate the approach on four public data sets: COMPAS, Stop-and-Frisk, Chicago Police, and Lending Club. We also propose a statistical test to determine if a data set is missing key features used to train the black-box model. Our test finds that the ProPublica data is likely missing key feature(s) used in COMPAS.

ML | Apr 15, 2017
Machine Learning and the Future of Realism

Giles Hooker, Cliff Hooker

The preceding three decades have seen the emergence, rise, and proliferation of machine learning (ML). From half-recognised beginnings in perceptrons, neural nets, and decision trees, algorithms that extract correlations (that is, patterns) from a set of data points have broken free from their origin in computational cognition to embrace all forms of problem solving, from voice recognition to medical diagnosis to automated scientific research and driverless cars, and it is now widely opined that the real industrial revolution lies less in mobile phones and the like than in the maturation and universal application of ML. Among the consequences just might be the triumph of anti-realism over realism.

ML | Nov 22, 2016
Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable

Sarah Tan, Matvey Soloviev, Giles Hooker et al.

Ensembles of decision trees perform well on many problems, but are not interpretable. In contrast to existing approaches in interpretability that focus on explaining relationships between features and predictions, we propose an alternative approach to interpret tree ensemble classifiers by surfacing representative points for each class -- prototypes. We introduce a new distance for Gradient Boosted Tree models, and propose new, adaptive prototype selection methods with theoretical guarantees, with the flexibility to choose a different number of prototypes in each class. We demonstrate our methods on random forests and gradient boosted trees, showing that the prototypes can perform as well as or even better than the original tree ensemble when used as a nearest-prototype classifier. In a user study, humans were better at predicting the output of a tree ensemble classifier when using prototypes than when using Shapley values, a popular feature attribution method. Hence, prototypes present a viable alternative to feature-based explanations for tree ensembles.

ML | Jun 1, 2015
Bootstrap Bias Corrections for Ensemble Methods

Giles Hooker, Lucas Mentch

This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is an important obstacle in recent efforts to develop statistical inference for machine learning methods. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble without introducing additional variance. Our method is shown to improve test-set accuracy over random forests by up to 70% on example problems from the UCI repository.

ML | Jun 7, 2014
Formal Hypothesis Tests for Additive Structure in Random Forests

Lucas Mentch, Giles Hooker

While statistical learning methods have proved powerful tools for predictive modeling, the black-box nature of the models they produce can severely limit their interpretability and the ability to conduct formal inference. However, the natural structure of ensemble learners like bagged trees and random forests has been shown to admit desirable asymptotic properties when base learners are built with proper subsamples. In this work, we demonstrate that by defining an appropriate grid structure on the covariate space, we may carry out formal hypothesis tests for both variable importance and underlying additive model structure. To our knowledge, these tests represent the first statistical tools for investigating the underlying regression structure in a context such as random forests. We develop notions of total and partial additivity and further demonstrate that testing can be carried out at no additional computational cost by estimating the variance within the process of constructing the ensemble. Furthermore, we propose a novel extension of these testing procedures utilizing random projections in order to allow for computationally efficient testing procedures that retain high power even when the grid size is much larger than that of the training set.

ML | Apr 25, 2014
Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

Lucas Mentch, Giles Hooker

This work develops formal statistical inference procedures for machine learning ensemble methods. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Simulations and illustrations on a real dataset are provided.
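
The structure described here, averaging a base learner over subsamples, can be sketched as follows. The 1-nearest-neighbour base learner stands in for a tree, and the data, subsample size, and function names are assumptions for illustration. Enumerating every size-`k` subset gives a complete U-statistic; drawing a random subset of them gives the incomplete version used in practice.

```python
import itertools
import random
import statistics


def subsample_ensemble(data, k, base_learner, n_subsamples=None, rng=None):
    """Average a base learner over subsamples of size k.

    With n_subsamples=None every size-k subset is used (complete
    U-statistic); otherwise n_subsamples random subsets are drawn
    without replacement within each subset (incomplete U-statistic).
    """
    if n_subsamples is None:
        subsets = itertools.combinations(data, k)
    else:
        rng = rng or random.Random()
        subsets = (rng.sample(data, k) for _ in range(n_subsamples))
    predictions = [base_learner(subset) for subset in subsets]
    return statistics.fmean(predictions)


def one_nn_at(query):
    """Base learner: predict at `query` with the nearest neighbour's response."""
    def learner(subsample):
        return min(subsample, key=lambda pair: abs(pair[0] - query))[1]
    return learner


# Hypothetical (x, y) training pairs with y = x^2 on a grid.
data = [(i / 11, (i / 11) ** 2) for i in range(12)]
complete = subsample_ensemble(data, 4, one_nn_at(0.5))
incomplete = subsample_ensemble(data, 4, one_nn_at(0.5),
                                n_subsamples=200, rng=random.Random(0))
print(f"complete: {complete:.4f}  incomplete: {incomplete:.4f}")
```

Because the base learner is symmetric in its arguments, the complete average has exactly the U-statistic form, which is what makes the asymptotic normality results in the abstract available.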