LGNov 15, 2022Code
Model free variable importance for high dimensional dataNaofumi Hama, Masayoshi Mase, Art B. Owen
A model-agnostic variable importance method can be used with arbitrary prediction functions. Here we present some model-free methods that do not require access to the prediction function. This is useful when that function is proprietary and not available, or just extremely expensive. It is also useful when studying residuals from a model. The cohort Shapley (CS) method is model-free but has exponential cost in the dimension of the input space. A supervised on-manifold Shapley method from Frye et al. (2020) is also model free but requires as input a second black box model that has to be trained for the Shapley value problem. We introduce an integrated gradient (IG) version of cohort Shapley, called IGCS, with cost $\mathcal{O}(nd)$. We show that over the vast majority of the relevant unit cube that the IGCS value function is close to a multilinear function for which IGCS matches CS. Another benefit of IGCS is that is allows IG methods to be used with binary predictors. We use some area between curves (ABC) measures to quantify the performance of IGCS. On a problem from high energy physics we verify that IGCS has nearly the same ABCs as CS does. We also use it on a problem from computational chemistry in 1024 variables. We see there that IGCS attains much higher ABCs than we get from Monte Carlo sampling. The code is publicly available at https://github.com/cohortshapley/cohortintgrad
STMar 21, 2017
On Shapley value for measuring importance of dependent inputsArt B. Owen, Clémentine Prieur
This paper makes the case for using Shapley value to quantify the importance of random input variables to a function. Alternatives based on the ANOVA decomposition can run into conceptual and computational problems when the input variables are dependent. Our main goal here is to show that Shapley value removes the conceptual problems. We do this with some simple examples where Shapley value leads to intuitively reasonable nearly closed form values.
NAMay 8, 2012
Variance components and generalized Sobol' indicesArt B. Owen
This paper introduces generalized Sobol' indices, compares strategies for their estimation, and makes a systematic search for efficient estimators. Of particular interest are contrasts, sums of squares and indices of bilinear form which allow a reduced number of function evaluations compared to alternatives. The bilinear framework includes some efficient estimators from Saltelli (2002) and Mauntz (2002) as well as some new estimators for specific variance components and mean dimensions. This paper also provides a bias corrected version of the estimator of Janon et al.\,(2012) and extends the bias correction to generalized Sobol' indices. Some numerical comparisons are given.
MEApr 21, 2012
Better estimation of small Sobol' sensitivity indicesArt B. Owen
A new method for estimating Sobol' indices is proposed. The new method makes use of 3 independent input vectors rather than the usual 2. It attains much greater accuracy on problems where the target Sobol' index is small, even outperforming some oracles which adjust using the true but unknown mean of the function. When the target Sobol' index is quite large, the oracles do better than the new method.
CODec 9, 2015
Transformations and Hardy-Krause variationKinjal Basu, Art B. Owen
Using a multivariable Faa di Bruno formula we give conditions on transformations $τ:[0,1]^m\to\mathcal{X}$ where $\mathcal{X}$ is a closed and bounded subset of $\mathbb{R}^d$ such that $f\circτ$ is of bounded variation in the sense of Hardy and Krause for all $f\in C^d(\mathcal{x})$. We give similar conditions for $f\circτ$ to be smooth enough for scrambled net sampling to attain $O(n^{-3/2+ε})$ accuracy. Some popular symmetric transformations to the simplex and sphere are shown to satisfy neither condition. Some other transformations due to Fang and Wang (1993) satisfy the first but not the second condition. We provide transformations for the simplex that makes $f\circτ$ smooth enough to fully benefit from scrambled net sampling for all $f$ in a class of generalized polynomials. We also find sufficient conditions for the Rosenblatt-Hlawka-Mück transformation in $\mathbb{R}^2$ and for importance sampling to be of bounded variation in the sense of Hardy and Krause.
LGMay 25, 2022
Deletion and Insertion Tests in Regression ModelsNaofumi Hama, Masayoshi Mase, Art B. Owen
A basic task in explainable AI (XAI) is to identify the most important features behind a prediction made by a black box function $f$. The insertion and deletion tests of Petsiuk et al. (2018) can be used to judge the quality of algorithms that rank pixels from most to least important for a classification. Motivated by regression problems we establish a formula for their area under the curve (AUC) criteria in terms of certain main effects and interactions in an anchored decomposition of $f$. We find an expression for the expected value of the AUC under a random ordering of inputs to $f$ and propose an alternative area above a straight line for the regression setting. We use this criterion to compare feature importances computed by integrated gradients (IG) to those computed by Kernel SHAP (KS) as well as LIME, DeepLIFT, vanilla gradient and input$\times$gradient methods. KS has the best overall performance in two datasets we consider but it is very expensive to compute. We find that IG is nearly as good as KS while being much faster. Our comparison problems include some binary inputs that pose a challenge to IG because it must use values between the possible variable levels and so we consider ways to handle binary variables in IG. We show that sorting variables by their Shapley value does not necessarily give the optimal ordering for an insertion-deletion test. It will however do that for monotone functions of additive models, such as logistic regression.
NADec 3, 2018
Effective dimension of some weighted pre-Sobolev spaces with dominating mixed partial derivativesArt B. Owen
This paper considers two notions of effective dimension for quadrature in weighted pre-Sobolev spaces with dominating mixed partial derivatives. We begin by finding a ball in those spaces just barely large enough to contain a function with unit variance. If no function in that ball has more than $\varepsilon$ of its variance from ANOVA components involving interactions of order $s$ or more, then the space has effective dimension at most $s$ in the superposition sense. A similar truncation sense notion replaces the cardinality of the ANOVA component by the largest index it contains. Some Poincaré type inequalities are used to bound variance components by multiples of these space's squared norm and those in turn provide bounds on effective dimension. Very low effective dimension in the superposition sense holds for some spaces defined by product weights in which quadrature is strongly tractable. The superposition dimension is $O( \log(1/\varepsilon)/\log(\log(1/\varepsilon)))$ just like the superposition dimension used in the multidimensional decomposition method. Surprisingly, even spaces where all subset weights are equal, regardless of their cardinality or included indices, have low superposition dimension in this sense. This paper does not require periodicity of the integrands.
LGMay 31, 2022
Variable importance without impossible dataMasayoshi Mase, Art B. Owen, Benjamin B. Seiler
The most popular methods for measuring importance of the variables in a black box prediction algorithm make use of synthetic inputs that combine predictor variables from multiple subjects. These inputs can be unlikely, physically impossible, or even logically impossible. As a result, the predictions for such cases can be based on data very unlike any the black box was trained on. We think that users cannot trust an explanation of the decision of a prediction algorithm when the explanation uses such values. Instead we advocate a method called Cohort Shapley that is grounded in economic game theory and unlike most other game theoretic methods, it uses only actually observed data to quantify variable importance. Cohort Shapley works by narrowing the cohort of subjects judged to be similar to a target subject on one or more features. We illustrate it on an algorithmic fairness problem where it is essential to attribute importance to protected variables that the model was not trained on.
76.4NAMay 13
Walk on spheres and Array-RQMCValerie N. P. Ho, Art B. Owen
We use Array-RQMC sampling in a walk on spheres (WOS) algorithm for Dirichlet boundary value problems. On a collection of problems, we find that Array-RQMC-WOS reduces the Monte Carlo variance by factors ranging from $57$-fold to $2290$-fold at $n=2^{17}$ trajectories. The variance is known to be $o(1/n)$ but attains empirical rates between $n^{-1.4}$ and $n^{-1.8}$ in our examples. A simpler RQMC-WOS algorithm studied in Ho and Owen (2026) has more theoretical support but only reduced variance by 1.8 to 10.7-fold on the same set of examples. In order to explain this improvement, we introduce a column-wise mean dimension of the RQMC error based on Sobol' indices. It matches the usual mean dimension for Monte Carlo and the mean dimension of a dual lattice error for randomized lattices. We find for a gasket example from Crane et al.\ (2025) that the mean dimension of Array-RQMC-WOS errors is much higher than an analogous Array-MC-WOS algorithm has.
43.5NAMay 8
Randomized quasi-Monte Carlo for walk on spheresValerie N. P. Ho, Art B. Owen
We investigate the use of randomized quasi-Monte Carlo (RQMC) in walk on spheres algorithms to solve boundary value problems for functions with Dirichlet boundary conditions in $\mathbb{R}^d$. For harmonic functions with $d=2$, the integrands of interest are periodic indicator functions over regions $Θ$ in the torus $\mathbb{T}^k$. We give conditions for $\partialΘ$ to have $k-1$ dimensional Minkowski content which allows us to use results of He and Wang (2015). The RQMC estimates involve multiple values of $k$. We see sampling variances decreasing with the number $n$ of sample points at slightly better than Monte Carlo rates. The median variance rate in $4$ RQMC methods over $5$ worked examples, including some with $d=3$ and some with nonzero source functions, was slightly better than $O(n^{-1.1})$. The variance reduction factors ranged from $1.8$ to $10.7$ at $n=2^{17}$. None of the four RQMC methods dominated the others.
LGJul 27, 2021
Probing neural networks with t-SNE, class-specific projections and a guided tourChristopher R. Hoyt, Art B. Owen
We use graphical methods to probe neural nets that classify images. Plots of t-SNE outputs at successive layers in a network reveal increasingly organized arrangement of the data points. They can also reveal how a network can diminish or even forget about within-class structure as the data proceeds through layers. We use class-specific analogues of principal components to visualize how succeeding layers separate the classes. These allow us to sort images from a given class from most typical to least typical (in the data) and they also serve as very useful projection coordinates for data visualization. We find them especially useful when defining versions guided tours for animated data visualization.
APMay 17, 2021
What makes you unique?Benjamin B. Seiler, Masayoshi Mase, Art B. Owen
This paper proposes a uniqueness Shapley measure to compare the extent to which different variables are able to identify a subject. Revealing the value of a variable on subject $t$ shrinks the set of possible subjects that $t$ could be. The extent of the shrinkage depends on which other variables have also been revealed. We use Shapley value to combine all of the reductions in log cardinality due to revealing a variable after some subset of the other variables has been revealed. This uniqueness Shapley measure can be aggregated over subjects where it becomes a weighted sum of conditional entropies. Aggregation over subsets of subjects can address questions like how identifying is age for people of a given zip code. Such aggregates have a corresponding expression in terms of cross entropies. We use uniqueness Shapley to investigate the differential effects of revealing variables from the North Carolina voter registration rolls and in identifying anomalous solar flares. An enormous speedup (approaching 2000 fold in one example) is obtained by using the all dimension trees of Moore and Lee (1998) to store the cardinalities we need.
LGMay 15, 2021
Cohort Shapley value for algorithmic fairnessMasayoshi Mase, Art B. Owen, Benjamin B. Seiler
Cohort Shapley value is a model-free method of variable importance grounded in game theory that does not use any unobserved and potentially impossible feature combinations. We use it to evaluate algorithmic fairness, using the well known COMPAS recidivism data as our example. This approach allows one to identify for each individual in a data set the extent to which they were adversely or beneficially affected by their value of a protected attribute such as their race. The method can do this even if race was not one of the original predictors and even if it does not have access to a proprietary algorithm that has made the predictions. The grounding in game theory lets us define aggregate variable importance for a data set consistently with its per subject definitions. We can investigate variable importance for multiple quantities of interest in the fairness literature including false positive predictions.
LGApr 7, 2021
Quasi-Newton Quasi-Monte Carlo for variational BayesSifan Liu, Art B. Owen
Many machine learning problems optimize an objective that must be measured with noise. The primary method is a first order stochastic gradient descent using one or more Monte Carlo (MC) samples at each step. There are settings where ill-conditioning makes second order methods such as L-BFGS more effective. We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. When MC sampling has a root mean squared error (RMSE) of $O(n^{-1/2})$ then RQMC has an RMSE of $o(n^{-1/2})$ that can be close to $O(n^{-3/2})$ in favorable settings. We prove that improved sampling accuracy translates directly to improved optimization. In our empirical investigations for variational Bayes, using RQMC with stochastic L-BFGS greatly speeds up the optimization, and sometimes finds a better parameter value than MC does.
MEJul 2, 2020
Efficient estimation of the ANOVA mean dimension, with an application to neural net classificationChristopher Hoyt, Art B. Owen
The mean dimension of a black box function of $d$ variables is a convenient way to summarize the extent to which it is dominated by high or low order interactions. It is expressed in terms of $2^d-1$ variance components but it can be written as the sum of $d$ Sobol' indices that can be estimated by leave one out methods. We compare the variance of these leave one out methods: a Gibbs sampler called winding stairs, a radial sampler that changes each variable one at a time from a baseline, and a naive sampler that never reuses function evaluations and so costs about double the other methods. For an additive function the radial and winding stairs are most efficient. For a multiplicative function the naive method can easily be most efficient if the factors have high kurtosis. As an illustration we consider the mean dimension of a neural network classifier of digits from the MNIST data set. The classifier is a function of $784$ pixels. For that problem, winding stairs is the best algorithm. We find that inputs to the final softmax layer have mean dimensions ranging from $1.35$ to $2.0$.
LGNov 1, 2019
Explaining black box decisions by Shapley cohort refinementMasayoshi Mase, Art B. Owen, Benjamin Seiler
We introduce a variable importance measure to quantify the impact of individual input variables to a black box function. Our measure is based on the Shapley value from cooperative game theory. Many measures of variable importance operate by changing some predictor values with others held fixed, potentially creating unlikely or even logically impossible combinations. Our cohort Shapley measure uses only observed data points. Instead of changing the value of a predictor we include or exclude subjects similar to the target subject on that predictor to form a similarity cohort. Then we apply Shapley value to the cohort averages. We connect variable importance measures from explainable AI to function decompositions from global sensitivity analysis. We introduce a squared cohort Shapley value that splits previously studied Shapley effects over subjects, consistent with a Shapley axiom.
COOct 27, 2015
Statistically efficient thinning of a Markov chain samplerArt B. Owen
It is common to subsample Markov chain output to reduce the storage burden. Geyer (1992) shows that discarding $k-1$ out of every $k$ observations will not improve statistical efficiency, as quantified through variance in a given computational budget. That observation is often taken to mean that thinning MCMC output cannot improve statistical efficiency. Here we suppose that it costs one unit of time to advance a Markov chain and then $θ>0$ units of time to compute a sampled quantity of interest. For a thinned process, that cost $θ$ is incurred less often, so it can be advanced through more stages. Here we provide examples to show that thinning will improve statistical efficiency if $θ$ is large and the sample autocorrelations decay slowly enough. If the lag $\ell\ge1$ autocorrelations of a scalar measurement satisfy $ρ_\ell\geρ_{\ell+1}\ge0$, then there is always a $θ<\infty$ at which thinning becomes more efficient for averages of that scalar. Many sample autocorrelation functions resemble first order AR(1) processes with $ρ_\ell =ρ^{|\ell|}$ for some $-1<ρ<1$. For an AR(1) process it is possible to compute the most efficient subsampling frequency $k$. The optimal $k$ grows rapidly as $ρ$ increases towards $1$. The resulting efficiency gain depends primarily on $θ$, not $ρ$. Taking $k=1$ (no thinning) is optimal when $ρ\le0$. For $ρ>0$ it is optimal if and only if $θ\le (1-ρ)^2/(2ρ)$. This efficiency gain never exceeds $1+θ$. This paper also gives efficiency bounds for autocorrelations bounded between those of two AR(1) processes.
CONov 14, 2014
Optimal mixture weights in multiple importance samplingHera Y. He, Art B. Owen
In multiple importance sampling we combine samples from a finite list of proposal distributions. When those proposal distributions are used to create control variates, it is possible (Owen and Zhou, 2000) to bound the ratio of the resulting variance to that of the unknown best proposal distribution in our list. The minimax regret arises by taking a uniform mixture of proposals, but that is conservative when there are many components. In this paper we optimize the mixture component sampling rates to gain further efficiency. We show that the sampling variance of mixture importance sampling with control variates is jointly convex in the mixture probabilities and control variate regression coefficients. We also give a sequential importance sampling algorithm to estimate the optimal mixture from the sample data.