LG AI ML | Nov 15, 2022

Model free variable importance for high dimensional data

Naofumi Hama, Masayoshi Mase, Art B. Owen

arXiv:2211.08414v24.64 citations | h-index: 53 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for scalable variable importance analysis in high-dimensional settings, particularly when the prediction function is proprietary or expensive to evaluate. The advance is incremental: it builds on existing Shapley and integrated gradient methods.

The authors address the problem of efficiently computing model-free variable importance for high-dimensional data. They introduce IGCS, an integrated gradient version of cohort Shapley with cost $\mathcal{O}(nd)$, whereas cohort Shapley has cost exponential in the input dimension. On a high-energy physics problem, IGCS attains nearly the same area between curves (ABC) values as cohort Shapley. On a 1024-variable computational chemistry problem, it attains much higher ABCs than Monte Carlo sampling.

A model-agnostic variable importance method can be used with arbitrary prediction functions. Here we present some model-free methods that do not require access to the prediction function. This is useful when that function is proprietary and not available, or just extremely expensive. It is also useful when studying residuals from a model. The cohort Shapley (CS) method is model-free but has exponential cost in the dimension of the input space. A supervised on-manifold Shapley method from Frye et al. (2020) is also model-free but requires as input a second black box model that has to be trained for the Shapley value problem. We introduce an integrated gradient (IG) version of cohort Shapley, called IGCS, with cost $\mathcal{O}(nd)$. We show that, over the vast majority of the relevant unit cube, the IGCS value function is close to a multilinear function for which IGCS matches CS. Another benefit of IGCS is that it allows IG methods to be used with binary predictors. We use some area between curves (ABC) measures to quantify the performance of IGCS. On a problem from high energy physics we verify that IGCS has nearly the same ABCs as CS does. We also use it on a problem from computational chemistry in 1024 variables. We see there that IGCS attains much higher ABCs than we get from Monte Carlo sampling. The code is publicly available at https://github.com/cohortshapley/cohortintgrad
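To make the cost contrast in the abstract concrete, here is a minimal Python sketch. It is not the code from the repository linked above. It rests on two assumptions that the abstract does not spell out. First, two subjects count as "similar" on a feature when their values match exactly. Second, the cohort value function is relaxed to the unit cube with soft weights $w_i(z) = \prod_j \bigl(1 - z_j (1 - s_{ij})\bigr)$, where $s_{ij} = 1$ if subject $i$ matches the target on feature $j$. The function names `cohort_mean`, `cohort_shapley`, and `soft_cohort_ig` are hypothetical. Only the data `(X, y)` are used, never a prediction function.

```python
from itertools import combinations
from math import factorial


def cohort_mean(X, y, t, u):
    """Mean response over subjects matching target t on every feature in u."""
    members = [i for i in range(len(X)) if all(X[i][j] == X[t][j] for j in u)]
    return sum(y[i] for i in members) / len(members)  # t itself is always a member


def cohort_shapley(X, y, t):
    """Exact Shapley values of the cohort value function (exponential in d)."""
    d = len(X[0])
    phi = [0.0] * d
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for u in combinations(others, size):
                gain = cohort_mean(X, y, t, u + (j,)) - cohort_mean(X, y, t, u)
                phi[j] += weight * gain
    return phi


def soft_cohort_ig(X, y, t, steps=200):
    """Integrated gradients along the diagonal z = (a, ..., a), 0 <= a <= 1.

    ASSUMED relaxation (not confirmed by the abstract):
    v(z) = sum_i w_i(z) y_i / sum_i w_i(z), w_i(z) = prod_j (1 - z_j (1 - s_ij)).
    Cost is O(steps * n * d).
    """
    n, d = len(X), len(X[0])
    s = [[X[i][j] == X[t][j] for j in range(d)] for i in range(n)]
    phi = [0.0] * d
    for m in range(steps):
        a = (m + 0.5) / steps  # midpoint rule, so a < 1 and 1 - a > 0
        w = [(1.0 - a) ** sum(1 for j in range(d) if not s[i][j]) for i in range(n)]
        total = sum(w)
        v = sum(wi * yi for wi, yi in zip(w, y)) / total
        for i in range(n):
            for j in range(d):
                if not s[i][j]:
                    # dw_i/dz_j = -w_i / (1 - a) when feature j mismatches
                    phi[j] += (-w[i] / (1.0 - a)) * (y[i] - v) / total / steps
    return phi


# Toy data: four subjects, two binary features; the target is the last row.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0.0, 1.0, 2.0, 4.0]
print(cohort_shapley(X, y, t=3))
print(soft_cohort_ig(X, y, t=3))
```

On this toy data the exact cohort Shapley values are 1.375 and 0.875, which sum to the gap between the target's own response (4) and the global mean (1.75). The integrated gradient attributions sum to approximately the same gap, because the relaxed value function agrees with the cohort means at the corners of the unit cube. The exact version visits every subset of features, while the diagonal integration needs only one pass over the `n * d` similarity indicators per quadrature step.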
