LG ME
Jun 20, 2023

A Model-free Closeness-of-influence Test for Features in Supervised Learning

arXiv:2306.11855v13.81 citations
h-index: 37

Originality: Incremental advance

AI Analysis

The paper provides a tool for feature analysis in high-dimensional data, but the advance is incremental, since it builds on existing frameworks such as datamodels.

The paper tackles the problem of assessing whether two features have similar influence on a response variable in supervised learning, proposing a model-free statistical test that controls type I error and achieves full power under certain conditions, with validation on CIFAR-10 using datamodels.

Understanding the effect of a feature vector $x \in \mathbb{R}^d$ on the response value (label) $y \in \mathbb{R}$ is the cornerstone of many statistical learning problems. Ideally, one would like to understand how a set of collected features combine and influence the response value, but this problem is notoriously difficult, owing to the high dimensionality of the data and the limited number of labeled data points, among other reasons. In this work, we take a new perspective on this problem and study the question of assessing the difference in influence that two given features have on the response value. We first propose a notion of closeness for the influence of features, and show that our definition recovers the familiar notion of the magnitude of coefficients in the parametric model. We then propose a novel method to test for closeness of influence in general model-free supervised learning problems. Our proposed test can be used with a finite number of samples and controls the type I error rate, regardless of the ground-truth conditional law $\mathcal{L}(Y \mid X)$. We analyze the power of our test for two general learning problems, (i) linear regression and (ii) binary classification under a mixture of Gaussians model, and show that, under a proper choice of the score function (an internal component of our test) and with a sufficient number of samples, the test achieves full statistical power. We evaluate our findings through extensive numerical simulations. Specifically, we adopt the datamodel framework (Ilyas et al., 2022) for the CIFAR-10 dataset to identify pairs of training samples with different influence on the trained model via optional black box training mechanisms.
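The abstract does not spell out the test itself, so the following is only a minimal sketch of one way a model-free, finite-sample test of this flavor could look; it is not the paper's procedure. It assumes a simplified null hypothesis under which swapping features `i` and `j` leaves the joint law of $(X, Y)$ unchanged, and a score function fitted on a held-out split so that it is independent of the test samples. Under those assumptions the sign of the score difference is symmetric, and an exact binomial sign test controls the type I error. All names here (`swap_sign_test`, `make_score`, `demo`) are hypothetical.

```python
import math
import numpy as np


def swap_columns(X, i, j):
    """Return a copy of X with feature columns i and j exchanged."""
    Xs = X.copy()
    Xs[:, [i, j]] = X[:, [j, i]]
    return Xs


def binom_two_sided_pvalue(k, n):
    """Exact two-sided p-value for k successes under Binomial(n, 1/2)."""
    lower = sum(math.comb(n, r) for r in range(0, k + 1)) / 2**n
    upper = sum(math.comb(n, r) for r in range(k, n + 1)) / 2**n
    return min(1.0, 2 * min(lower, upper))


def swap_sign_test(X, y, i, j, score):
    """Sign test on score(X, y) - score(X with i, j swapped, y).

    Assumption (not from the source): under the null, swapping
    features i and j does not change the joint law of (X, y), so each
    nonzero difference is positive with probability 1/2.
    """
    diff = score(X, y) - score(swap_columns(X, i, j), y)
    diff = diff[diff != 0]  # ties carry no sign information
    n = diff.size
    if n == 0:
        return 1.0
    k = int(np.sum(diff > 0))
    return binom_two_sided_pvalue(k, n)


def make_score(X_train, y_train):
    """Illustrative score: negative squared residual of a least-squares fit."""
    beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return lambda X, y: -(y - X @ beta_hat) ** 2


def demo(beta, seed=0, n=2000, noise=1.0):
    """Simulate a linear model and test features 0 and 1 on a held-out half."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, len(beta)))
    y = X @ np.asarray(beta) + noise * rng.standard_normal(n)
    half = n // 2
    score = make_score(X[:half], y[:half])  # score is fit on the first half only
    return swap_sign_test(X[half:], y[half:], 0, 1, score)


# Equal coefficients: the two features influence y symmetrically.
p_null = demo([1.0, 1.0, 0.0, 0.0, 0.0])
# Very different coefficients: the swap should hurt the score noticeably.
p_alt = demo([3.0, 0.0, 0.0, 0.0, 0.0])
print(f"p-value (equal influence):     {p_null:.3f}")
print(f"p-value (different influence): {p_alt:.3e}")
```

In this toy linear model the two features have equal influence exactly when their coefficients match, which mirrors the abstract's remark that the closeness notion reduces to coefficient magnitudes in the parametric case. The choice of score only affects power in this sketch; the validity of the sign test rests on the swap-invariance assumption and on the score being fitted independently of the test samples.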
