LGCCDSMLMay 7, 2025

Feature Selection and Junta Testing are Statistically Equivalent

arXiv:2505.04604v21 citationsh-index: 1
Originality Incremental advance
AI Analysis

This provides a theoretical foundation for feature selection in machine learning, though it is incremental as it formalizes equivalence between known problems.

The paper tackles the problem of determining whether a function depends on only k variables (junta testing) and identifying those variables (feature selection), proving that these tasks are statistically equivalent. It shows that a brute-force algorithm is sample-optimal for both, with an optimal sample size of Θ(1/ε (√(2^k log C(n,k)) + log C(n,k))).

For a function $f \colon \{0,1\}^n \to \{0,1\}$, the junta testing problem asks whether $f$ depends on only $k$ variables. If $f$ depends on only $k$ variables, the feature selection problem asks to find those variables. We prove that these two tasks are statistically equivalent. Specifically, we show that the ``brute-force'' algorithm, which checks for any set of $k$ variables consistent with the sample, is simultaneously sample-optimal for both problems, and the optimal sample size is \[ Θ\left(\frac 1 \varepsilon \left( \sqrt{2^k \log {n \choose k}} + \log {n \choose k}\right)\right). \]

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes