ST LG ML

Feb 24, 2015

On the consistency theory of high dimensional variable screening

Xiangyu Wang, Chenlei Leng, David B. Dunson

arXiv:1502.06895v33.39 citations

Originality: Incremental advance

AI Analysis

This work provides theoretical guarantees for variable screening in high-dimensional statistics, which is crucial for efficient feature selection in big data applications, though it is incremental as it builds on existing linear screening methods.

The paper tackles the problem of ensuring variable screening methods correctly retain important features in high-dimensional settings where the number of variables p is much larger than the sample size n. It establishes that the restricted diagonally dominant (RDD) condition is necessary and sufficient for strong screening consistency, with examples showing methods like SIS and HOLP achieve this with high probability under specific sample size conditions.

Variable screening is a fast dimension reduction technique for assisting high dimensional feature selection. As a preselection method, it selects a moderate size subset of candidate variables for further refining via feature selection to produce the final model. The performance of variable screening depends on both computational efficiency and the ability to dramatically reduce the number of variables without discarding the important ones. When the data dimension $p$ is substantially larger than the sample size $n$, variable screening becomes crucial because 1) faster feature selection algorithms are needed; 2) conditions guaranteeing selection consistency might fail to hold. This article studies a class of linear screening methods and establishes consistency theory for this special class. In particular, we prove that the restricted diagonally dominant (RDD) condition is a necessary and sufficient condition for strong screening consistency. As concrete examples, we show that two screening methods, SIS and HOLP, are both strong screening consistent (subject to additional constraints) with large probability if $n > O((\rho s + \sigma/\tau)^2 \log p)$ under random designs. In addition, we relate the RDD condition to the irrepresentable condition, and highlight limitations of SIS.
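To make the two named screening methods concrete, here is a minimal NumPy sketch, not the authors' code. It assumes the standard definitions from the literature: SIS ranks variables by the marginal statistic $X^\top y$, and HOLP ranks them by $X^\top (X X^\top)^{-1} y$. The function names, the simulated data, and the submodel size `d = n // 2` are illustrative assumptions and do not come from the abstract.

```python
import numpy as np


def sis_scores(X, y):
    """SIS scores: marginal statistics X^T y, one per variable."""
    return X.T @ y


def holp_scores(X, y):
    """HOLP scores: X^T (X X^T)^{-1} y.

    Assumes p >= n and that X X^T is invertible (X has full row rank).
    """
    return X.T @ np.linalg.solve(X @ X.T, y)


def screen(scores, d):
    """Return the sorted indices of the d variables with largest |score|."""
    top = np.argsort(-np.abs(scores))[:d]
    return np.sort(top)


# Simulated random design (illustrative values only).
rng = np.random.default_rng(0)
n, p, s = 100, 1000, 5           # sample size, dimension, number of true signals
X = rng.standard_normal((n, p))  # left uncentered so that X X^T stays invertible
beta = np.zeros(p)
beta[:s] = 3.0                   # the first s variables carry the signal
y = X @ beta + rng.standard_normal(n)

d = n // 2                       # hypothetical submodel size kept after screening
kept_sis = screen(sis_scores(X, y), d)
kept_holp = screen(holp_scores(X, y), d)

# Count how many of the s true variables each method retains.
print("SIS retained:", np.intersect1d(kept_sis, np.arange(s)).size, "of", s)
print("HOLP retained:", np.intersect1d(kept_holp, np.arange(s)).size, "of", s)
```

Both methods have the form $Ay$ for a $p \times n$ matrix $A$ built from the design, which is what makes them linear screening methods. The design is deliberately not column-centered here, because centering makes $X X^\top$ singular and the linear solve in `holp_scores` would then fail.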
