Statistical Inference in Classification of High-dimensional Gaussian Mixture
This work addresses statistical inference in high-dimensional classification. Aimed at researchers in machine learning and statistics, it offers incremental theoretical insights and a practical tool for variable selection.
The paper tackles the classification of high-dimensional Gaussian mixtures with general covariances. It analyzes regularized convex classifiers using the replica method, derives asymptotic results for the generalization error and variable selection, and validates the findings with computational experiments on L1-regularized logistic regression.
We consider the classification problem of a high-dimensional mixture of two Gaussians with general covariance matrices. Using the replica method from statistical physics, we investigate the asymptotic behavior of a general class of regularized convex classifiers in the high-dimensional limit, where both the sample size $n$ and the dimension $p$ approach infinity while their ratio $\alpha = n/p$ remains fixed. Our focus is on the generalization error and the variable selection properties of the estimators. Specifically, based on the distributional limit of the classifier, we construct a de-biased estimator and use it to perform variable selection through an appropriate hypothesis testing procedure. Using $L_1$-regularized logistic regression as an example, we conduct extensive computational experiments and confirm that our analytical findings are consistent with numerical simulations of finite-size systems. We also explore the influence of the covariance structure on the performance of the de-biased estimator.
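To make the finite-size setting concrete, the following is a minimal simulation sketch, not the paper's code. It draws a two-component Gaussian mixture with a different covariance per class, fits $L_1$-regularized logistic regression, and estimates the generalization error on fresh data. Everything specific here is an assumption chosen for illustration: the symmetric means $\pm\mu$, the AR(1) and identity covariances, the values of $p$, $\alpha$ and the penalty `lam`, the absence of an intercept, and the use of plain proximal gradient descent (ISTA) as the solver. The de-biased estimator and the hypothesis test from the paper are not reproduced.

```python
import numpy as np

def sample_mixture(n, p, mu, cov_pos, cov_neg, rng):
    """Draw n labelled points: y=+1 ~ N(+mu, cov_pos), y=-1 ~ N(-mu, cov_neg)."""
    y = rng.choice([-1.0, 1.0], size=n)
    chol_pos = np.linalg.cholesky(cov_pos)
    chol_neg = np.linalg.cholesky(cov_neg)
    z = rng.standard_normal((n, p))
    x = np.where(y[:, None] > 0, mu + z @ chol_pos.T, -mu + z @ chol_neg.T)
    return x, y

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_logistic(x, y, lam, n_iter=500):
    """ISTA for mean logistic loss + lam * ||theta||_1 (no intercept)."""
    n, p = x.shape
    # The gradient of the mean logistic loss is Lipschitz with constant ||X||_2^2 / (4n).
    step = 4.0 * n / np.linalg.norm(x, 2) ** 2
    theta = np.zeros(p)
    for _ in range(n_iter):
        margins = y * (x @ theta)
        # sigmoid(-m) written via tanh for numerical stability
        weights = 0.5 * (1.0 - np.tanh(0.5 * margins))
        grad = -(x.T @ (y * weights)) / n
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(0)
p, alpha, lam = 200, 2.0, 0.05        # illustrative values, not from the paper
n = int(alpha * p)

mu = np.zeros(p)
mu[:10] = 0.5                          # only the first 10 variables carry signal
idx = np.arange(p)
cov_pos = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # AR(1) covariance
cov_neg = np.eye(p)                                     # identity covariance

x_train, y_train = sample_mixture(n, p, mu, cov_pos, cov_neg, rng)
theta_hat = l1_logistic(x_train, y_train, lam)

# Monte Carlo estimate of the generalization error on fresh samples
x_test, y_test = sample_mixture(2000, p, mu, cov_pos, cov_neg, rng)
test_error = float(np.mean(np.sign(x_test @ theta_hat) != y_test))
n_selected = int(np.count_nonzero(theta_hat))

print(f"test error: {test_error:.3f}, nonzero coefficients: {n_selected} of {p}")
```

Repeating the run for growing $p$ at fixed $\alpha$ is one way to watch finite-size estimates settle toward an asymptotic value, which is the kind of comparison the abstract describes.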