MLSep 12, 2023
A Consistent and Scalable Algorithm for Best Subset Selection in Single Index ModelsBorui Tang, Jin Zhu, Junxian Zhu et al.
Analysis of high-dimensional data has led to increased interest in both single index models (SIMs) and the best-subset selection. SIMs provide an interpretable and flexible modeling framework for high-dimensional data, while the best-subset selection aims to find a sparse model from a large set of predictors. However, the best-subset selection in high-dimensional models is known to be computationally intractable. Existing proxy algorithms are appealing but do not yield the bestsubset solution. In this paper, we directly tackle the intractability by proposing a provably scalable algorithm for the best-subset selection in high-dimensional SIMs. We directly proved the subset selection consistency and oracle property for our algorithmic solution, distinguishing it from other state-of-the-art support recovery methods in SIMs. The algorithm comprises a generalized information criterion to determine the support size of the regression coefficients, eliminating the model selection tuning. Moreover, our method does not assume an error distribution or a specific link function and hence is flexible to apply. Extensive simulation results demonstrate that our method is not only computationally efficient but also able to exactly recover the best subset in various settings (e.g., linear regression, Poisson regression, heteroscedastic models).
MENov 29, 2022
Simultaneous Best Subset Selection and Dimension Reduction via Primal-Dual IterationsCanhong Wen, Ruipeng Dong, Xueqin Wang et al.
Sparse reduced rank regression is an essential statistical learning method. In the contemporary literature, estimation is typically formulated as a nonconvex optimization that often yields to a local optimum in numerical computation. Yet, their theoretical analysis is always centered on the global optimum, resulting in a discrepancy between the statistical guarantee and the numerical computation. In this research, we offer a new algorithm to address the problem and establish an almost optimal rate for the algorithmic solution. We also demonstrate that the algorithm achieves the estimation with a polynomial number of iterations. In addition, we present a generalized information criterion to simultaneously ensure the consistency of support set recovery and rank estimation. Under the proposed criterion, we show that our algorithm can achieve the oracle reduced rank estimation with a significant probability. The numerical studies and an application in the ovarian cancer genetic data demonstrate the effectiveness and scalability of our approach.
MENov 28, 2020
Optimal and Safe Estimation for High-Dimensional Semi-Supervised LearningSiyi Deng, Yang Ning, Jiwei Zhao et al.
We consider the estimation problem in high-dimensional semi-supervised learning. Our goal is to investigate when and how the unlabeled data can be exploited to improve the estimation of the regression parameters of linear model in light of the fact that such linear models may be misspecified in data analysis. We first establish the minimax lower bound for parameter estimation in the semi-supervised setting, and show that this lower bound cannot be achieved by supervised estimators using the labeled data only. We propose an optimal semi-supervised estimator that can attain this lower bound and therefore improves the supervised estimators, provided that the conditional mean function can be consistently estimated with a proper rate. We further propose a safe semi-supervised estimator. We view it safe, because this estimator is always at least as good as the supervised estimators. We also extend our idea to the aggregation of multiple semi-supervised estimators caused by different misspecifications of the conditional mean function. Extensive numerical simulations and a real data analysis are conducted to illustrate our theoretical results.