Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality
This work addresses the scalability bottleneck in sparse PCA for researchers and practitioners who need interpretable dimensionality reduction in fields such as finance and medicine, offering an exact method for moderate problem sizes and a faster approximate method with small bound gaps at larger scale.
The authors scale sparse PCA to certifiable optimality: a cutting-plane method finds exact solutions when selecting 5 covariates from 300 variables, and a convex relaxation with greedy rounding gives bound gaps of 1-2% on larger datasets with up to thousands of variables.
Sparse principal component analysis (PCA) is a popular dimensionality reduction technique for obtaining principal components which are linear combinations of a small subset of the original features. Existing approaches cannot supply certifiably optimal principal components with more than $p=100$s of variables. By reformulating sparse PCA as a convex mixed-integer semidefinite optimization problem, we design a cutting-plane method which solves the problem to certifiable optimality at the scale of selecting $k=5$ covariates from $p=300$ variables, and provides small bound gaps at a larger scale. We also propose a convex relaxation and greedy rounding scheme that provides bound gaps of $1$-$2\%$ in practice within minutes for $p=100$s or hours for $p=1{,}000$s and is therefore a viable alternative to the exact method at scale. Using real-world financial and medical datasets, we illustrate our approach's ability to derive interpretable principal components tractably at scale.
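To make the notion of a bound gap concrete, the following is a minimal sketch and not the authors' cutting-plane method or their convex relaxation. It assumes the standard sparse PCA formulation, maximizing $x^\top \Sigma x$ subject to $\|x\|_2 = 1$ and $\|x\|_0 \le k$, for which the best value on a fixed support is the largest eigenvalue of the corresponding principal submatrix of $\Sigma$. A simple forward-selection heuristic stands in for the rounding step, and the largest eigenvalue of the full matrix stands in for the relaxation bound (a much weaker bound than a semidefinite relaxation would give). All function names and the synthetic data are illustrative assumptions.

```python
import itertools
import numpy as np


def support_value(sigma, support):
    """Largest eigenvalue of the principal submatrix of sigma indexed by support."""
    idx = np.array(sorted(support))
    sub = sigma[np.ix_(idx, idx)]
    # eigvalsh returns eigenvalues in ascending order for symmetric matrices.
    return float(np.linalg.eigvalsh(sub)[-1])


def greedy_support(sigma, k):
    """Forward selection: repeatedly add the index that most increases the objective."""
    p = sigma.shape[0]
    support = []
    for _ in range(k):
        candidates = [j for j in range(p) if j not in support]
        best = max(candidates, key=lambda j: support_value(sigma, support + [j]))
        support.append(best)
    return sorted(support)


def exhaustive_support(sigma, k):
    """Brute-force optimum over all supports of size k; only viable for tiny p."""
    p = sigma.shape[0]
    best = max(itertools.combinations(range(p), k),
               key=lambda s: support_value(sigma, s))
    return list(best)


# Synthetic data (an assumption for illustration): 50 samples, 8 features.
rng = np.random.default_rng(0)
data = rng.standard_normal((50, 8))
sigma = np.cov(data, rowvar=False)
k = 3

greedy = greedy_support(sigma, k)
exact = exhaustive_support(sigma, k)

lower = support_value(sigma, greedy)          # feasible solution: lower bound
optimum = support_value(sigma, exact)         # true optimum on this tiny instance
upper = float(np.linalg.eigvalsh(sigma)[-1])  # trivial upper bound (no sparsity)

gap = (upper - lower) / upper                 # relative bound gap
print(f"lower={lower:.4f} optimum={optimum:.4f} upper={upper:.4f} gap={gap:.2%}")
```

The true optimum always lies between the lower and upper bounds, so a small gap certifies that the heuristic solution is near-optimal without enumerating supports. Brute-force enumeration is included only to check this on a tiny instance; it is exactly what becomes intractable as $p$ grows.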