Laurent El Ghaoui

LG
h-index55
26papers
4,879citations
Novelty52%
AI Score53

26 Papers

LGMay 18, 2011
Safe Feature Elimination for the LASSO and Sparse Supervised Learning Problems

Laurent El Ghaoui, Vivian Viallon, Tarek Rabbani

We describe a fast method to eliminate features (variables) in l1 -penalized least-square regression (or LASSO) problems. The elimination of features leads to a potentially substantial reduction in running time, specially for large values of the penalty parameter. Our method is not heuristic: it only eliminates features that are guaranteed to be absent after solving the LASSO problem. The feature elimination step is easy to parallelize and can test each feature for elimination independently. Moreover, the computational effort of our method is negligible compared to that of solving the LASSO problem - roughly it is the same as single gradient step. Our method extends the scope of existing LASSO algorithms to treat larger data sets, previously out of their reach. We show how our method can be extended to general l1 -penalized convex problems and present preliminary results for the Sparse Support Vector Machine and Logistic Regression problems.

81.5SYApr 17
A Digital Twin Framework for Decision-Support and Optimization of EV Charging Infrastructure in Localized Urban Systems

Bui Khanh Linh Do, Thanh H. Nguyen, Nghi Huynh Quang et al.

As Electric Vehicle (EV) adoption accelerates in urban environments, optimizing charging infrastructure is vital for balancing user satisfaction, energy efficiency, and financial viability. This study advances beyond static models by proposing a digital twin framework that integrates agent-based decision support with embedded optimization to dynamically simulate EV charging behaviors, infrastructure layouts, and policy responses across scenarios. Applied to a localized urban site (a university campus) in Hanoi, Vietnam, the model evaluates operational policies, EV station configurations, and renewable energy sources. The interactive dashboard enables seasonal analysis, revealing a 20% drop in solar efficiency from October to March, with wind power contributing under 5% of demand, highlighting the need for adaptive energy management. Simulations show that dynamic notifications of newly available charging slots improve user satisfaction, while gasoline bans and idle fees enhance slot turnover with minimal added complexity. Embedded metaheuristic optimization identifies near-optimal mixes of fast (30kW) and standard (11kW) solar-powered chargers, balancing energy performance, profitability, and demand with high computational efficiency. This digital twin provides a flexible, computation-driven platform for EV infrastructure planning, with a transferable, modular design that enables seamless scaling from localized to city-wide urban contexts.

LGSep 19, 2022
State-driven Implicit Modeling for Sparsity and Robustness in Neural Networks

Alicia Y. Tsai, Juliette Decugis, Laurent El Ghaoui et al.

Implicit models are a general class of learning models that forgo the hierarchical layer structure typical in neural networks and instead define the internal states based on an ``equilibrium'' equation, offering competitive performance and reduced memory consumption. However, training such models usually relies on expensive implicit differentiation for backward propagation. In this work, we present a new approach to training implicit models, called State-driven Implicit Modeling (SIM), where we constrain the internal states and outputs to match that of a baseline model, circumventing costly backward computations. The training problem becomes convex by construction and can be solved in a parallel fashion, thanks to its decomposable structure. We demonstrate how the SIM approach can be applied to significantly improve sparsity (parameter reduction) and robustness of baseline models trained on FashionMNIST and CIFAR-100 datasets.

91.7SYApr 1
A Multi-Criterion Approach to Smart EV Charging with CO2 Emissions and Cost Minimization

Giuseppe C. Calafiore, Luca Ambrosino, Khai Manh Nguyen et al.

We study carbon-aware smart charging in a fossil-dominated grid by coupling a simplified hydro-thermal-renewable dispatch model with a tractable linear charging scheduler. The case study is informed by Vietnam's regional data. Thermal units remain dominant, renewables are time-varying, and hydropower is modeled through a single reservoir budget. From the day-ahead dispatch we derive hourly carbon intensity and a corresponding carbon-cost signal; these are combined with a local time-of-use tariff in the EV charging problem. The resulting weighted-sum linear program is multi-objective: by sweeping the trade-off coefficient, we recover the supported Pareto frontier between electricity cost and charging-associated emissions. In a 300-EV public-charging scenario with a 0.8 MW feeder cap, the proposed carbon-aware scheduler preserves the 19.8% bill reduction of a cost-only optimizer while lowering charging-associated emissions by 7.3%; a more carbon-focused tuning still remains 12.6% cheaper and 9.3% cleaner than a FIFO baseline. A hydro-sensitivity study shows that changing the reservoir budget by +/- 20% moves the mean grid carbon intensity from 360 to 466 g/kWh, yet the carbon-aware scheduler remains consistently cheaper and cleaner than FIFO. The dispatch and charging LPs solve in few milliseconds on a standard desktop computer, showing that the framework is lightweight enough for repeated day-ahead studies.

LGJul 19, 2024
The Extrapolation Power of Implicit Models

Juliette Decugis, Alicia Y. Tsai, Max Emerling et al.

In this paper, we investigate the extrapolation capabilities of implicit deep learning models in handling unobserved data, where traditional deep neural networks may falter. Implicit models, distinguished by their adaptability in layer depth and incorporation of feedback within their computational graph, are put to the test across various extrapolation scenarios: out-of-distribution, geographical, and temporal shifts. Our experiments consistently demonstrate significant performance advantage with implicit models. Unlike their non-implicit counterparts, which often rely on meticulous architectural design for each task, implicit models demonstrate the ability to learn complex model structures without the need for task-specific design, highlighting their robustness in handling unseen data.

CLAug 19, 2022
Sparse Optimization for Unsupervised Extractive Summarization of Long Documents with the Frank-Wolfe Algorithm

Alicia Y. Tsai, Laurent El Ghaoui · berkeley

We address the problem of unsupervised extractive document summarization, especially for long documents. We model the unsupervised problem as a sparse auto-regression one and approximate the resulting combinatorial problem via a convex, norm-constrained problem. We solve it using a dedicated Frank-Wolfe algorithm. To generate a summary with $k$ sentences, the algorithm only needs to execute $\approx k$ iterations, making it very efficient. We explain how to avoid explicit calculation of the full gradient and how to include sentence embedding information. We evaluate our approach against two other unsupervised methods using both lexical (standard) ROUGE scores, as well as semantic (embedding-based) ones. Our method achieves better results with both datasets and works especially well when combined with embeddings for highly paraphrased summaries.

OCFeb 27, 2020Code
Stochastic Frank-Wolfe for Constrained Finite-Sum Minimization

Geoffrey Négiar, Gideon Dresdner, Alicia Tsai et al.

We propose a novel Stochastic Frank-Wolfe (a.k.a. conditional gradient) algorithm for constrained smooth finite-sum minimization with a generalized linear prediction/structure. This class of problems includes empirical risk minimization with sparse, low-rank, or other structured constraints. The proposed method is simple to implement, does not require step-size tuning, and has a constant per-iteration cost that is independent of the dataset size. Furthermore, as a byproduct of the method we obtain a stochastic estimator of the Frank-Wolfe gap that can be used as a stopping criterion. Depending on the setting, the proposed method matches or improves on the best computational guarantees for Stochastic Frank-Wolfe algorithms. Benchmarks on several datasets highlight different regimes in which the proposed method exhibits a faster empirical convergence than related methods. Finally, we provide an implementation of all considered methods in an open-source package.

42.2AIApr 28
Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

An Nguyen, Hoang Nguyen, Phuong Le et al.

We study city-scale control of electric-vehicle (EV) ride-hailing fleets where dispatch, repositioning, and charging decisions must respect charger and feeder limits under uncertain, spatially correlated demand and travel times. We formulate the problem as a hex-grid semi-Markov decision process (semi-MDP) with mixed actions -- discrete actions for serving, repositioning, and charging, together with continuous charging power -- and variable action durations. To guarantee physical feasibility during both training and deployment, the policy learns over high-level intentions produced by a masked, temperature-annealed actor. These intentions are projected at every decision step through a time-limited rolling mixed-integer linear program (MILP) that strictly enforces state-of-charge, port, and feeder constraints. To mitigate distributional shifts, we optimize a Soft Actor--Critic (SAC) agent against a Wasserstein-1 ambiguity set with a graph-aligned Mahalanobis ground metric that captures spatial correlations. The robust backup uses the Kantorovich--Rubinstein dual, a projected subgradient inner loop, and a primal--dual risk-budget update. Our architecture combines a two-layer Graph Convolutional Network (GCN) encoder, twin critics, and a value network that drives the adversary. Experiments on a large-scale EV fleet simulator built from NYC taxi data show that PD--RSAC achieves the highest net profit, reaching \$1.22M, compared with \$0.58M--\$0.70M for strong heuristic, single-agent RL, and multi-agent RL baselines, including Greedy, SAC, MAPPO, and MADDPG, while maintaining zero feeder-limit violations.

LGMay 2, 2025
Tree-Sliced Wasserstein Distance with Nonlinear Projection

Thanh Tran, Viet-Hoang Tran, Thanh Chu et al.

Tree-Sliced methods have recently emerged as an alternative to the traditional Sliced Wasserstein (SW) distance, replacing one-dimensional lines with tree-based metric spaces and incorporating a splitting mechanism for projecting measures. This approach enhances the ability to capture the topological structures of integration domains in Sliced Optimal Transport while maintaining low computational costs. Building on this foundation, we propose a novel nonlinear projectional framework for the Tree-Sliced Wasserstein (TSW) distance, substituting the linear projections in earlier versions with general projections, while ensuring the injectivity of the associated Radon Transform and preserving the well-definedness of the resulting metric. By designing appropriate projections, we construct efficient metrics for measures on both Euclidean spaces and spheres. Finally, we validate our proposed metric through extensive numerical experiments for Euclidean and spherical datasets. Applications include gradient flows, self-supervised learning, and generative models, where our methods demonstrate significant improvements over recent SW and TSW variants.

SYDec 18, 2021
Data-Driven Reachability analysis and Support set Estimation with Christoffel Functions

Alex Devonport, Forest Yang, Laurent El Ghaoui et al.

We present algorithms for estimating the forward reachable set of a dynamical system using only a finite collection of independent and identically distributed samples. The produced estimate is the sublevel set of a function called an empirical inverse Christoffel function: empirical inverse Christoffel functions are known to provide good approximations to the support of probability distributions. In addition to reachability analysis, the same approach can be applied to general problems of estimating the support of a random variable, which has applications in data science towards detection of novelties and outliers in data sets. In applications where safety is a concern, having a guarantee of accuracy that holds on finite data sets is critical. In this paper, we prove such bounds for our algorithms under the Probably Approximately Correct (PAC) framework. In addition to applying classical Vapnik-Chervonenkis (VC) dimension bound arguments, we apply the PAC-Bayes theorem by leveraging a formal connection between kernelized empirical inverse Christoffel functions and Gaussian process regression models. The bound based on PAC-Bayes applies to a more general class of Christoffel functions than the VC dimension argument, and achieves greater sample efficiency in experiments.

ROSep 24, 2021
Learning-based Initialization Strategy for Safety of Multi-Vehicle Systems

Jennifer C. Shih, Akshara Rai, Laurent El Ghaoui

Multi-vehicle collision avoidance is a highly crucial problem due to the soaring interests of introducing autonomous vehicles into the real world in recent years. The safety of these vehicles while they complete their objectives is of paramount importance. Hamilton-Jacobi (HJ) reachability is a promising tool for guaranteeing safety for low-dimensional systems. However, due to its exponential complexity in computation time, no reachability-based methods have been able to guarantee safety for more than three vehicles successfully in unstructured scenarios. For systems with four or more vehicles,we can only empirically validate their safety performance.While reachability-based safety methods enjoy a flexible least-restrictive control strategy, it is challenging to reason about long-horizon trajectories online because safety at any given state is determined by looking up its safety value in a pre-computed table that does not exhibit favorable properties that continuous functions have. This motivates the problem of improving the safety performance of unstructured multi-vehicle systems when safety cannot be guaranteed given any least-restrictive safety-aware collision avoidance algorithm while avoiding online trajectory optimization. In this paper, we propose a novel approach using supervised learning to enhance the safety of vehicles by proposing new initial states in very close neighborhood of the original initial states of vehicles. Our experiments demonstrate the effectiveness of our proposed approach and show that vehicles are able to get to their goals with better safety performance with our approach compared to a baseline approach in wide-ranging scenarios.

SYSep 8, 2021
Recurrent Neural Network Controllers Synthesis with Stability Guarantees for Partially Observed Systems

Fangda Gu, He Yin, Laurent El Ghaoui et al.

Neural network controllers have become popular in control tasks thanks to their flexibility and expressivity. Stability is a crucial property for safety-critical dynamical systems, while stabilization of partially observed systems, in many cases, requires controllers to retain and process long-term memories of the past. We consider the important class of recurrent neural networks (RNN) as dynamic controllers for nonlinear uncertain partially-observed systems, and derive convex stability conditions based on integral quadratic constraints, S-lemma and sequential convexification. To ensure stability during the learning and control process, we propose a projected policy gradient method that iteratively enforces the stability conditions in the reparametrized space taking advantage of mild additional information on system dynamics. Numerical experiments show that our method learns stabilizing controllers while using fewer samples and achieving higher final performance compared with policy gradient.

ROAug 5, 2021
Reachability-based Safe Planning for Multi-Vehicle Systems withMultiple Targets

Jennifer C. Shih, Laurent El Ghaoui

Recently there have been a lot of interests in introducing UAVs for a wide range of applications, making ensuring safety of multi-vehicle systems a highly crucial problem. Hamilton-Jacobi (HJ) reachability is a promising tool for analyzing safety of vehicles for low-dimensional systems. However, reachability suffers from the curse of dimensionality, making its direct application to more than two vehicles intractable. Recent works have made it tractable to guarantee safety for 3 and 4 vehicles with reachability. However, the number of vehicles safety can be guaranteed for remains small. In this paper, we propose a novel reachability-based approach that guarantees safety for any number of vehicles while vehicles complete their objectives of visiting multiple targets efficiently, given any K-vehicle collision avoidance algorithm where K can in general be a small number. We achieve this by developing an approach to group vehicles into clusters efficiently and a control strategy that guarantees safety for any in-cluster and cross-cluster pair of vehicles for all time. Our proposed method is scalable to large number of vehicles with little computation overhead. We demonstrate our proposed approach with a simulation on 15 vehicles. In addition, we contribute a more general solution to the 3-vehicle collision avoidance problem from a past recent work, show that the prior work is a special case of our proposed generalization, and prove its validity.

CLNov 26, 2020
Text Analytics for Resilience-Enabled Extreme Events Reconnaissance

Alicia Y. Tsai, Selim Gunay, Minjune Hwang et al.

Post-hazard reconnaissance for natural disasters (e.g., earthquakes) is important for understanding the performance of the built environment, speeding up the recovery, enhancing resilience and making informed decisions related to current and future hazards. Natural language processing (NLP) is used in this study for the purposes of increasing the accuracy and efficiency of natural hazard reconnaissance through automation. The study particularly focuses on (1) automated data (news and social media) collection hosted by the Pacific Earthquake Engineering Research (PEER) Center server, (2) automatic generation of reconnaissance reports, and (3) use of social media to extract post-hazard information such as the recovery time. Obtained results are encouraging for further development and wider usage of various NLP methods in natural hazard reconnaissance.

LGSep 14, 2020
Implicit Graph Neural Networks

Fangda Gu, Heng Chang, Wenwu Zhu et al.

Graph Neural Networks (GNNs) are widely used deep learning models that learn meaningful representations from graph-structured data. Due to the finite nature of the underlying recurrent structure, current GNN methods may struggle to capture long-range dependencies in underlying graphs. To overcome this difficulty, we propose a graph learning framework, called Implicit Graph Neural Networks (IGNN), where predictions are based on the solution of a fixed-point equilibrium equation involving implicitly defined "state" vectors. We use the Perron-Frobenius theory to derive sufficient conditions that ensure well-posedness of the framework. Leveraging implicit differentiation, we derive a tractable projected gradient descent method to train the framework. Experiments on a comprehensive range of tasks show that IGNNs consistently capture long-range dependencies and outperform the state-of-the-art GNN models.

LGJun 15, 2020
FANOK: Knockoffs in Linear Time

Armin Askari, Quentin Rebjock, Alexandre d'Aspremont et al.

We describe a series of algorithms that efficiently implement Gaussian model-X knockoffs to control the false discovery rate on large scale feature selection problems. Identifying the knockoff distribution requires solving a large scale semidefinite program for which we derive several efficient methods. One handles generic covariance matrices, has a complexity scaling as $O(p^3)$ where $p$ is the ambient dimension, while another assumes a rank $k$ factor model on the covariance matrix to reduce this complexity bound to $O(pk^2)$. We also derive efficient procedures to both estimate factor models and sample knockoff covariates with complexity linear in the dimension. We test our methods on problems with $p$ as large as $500,000$.

LGAug 17, 2019
Implicit Deep Learning

Laurent El Ghaoui, Fangda Gu, Bertrand Travacca et al.

Implicit deep learning prediction rules generalize the recursive rules of feedforward neural networks. Such rules are based on the solution of a fixed-point equation involving a single vector of hidden features, which is thus only implicitly defined. The implicit framework greatly simplifies the notation of deep learning, and opens up many new possibilities, in terms of novel architectures and algorithms, robustness analysis and design, interpretability, sparsity, and network architecture optimization.

LGMay 23, 2019
Naive Feature Selection: a Nearly Tight Convex Relaxation for Sparse Naive Bayes

Armin Askari, Alexandre d'Aspremont, Laurent El Ghaoui

Due to its linear complexity, naive Bayes classification remains an attractive supervised learning method, especially in very large-scale settings. We propose a sparse version of naive Bayes, which can be used for feature selection. This leads to a combinatorial maximum-likelihood problem, for which we provide an exact solution in the case of binary data, or a bound in the multinomial case. We prove that our convex relaxation bounds becomes tight as the marginal contribution of additional features decreases, using a priori duality gap bounds dervied from the Shapley-Folkman theorem. We show how to produce primal solutions satisfying these bounds. Both binary and multinomial sparse models are solvable in time almost linear in problem size, representing a very small extra relative cost compared to the classical naive Bayes. Numerical experiments on text data show that the naive Bayes feature selection method is as statistically effective as state-of-the-art feature selection methods such as recursive feature elimination, $l_1$-penalized logistic regression and LASSO, while being orders of magnitude faster.

LGJan 24, 2019
Theoretically Principled Trade-off between Robustness and Accuracy

Hongyang Zhang, Yaodong Yu, Jiantao Jiao et al.

We identify a trade-off between robustness and accuracy that serves as a guiding principle in the design of defenses against adversarial examples. Although this problem has been widely studied empirically, much remains unknown concerning the theory underlying this trade-off. In this work, we decompose the prediction error for adversarial examples (robust error) as the sum of the natural (classification) error and boundary error, and provide a differentiable upper bound using the theory of classification-calibrated loss, which is shown to be the tightest possible upper bound uniform over all probability distributions and measurable predictors. Inspired by our theoretical analysis, we also design a new defense method, TRADES, to trade adversarial robustness off against accuracy. Our proposed algorithm performs well experimentally in real-world datasets. The methodology is the foundation of our entry to the NeurIPS 2018 Adversarial Vision Challenge in which we won the 1st place out of ~2,000 submissions, surpassing the runner-up approach by $11.41\%$ in terms of mean $\ell_2$ perturbation distance.

LGNov 20, 2018
Fenchel Lifted Networks: A Lagrange Relaxation of Neural Network Training

Fangda Gu, Armin Askari, Laurent El Ghaoui

Despite the recent successes of deep neural networks, the corresponding training problem remains highly non-convex and difficult to optimize. Classes of models have been proposed that introduce greater structure to the objective function at the cost of lifting the dimension of the problem. However, these lifted methods sometimes perform poorly compared to traditional neural networks. In this paper, we introduce a new class of lifted models, Fenchel lifted networks, that enjoy the same benefits as previous lifted models, without suffering a degradation in performance over classical networks. Our model represents activation functions as equivalent biconvex constraints and uses Lagrange Multipliers to arrive at a rigorous lower bound of the traditional neural network training problem. This model is efficiently trained using block-coordinate descent and is parallelizable across data points and/or layers. We compare our model against standard fully connected and convolutional networks and show that we are able to match or beat their performance.

LGNov 6, 2018
Greedy Frank-Wolfe Algorithm for Exemplar Selection

Gary Cheng, Armin Askari, Kannan Ramchandran et al.

In this paper, we consider the problem of selecting representatives from a data set for arbitrary supervised/unsupervised learning tasks. We identify a subset $S$ of a data set $A$ such that 1) the size of $S$ is much smaller than $A$ and 2) $S$ efficiently describes the entire data set, in a way formalized via convex optimization. In order to generate $|S| = k$ exemplars, our kernelizable algorithm, Frank-Wolfe Sparse Representation (FWSR), only needs to execute $\approx k$ iterations with a per-iteration cost that is quadratic in the size of $A$. This is in contrast to other state of the art methods which need to execute until convergence with each iteration costing an extra factor of $d$ (dimension of the data). Moreover, we also provide a proof of linear convergence for our method. We support our results with empirical experiments; we test our algorithm against current methods in three different experimental setups on four different data sets. FWSR outperforms other exemplar finding methods both in speed and accuracy in almost all scenarios.

MLJun 18, 2018
Kernel-based Outlier Detection using the Inverse Christoffel Function

Armin Askari, Forest Yang, Laurent El Ghaoui

Outlier detection methods have become increasingly relevant in recent years due to increased security concerns and because of its vast application to different fields. Recently, Pauwels and Lasserre (2016) noticed that the sublevel sets of the inverse Christoffel function accurately depict the shape of a cloud of data using a sum-of-squares polynomial and can be used to perform outlier detection. In this work, we propose a kernelized variant of the inverse Christoffel function that makes it computationally tractable for data sets with a large number of features. We compare our approach to current methods on 15 different data sets and achieve the best average area under the precision recall curve (AUPRC) score, the best average rank and the lowest root mean square deviation.

LGMay 3, 2018
Lifted Neural Networks

Armin Askari, Geoffrey Negiar, Rajiv Sambharya et al.

We describe a novel family of models of multi- layer feedforward neural networks in which the activation functions are encoded via penalties in the training problem. Our approach is based on representing a non-decreasing activation function as the argmin of an appropriate convex optimiza- tion problem. The new framework allows for algo- rithms such as block-coordinate descent methods to be applied, in which each step is composed of a simple (no hidden layer) supervised learning problem that is parallelizable across data points and/or layers. Experiments indicate that the pro- posed models provide excellent initial guesses for weights for standard neural networks. In addi- tion, the model provides avenues for interesting extensions, such as robustness against noisy in- puts and optimizing over parameters in activation functions.

OCOct 30, 2014
Robust sketching for multiple square-root LASSO problems

Vu Pham, Laurent El Ghaoui, Arturo Fernandez

Many learning tasks, such as cross-validation, parameter search, or leave-one-out analysis, involve multiple instances of similar problems, each instance sharing a large part of learning data with the others. We introduce a robust framework for solving multiple square-root LASSO problems, based on a sketch of the learning data that uses low-rank approximations. Our approach allows a dramatic reduction in computational effort, in effect reducing the number of observations from $m$ (the number of observations to start with) to $k$ (the number of singular values retained in the low-rank model), while not sacrificing---sometimes even improving---the statistical performance. Theoretical analysis, as well as numerical experiments on both synthetic and real data, illustrate the efficiency of the method in large scale applications.

CLApr 29, 2014
Concise comparative summaries (CCS) of large text corpora with a human experiment

Jinzhu Jia, Luke Miratrix, Bin Yu et al.

In this paper we propose a general framework for topic-specific summarization of large text corpora and illustrate how it can be used for the analysis of news databases. Our framework, concise comparative summarization (CCS), is built on sparse classification methods. CCS is a lightweight and flexible tool that offers a compromise between simple word frequency based methods currently in wide use and more heavyweight, model-intensive methods such as latent Dirichlet allocation (LDA). We argue that sparse methods have much to offer for text analysis and hope CCS opens the door for a new branch of research in this important field. For a particular topic of interest (e.g., China or energy), CSS automatically labels documents as being either on- or off-topic (usually via keyword search), and then uses sparse classification methods to predict these labels with the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found as predictive are then harvested as the summary. To validate our tool, we, using news articles from the New York Times international section, designed and conducted a human survey to compare the different summarizers with human understanding. We demonstrate our approach with two case studies, a media analysis of the framing of "Egypt" in the New York Times throughout the Arab Spring and an informal comparison of the New York Times' and Wall Street Journal's coverage of "energy." Overall, we find that the Lasso with $L^2$ normalization can be effectively and usefully used to summarize large corpora, regardless of document size.

MLOct 26, 2012
Large-Scale Sparse Principal Component Analysis with Application to Text Data

Youwei Zhang, Laurent El Ghaoui

Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models.