LGMLOct 16, 2018

Fast Randomized PCA for Sparse Data

arXiv:1810.06825v132 citations
Originality Incremental advance
AI Analysis

This work provides a practical solution for researchers and practitioners in fields like social network analysis and information retrieval who need efficient PCA on large sparse data, though it is incremental as it builds on existing randomized SVD methods.

The authors tackled the problem of performing PCA on large sparse datasets, which is computationally expensive and memory-intensive, by proposing a fast randomized PCA algorithm optimized for sparse data, achieving up to 9.1x speedup over existing methods without accuracy loss and handling datasets with over 12 million entries in under 400 seconds.

Principal component analysis (PCA) is widely used for dimension reduction and embedding of real data in social network analysis, information retrieval, and natural language processing, etc. In this work we propose a fast randomized PCA algorithm for processing large sparse data. The algorithm has similar accuracy to the basic randomized SVD (rPCA) algorithm (Halko et al., 2011), but is largely optimized for sparse data. It also has good flexibility to trade off runtime against accuracy for practical usage. Experiments on real data show that the proposed algorithm is up to 9.1X faster than the basic rPCA algorithm without accuracy loss, and is up to 20X faster than the svds in Matlab with little error. The algorithm computes the first 100 principal components of a large information retrieval data with 12,869,521 persons and 323,899 keywords in less than 400 seconds on a 24-core machine, while all conventional methods fail due to the out-of-memory issue.

Code Implementations2 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes