DS LG NA | Apr 25, 2017

Single-Pass PCA of Large High-Dimensional Data

Wenjian Yu, Yu Gu, Jian Li, Shenghua Liu, Yaohang Li

arXiv:1704.07669v18.050 citations | h-index: 78 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses efficient PCA computation for researchers and practitioners working with extremely large datasets that are stored on disk or arrive as a stream. It is an incremental improvement over prior single-pass methods.

The authors propose a single-pass randomized algorithm for computing the PCA of large high-dimensional data, suitable for data held in slow memory or generated as a stream. In their experiments, the algorithm achieved orders of magnitude smaller error than an existing single-pass method and computed the first 50 principal components of a 150 GB dataset in 24 minutes on a typical 24-core computer, using less than 1 GB of memory.

Principal component analysis (PCA) is a fundamental dimension reduction tool in statistics and machine learning. For large and high-dimensional data, computing the PCA (i.e., the singular vectors corresponding to a number of dominant singular values of the data matrix) becomes a challenging task. In this work, a single-pass randomized algorithm is proposed to compute PCA with only one pass over the data. It is suitable for processing extremely large and high-dimensional data stored in slow memory (hard disk) or the data generated in a streaming fashion. Experiments with synthetic and real data validate the algorithm's accuracy, which has orders of magnitude smaller error than an existing single-pass algorithm. For a set of high-dimensional data stored as a 150 GB file, the proposed algorithm is able to compute the first 50 principal components in just 24 minutes on a typical 24-core computer, with less than 1 GB memory cost.
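To make the single-pass idea concrete, here is a minimal NumPy sketch of one way a randomized PCA can be computed while reading each block of rows exactly once. It is an illustration under stated assumptions and not necessarily the authors' exact algorithm: the function name `single_pass_pca`, its parameters, and the oversampling default are hypothetical, the data is assumed to be already mean-centered, and the sketch assumes the random projection has full column rank. For each row block `a`, it accumulates the projected rows `G = A @ omega` and the product `H = A.T @ G`. After the pass, with `G = Q R`, the small matrix `B = Q.T @ A` is recovered from `R.T @ B = H.T` without touching the data again.

```python
import numpy as np

def single_pass_pca(row_blocks, n_features, k, oversample=10, seed=0):
    """Approximate top-k SVD of A, reading each row block of A once.

    Hypothetical helper for illustration; assumes A is mean-centered.
    """
    rng = np.random.default_rng(seed)
    l = k + oversample                      # sketch width (assumed default)
    omega = rng.standard_normal((n_features, l))

    sketch_rows = []                        # rows of G = A @ omega
    H = np.zeros((n_features, l))           # accumulates A.T @ G

    for a in row_blocks:                    # the only pass over the data
        g = a @ omega
        sketch_rows.append(g)
        H += a.T @ g

    G = np.vstack(sketch_rows)              # shape (n_samples, l)
    Q, R = np.linalg.qr(G)                  # orthonormal basis for range(G)

    # B = Q.T @ A = inv(R).T @ G.T @ A = inv(R).T @ H.T
    B = np.linalg.solve(R.T, H.T)           # shape (l, n_features)

    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    # Rows of Vt[:k] are the approximate principal directions.
    return U[:, :k], s[:k], Vt[:k]
```

The function accepts any iterable of row blocks, including a generator that reads chunks from a file, so the full matrix never has to be in memory at once. This sketch still keeps the `n_samples × l` matrix `G` and the `n_features × l` matrix `H` in memory, which is small compared with the data when `l` is far below both dimensions.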

