LG IT NA MLNov 19, 2013

Near-Optimal Entrywise Sampling for Data Matrices

Dimitris Achlioptas, Zohar Karnin, Edo Liberty

arXiv:1311.4643v142 citations

Originality Incremental advance

AI Analysis

This work addresses efficient matrix sketching for large datasets in streaming settings, offering a practical solution with theoretical guarantees, though it is incremental in improving upon existing sampling methods.

The paper tackles the problem of selecting non-zero entries of a matrix to produce a sparse sketch that minimizes the spectral norm error, proposing sampling distributions that are computable from minimal information, support streaming with O(1) computation per non-zero, yield compressible sketches, and are provably competitive with the optimal offline distribution.

We consider the problem of selecting non-zero entries of a matrix $A$ in order to produce a sparse sketch of it, $B$, that minimizes $\|A-B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing $n$ observations over $m$ attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding $A$. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.

View on arXiv PDF

Similar