LG · NA · Oct 19, 2016

CuMF_SGD: Fast and Scalable Matrix Factorization

arXiv:1610.05838v3 · 14 citations
Originality: Incremental advance
AI Analysis

This work addresses scalability and performance bottlenecks in matrix factorization for applications like recommender systems, though it is incremental as it optimizes an existing method (SGD) rather than introducing a new paradigm.

The paper tackles the memory-bound nature of stochastic gradient descent (SGD) for matrix factorization by developing cuMF_SGD, a CUDA-based solution that leverages GPUs' high memory bandwidth and fast intra-node connections. The result is a 3.1X-28.2X speedup compared to state-of-the-art CPU solutions on various datasets using a single GPU.
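The memory-bound claim can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope estimate, not a figure from the paper: it assumes a plain SGD update with L2 regularization on two k-dimensional factor vectors, no cache reuse, and a 12-byte sample (two 4-byte indices plus a 4-byte rating). The function name and its parameters are hypothetical.

```python
def sgd_update_intensity(k, bytes_per_float=4, sample_bytes=12):
    """Rough flops-per-byte estimate for one SGD update with k latent features.

    Assumptions (not taken from the paper): no cache reuse, both factor
    vectors are read once and written once, and the sample is read once.
    """
    # Prediction: dot product p.q (k multiplies, k-1 adds) plus one subtract
    # to form the error, about 2k flops.
    flops_predict = 2 * k
    # Update: for each of the two vectors and each feature, three multiplies
    # (err*q, reg*p, lr*...), one subtract, and one add, i.e. 5 flops.
    flops_update = 2 * 5 * k
    # Traffic: read p and q, write p and q back, and read one sample.
    bytes_moved = 4 * k * bytes_per_float + sample_bytes
    return (flops_predict + flops_update) / bytes_moved


if __name__ == "__main__":
    for k in (32, 64, 128):
        print(k, round(sgd_update_intensity(k), 3),
              round(sgd_update_intensity(k, bytes_per_float=2), 3))
```

Under these assumptions the update performs less than one floating-point operation per byte moved at single precision, which is in line with the observation that memory bandwidth, not compute, limits throughput. Passing `bytes_per_float=2` shows why half-precision storage, which the abstract lists among the kernel optimizations, raises that ratio.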

Matrix factorization (MF) is widely used in, e.g., recommender systems, topic modeling, and word embedding. Stochastic gradient descent (SGD) is popular for solving MF problems because it can handle large data sets and lends itself to incremental learning. We observe that SGD for MF is memory bound. Meanwhile, single-node CPU systems with caching perform well only for small data sets; distributed systems have higher aggregate memory bandwidth but suffer from relatively slow network connections. This observation inspires us to accelerate MF by utilizing GPUs' high memory bandwidth and fast intra-node connections. We present cuMF_SGD, a CUDA-based SGD solution for large-scale MF problems. On a single GPU, we design two workload scheduling schemes, batch-Hogwild! and wavefront-update, that fully exploit the massive number of cores. In particular, batch-Hogwild!, a vectorized version of Hogwild!, overcomes the issue of memory discontinuity. We also develop highly optimized kernels for the SGD update, leveraging cache, warp-shuffle instructions, and half-precision floats. In addition, we design a partition scheme to utilize multiple GPUs while addressing the well-known convergence issue that arises when parallelizing SGD. On three data sets with only one Maxwell or Pascal GPU, cuMF_SGD runs 3.1X-28.2X as fast as state-of-the-art CPU solutions on 1-64 CPU nodes. Evaluations also show that cuMF_SGD scales well on multiple GPUs on large data sets.
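For readers unfamiliar with the update that cuMF_SGD accelerates, here is a minimal serial sketch of SGD for matrix factorization in pure Python. It is not the paper's CUDA kernel and does not implement batch-Hogwild! or wavefront-update. It assumes the standard squared-error objective with L2 regularization, and the function names, hyperparameters (`k`, `lr`, `reg`, `epochs`), and toy ratings are all illustrative choices.

```python
import math
import random


def sgd_mf(ratings, n_users, n_items, k=8, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Serial SGD for R ~ P * Q^T over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    # Small random initial factors; one k-vector per user and per item.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the observed ratings in random order
        for u, v, r in samples:
            p, q = P[u], Q[v]
            # Prediction error for this single rating.
            err = r - sum(pf * qf for pf, qf in zip(p, q))
            for f in range(k):
                pf, qf = p[f], q[f]  # use the old values for both updates
                p[f] = pf + lr * (err * qf - reg * pf)
                q[f] = qf + lr * (err * pf - reg * qf)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the given triples."""
    total = 0.0
    for u, v, r in ratings:
        pred = sum(pf * qf for pf, qf in zip(P[u], Q[v]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(ratings))


if __name__ == "__main__":
    # Toy data: 3 users, 3 items, 6 observed ratings (made up for illustration).
    toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
    P, Q = sgd_mf(toy, n_users=3, n_items=3)
    print("training RMSE:", rmse(toy, P, Q))
```

Each update touches only one row of `P` and one row of `Q`, which is what makes lock-free schemes in the style of Hogwild! plausible: two samples that share neither a user nor an item can be processed concurrently without conflict.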

Code Implementations: 2 repos
