LG | CL | DC | Dec 12, 2023

FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems

arXiv:2312.07743v1 | h-index: 24 | Has Code | ICS
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks for researchers and practitioners using Word2Vec in NLP and other domains, offering a significant but incremental improvement over prior GPU optimizations.

The paper tackles the high computational cost of Word2Vec on GPUs by identifying memory access as a bottleneck and proposes FULL-W2V, a novel algorithm that reduces GPU global memory accesses by over 89% and achieves a 5.72X speedup over state-of-the-art implementations on V100 cards while maintaining embedding quality.

Word2Vec remains one of the most impactful innovations in the field of Natural Language Processing (NLP), representing latent grammatical and syntactical information in human text with dense, low-dimensional vectors. Word2Vec has a high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to gain throughput effectively on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior work on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to the lower levels of the memory hierarchy and improve temporal locality. FULL-W2V reduces accesses to GPU global memory by more than 89% compared to prior state-of-the-art GPU implementations, resulting in a significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching, together with high-throughput shared memory reduction, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.
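To make the idea of data reuse in a sliding context window concrete, here is a minimal toy model in Python. It is not taken from the paper and does not model a GPU: the function names `naive_loads` and `cached_loads`, the sentence length, and the window size are all assumptions chosen for illustration. It only counts how many embedding rows would be fetched from slow memory when each center word reloads its whole context, versus when rows shared by adjacent windows are kept in a small fast cache. The counts it produces are properties of this toy and should not be read as the measurements reported in the abstract.

```python
def naive_loads(sentence_len: int, window: int) -> int:
    """Count row fetches when every center word reloads all of its context rows."""
    loads = 0
    for center in range(sentence_len):
        lo = max(0, center - window)
        hi = min(sentence_len - 1, center + window)
        loads += hi - lo  # context positions in [lo, hi], excluding the center
    return loads


def cached_loads(sentence_len: int, window: int) -> int:
    """Count row fetches when the current window's rows stay in a fast cache.

    Assumption: the cache holds exactly the positions of the current window
    (center included, since it serves as context for its neighbours), and
    rows that slide out of the window are evicted.
    """
    cached: set[int] = set()
    loads = 0
    for center in range(sentence_len):
        lo = max(0, center - window)
        hi = min(sentence_len - 1, center + window)
        needed = set(range(lo, hi + 1))
        loads += len(needed - cached)  # fetch only rows not already cached
        cached = needed                # evict rows outside the new window
    return loads


if __name__ == "__main__":
    n, w = 100, 5  # hypothetical sentence length and window size
    print("naive fetches: ", naive_loads(n, w))
    print("cached fetches:", cached_loads(n, w))
```

In this toy, the cached variant fetches each sentence position once, so the saving grows with the window size. A real GPU kernel would also have to handle negative samples, write-back of updated rows, and the capacity limits of registers and shared memory, none of which is modelled here.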

Code Implementations: 1 repo