DC LG | Oct 3, 2023

VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores

Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler

arXiv:2310.02065v1 | 10.837 citations | h-index: 27 | Has Code

Originality: Highly original

AI Analysis

The V:N:M format enables more efficient deep learning inference at sparsity ratios above 50%, addressing computational bottlenecks for AI practitioners, though the work is incremental in that it builds on existing SPTC hardware.

The paper tackles the restriction of NVIDIA's Sparse Tensor Cores (SPTCs) to 50% sparsity by introducing the V:N:M format, which supports arbitrary N:M ratios. Its Spatha library achieves up to 37x speedup over cuBLAS, and a second-order pruning technique maintains accuracy in transformers at high sparsity.

The increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead to both smaller models and higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize the hardware support of specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
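
To make the N:M notation concrete, the following is a minimal sketch of what an N:M constraint means: in every group of M consecutive weights, at most N are nonzero, so 2:4 yields 50% sparsity and 2:8 yields 75%. It uses simple magnitude-based selection as an assumption for illustration; it is not the paper's second-order pruning technique, and it does not model the vector dimension V of the V:N:M format or Spatha's storage layout. The names `prune_n_m` and `sparsity` are hypothetical helpers, not part of any library.

```python
def prune_n_m(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m consecutive entries.

    Illustrative magnitude pruning only (an assumption, not the paper's method).
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    pruned = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        # Zero out everything that was not selected.
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


def sparsity(row):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in row if v == 0) / len(row)


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
pruned_2_4 = prune_n_m(weights, n=2, m=4)  # the ratio SPTCs support natively
pruned_2_8 = prune_n_m(weights, n=2, m=8)  # a sparser N:M ratio
print(pruned_2_4, sparsity(pruned_2_4))
print(pruned_2_8, sparsity(pruned_2_8))
```

With 2:4, each group of four keeps its two largest-magnitude weights; widening the group to M = 8 while holding N = 2 raises the sparsity of the same row from 50% to 75%, which is the kind of ratio the abstract says plain 2:4 hardware support cannot express on its own.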
