AR · LG · Jul 16, 2022

S4: a High-sparsity, High-performance AI Accelerator

arXiv:2207.08006v1 · 6 citations · h-index: 19
AI Analysis

The paper addresses inefficient neural network inference by enabling acceleration at high degrees of sparsity. The contribution is incremental in that it builds on existing sparse pruning techniques.

The authors tackled the challenge of accelerating neural network inference by exploiting high-degree sparsity, introducing S4, a commercial hardware platform that supports up to 32x sparsity and demonstrates several-times speedup over mainstream platforms like Nvidia T4.

Exploiting the sparsity underlying neural networks has become one of the most promising methodologies for reducing the memory footprint, I/O cost, and computation workload during inference. The degree of sparsity one can exploit has also grown as larger models are considered, following the trend of pre-training giant models. However, unlike quantization, which is widely supported, acceleration through high-degree sparsity is not supported on most computing platforms. In this work, we introduce S4, the first commercial hardware platform supporting high-degree sparsity acceleration of up to 32 times. Combined with state-of-the-art sparse pruning techniques, we demonstrate a practical inference speedup of several times on S4 over mainstream inference platforms such as Nvidia T4. We also show that in practice a sparse model of larger size can achieve both higher accuracy and higher throughput on S4 than a dense model of smaller size.
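
To make the memory and computation argument concrete, the following is a minimal sketch of what "32 times sparsity" means arithmetically. It is not the S4 toolchain or the pruning method used in the paper: it assumes plain unstructured magnitude pruning and a CSR (compressed sparse row) layout, and the function names magnitude_prune, to_csr, and csr_matvec are hypothetical, chosen only for illustration.

```python
import random


def magnitude_prune(weights, sparsity_factor):
    """Keep the largest-magnitude 1/sparsity_factor of the entries, zero the rest.

    Assumption: unstructured magnitude pruning, used here only as a stand-in
    for the (unspecified) pruning techniques mentioned in the abstract.
    """
    rows, cols = len(weights), len(weights[0])
    keep = max(1, (rows * cols) // sparsity_factor)
    # Rank every (row, col) position by absolute weight, largest first.
    order = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: abs(weights[rc[0]][rc[1]]),
        reverse=True,
    )
    kept = set(order[:keep])
    return [
        [weights[r][c] if (r, c) in kept else 0.0 for c in range(cols)]
        for r in range(rows)
    ]


def to_csr(dense):
    """Store only the nonzero entries as (values, column indices, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for c, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(c)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(csr, x):
    """Multiply a CSR matrix by a vector, one multiply-accumulate per nonzero."""
    values, col_idx, row_ptr = csr
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    rng = random.Random(0)
    n = 64
    dense = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    pruned = magnitude_prune(dense, sparsity_factor=32)
    values, col_idx, row_ptr = to_csr(pruned)
    print("dense multiply-accumulates per matvec: ", n * n)
    print("sparse multiply-accumulates per matvec:", len(values))
```

The count of stored values and of multiply-accumulates both shrink by the sparsity factor, but the index arrays add storage overhead and the irregular access pattern is hard to run efficiently on dense-oriented hardware. That gap between the theoretical reduction and practical speedup is the motivation the abstract gives for dedicated sparsity support.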

