Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
This work addresses efficiency challenges in large language models that matter to AI practitioners, though it is incremental in that it builds on existing hardware-accelerated sparsity techniques.
The paper tackles the problem of accelerating large language model training and inference by applying 2:4 sparsity patterns to activations, achieving up to 1.3x faster Feed Forward Networks (FFNs) with no accuracy loss.
In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations in order to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity found in Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed Forward Networks (FFNs) in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
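To make the 2:4 pattern concrete, the following is a minimal pure-Python sketch, not the authors' GPU kernel. It assumes the usual definition of 2:4 sparsity (at most two nonzero values in every contiguous group of four) and a magnitude-based rule for choosing which values to keep; the function names `squared_relu`, `prune_2_4`, and `is_2_4_sparse` are hypothetical and chosen only for illustration.

```python
def squared_relu(xs):
    """Squared-ReLU applied elementwise: max(x, 0) ** 2."""
    return [max(x, 0.0) ** 2 for x in xs]


def prune_2_4(xs):
    """Keep the two largest-magnitude values in each group of four; zero the rest.

    Assumption: magnitude-based selection, with ties broken by position.
    """
    if len(xs) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = []
    for start in range(0, len(xs), 4):
        group = xs[start:start + 4]
        # sorted() is stable, so equal magnitudes keep their original order.
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def is_2_4_sparse(xs):
    """True if every group of four holds at most two nonzero values."""
    return all(
        sum(1 for v in xs[s:s + 4] if v != 0.0) <= 2
        for s in range(0, len(xs), 4)
    )


pre_activations = [-1.0, 2.0, -0.5, 3.0, 0.5, -2.0, 1.0, 1.5]
activations = squared_relu(pre_activations)
sparse_activations = prune_2_4(activations)
```

In this toy input, Squared-ReLU already zeroes two of the four values in the first group, so that group fits the 2:4 pattern unchanged. The second group has three nonzero values, so the sketch drops the smallest one. How the paper handles such groups without losing accuracy is not stated in the abstract.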