Masked Matrix Multiplication for Emergent Sparsity
This addresses performance bottlenecks in transformer models and other AI workloads for hardware designers and practitioners, though it is incremental as it builds on existing matrix multiplication techniques.
The paper tackles the inefficiency of AI workloads with emergent sparsity on dense hardware by developing a masked matrix multiplication system that adapts to runtime sparsity, achieving up to 2 times speedup and 4 times fewer instructions for sparsity levels from 60% to 95% zeros compared to Intel MKL routines.
Artificial intelligence workloads, especially transformer models, exhibit emergent sparsity in which computations perform selective sparse access to dense data. The workloads are inefficient on hardware designed for dense computations and do not map well onto sparse data representations. We build a vectorized and parallel matrix-multiplication system A X B = C that eliminates unnecessary computations and avoids branches based on a runtime evaluation of sparsity. We use a combination of dynamic code lookup to adapt to the specific sparsity encoded in the B matrix and preprocessing of sparsity maps of the A and B matrices to compute conditional branches once for the whole computation. For a wide range of sparsity, from 60% to 95% zeros, our implementation performs fewer instructions and increases performance when compared with Intel MKL's dense or sparse matrix multiply routines. Benefits can be as large as 2 times speedup and 4 times fewer instructions.