Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon
This work addresses performance bottlenecks for quantized machine learning on Apple Silicon, but its contribution is incremental: it optimizes an existing method for a specific hardware platform.
The paper tackles the under-optimization of Sparse Ternary GEMM on Apple Silicon CPUs by developing an architecture-aware kernel that achieves up to a 5.98x performance increase over a baseline and reaches up to 50.2% of the processor's theoretical peak performance.
Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations: a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices with 50% sparsity (defined here as the fraction of ternary nonzero values), reaching up to 50.2% of the processor's theoretical peak performance. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity. Both implementations remain stable across varying sparsity levels.
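To make the baseline concrete, the following is a minimal sketch of a TCSC-style ternary GEMM in plain Python. It is an illustration under assumptions, not the paper's kernel: the layout (separate column-pointer and row-index arrays for the +1 and -1 entries) and the names `to_tcsc` and `ternary_gemm` are hypothetical, and none of the blocked, interleaved, ILP, or NEON optimizations are shown. It only demonstrates why ternary weights let the inner loop use additions and subtractions instead of multiplications.

```python
def to_tcsc(W):
    """Compress a dense ternary matrix W (K x N, entries in {-1, 0, +1}).

    Assumed layout: for each column, the row indices of +1 entries and of
    -1 entries are stored separately, each with its own column-pointer array.
    """
    K, N = len(W), len(W[0])
    pos_ptr, neg_ptr = [0], [0]
    pos_idx, neg_idx = [], []
    for n in range(N):
        for k in range(K):
            if W[k][n] == 1:
                pos_idx.append(k)
            elif W[k][n] == -1:
                neg_idx.append(k)
        pos_ptr.append(len(pos_idx))
        neg_ptr.append(len(neg_idx))
    return pos_ptr, pos_idx, neg_ptr, neg_idx


def ternary_gemm(X, tcsc, N):
    """Compute Y = X @ W for an M x K activation matrix X, without multiplies."""
    pos_ptr, pos_idx, neg_ptr, neg_idx = tcsc
    Y = [[0.0] * N for _ in X]
    for m, row in enumerate(X):
        for n in range(N):
            acc = 0.0
            # Entries equal to +1 contribute an addition.
            for i in range(pos_ptr[n], pos_ptr[n + 1]):
                acc += row[pos_idx[i]]
            # Entries equal to -1 contribute a subtraction.
            for i in range(neg_ptr[n], neg_ptr[n + 1]):
                acc -= row[neg_idx[i]]
            Y[m][n] = acc
    return Y


# Small example: W is 4 x 3, X is 2 x 4.
W = [[1, 0, -1],
     [0, -1, 1],
     [-1, 1, 0],
     [1, 1, 0]]
X = [[1.0, 2.0, 3.0, 4.0],
     [0.5, -1.0, 2.0, 0.0]]
tcsc = to_tcsc(W)
Y = ternary_gemm(X, tcsc, 3)
```

The indexed reads `row[pos_idx[i]]` are gathers with data-dependent addresses, which is the kind of irregular access that a blocked and interleaved format would aim to make more cache-friendly.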