PF, LG, Oct 8, 2025

Accelerating Sparse Ternary GEMM for Quantized ML on Apple Silicon

arXiv:2510.06957v2
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks for quantized machine learning on Apple Silicon, but it is incremental as it optimizes an existing method for a specific hardware platform.

The paper tackles the under-optimization of Sparse Ternary GEMM on Apple Silicon CPUs by developing an architecture-aware kernel that achieves up to a 5.98x speedup over a TCSC baseline and reaches up to 50.2% of the processor's theoretical peak performance.

Sparse Ternary General Matrix-Matrix Multiplication (GEMM) remains under-optimized in existing libraries for Apple Silicon CPUs. We present a Sparse Ternary GEMM kernel optimized specifically for Apple's M-series processors. We propose a set of architecture-aware optimizations, including a novel blocked and interleaved sparse data format to improve memory locality, strategies to increase Instruction-Level Parallelism (ILP), and NEON-based Single Instruction Multiple Data (SIMD) vectorization to exploit data-level parallelism. Our scalar implementation achieves up to a 5.98x performance increase over a traditional Ternary Compressed Sparse Column (TCSC) baseline for large matrices in which 50% of the values are ternary nonzeros (the paper's measure of sparsity), reaching up to 50.2% of the processor's theoretical peak performance, and remains stable across varying sparsity levels. Our vectorized implementation delivers up to a 5.59x performance increase for large matrices with 25% sparsity, and likewise remains stable across varying sparsity levels.
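To make the baseline concrete, here is a minimal Python sketch of a TCSC-style layout and the multiplication it enables. It is an illustration only, not the paper's kernel: the abstract does not specify the exact TCSC arrays, so the split into separate +1 and -1 row-index lists, and the names `dense_to_tcsc`, `ternary_gemm`, `col_start_pos`, `row_idx_pos`, and so on, are assumptions made for this example. The paper's blocked and interleaved format, ILP strategies, and NEON vectorization are not reproduced here.

```python
def dense_to_tcsc(W):
    """Convert a K x N ternary matrix (entries in {-1, 0, +1}) into an
    assumed TCSC-like layout: per column, the row indices of the +1 entries
    and, separately, the row indices of the -1 entries."""
    K = len(W)
    N = len(W[0]) if K else 0
    col_start_pos, col_start_neg = [0], [0]
    row_idx_pos, row_idx_neg = [], []
    for j in range(N):
        for k in range(K):
            v = W[k][j]
            if v == 1:
                row_idx_pos.append(k)
            elif v == -1:
                row_idx_neg.append(k)
            elif v != 0:
                raise ValueError("matrix is not ternary")
        # Column j occupies [col_start[j], col_start[j + 1]) in each index list.
        col_start_pos.append(len(row_idx_pos))
        col_start_neg.append(len(row_idx_neg))
    return {
        "n_cols": N,
        "col_start_pos": col_start_pos,
        "row_idx_pos": row_idx_pos,
        "col_start_neg": col_start_neg,
        "row_idx_neg": row_idx_neg,
    }


def ternary_gemm(X, fmt):
    """Compute Y = X @ W for a dense M x K matrix X and a ternary W stored in
    the layout above. No multiplications are needed: each output element is a
    sum of the selected inputs minus another sum of selected inputs."""
    N = fmt["n_cols"]
    Y = [[0.0] * N for _ in range(len(X))]
    for m, x_row in enumerate(X):
        for j in range(N):
            acc = 0.0
            for p in range(fmt["col_start_pos"][j], fmt["col_start_pos"][j + 1]):
                acc += x_row[fmt["row_idx_pos"][p]]
            for p in range(fmt["col_start_neg"][j], fmt["col_start_neg"][j + 1]):
                acc -= x_row[fmt["row_idx_neg"][p]]
            Y[m][j] = acc
    return Y
```

The inner loops gather from `x_row` at irregular indices, which is the kind of memory access pattern that a blocked, interleaved format of the sort described in the abstract would aim to make more cache-friendly.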

