Efficient Dynamic Structured Sparse Training with Learned Shuffles
This work addresses the efficiency-accuracy trade-off in sparse training, offering deep learning practitioners a hybrid approach that pairs hardware-friendly structured sparsity with learned per-layer permutations.
The paper tackles the accuracy gap between structured sparsity and unstructured methods by learning permutations that restore expressivity. It reports accuracy comparable to unstructured baselines at 90-95% sparsity on ImageNet-1K and WikiText-103, while training up to 1.21x and inferring up to 2.9x faster.
Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applied to three canonical structures -- block, N:M, and diagonals -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.
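The expressivity argument above can be made concrete with a small counting exercise. The sketch below is only an illustration, not the paper's method: it counts the masks reachable by an unstructured layout versus an N:M layout on a toy row of 8 weights, then shows that a permutation of the index order maps a valid N:M mask to one outside the N:M family. The helper names (`count_unstructured_masks`, `count_n_m_masks`, `is_n_m`, `permute`) and the hand-picked permutation are assumptions made for this example; in PA-DST the permutation is learned jointly with the weights.

```python
from math import comb


def count_unstructured_masks(n: int, w: int) -> int:
    """Number of masks with exactly w active weights out of n."""
    return comb(n, w)


def count_n_m_masks(n: int, n_keep: int, m: int) -> int:
    """Number of N:M masks over n weights (each group of m keeps n_keep)."""
    if n % m != 0:
        raise ValueError("n must be a multiple of m")
    return comb(m, n_keep) ** (n // m)


def is_n_m(mask: list, n_keep: int, m: int) -> bool:
    """Check that every consecutive group of m entries has n_keep ones."""
    return all(sum(mask[i:i + m]) == n_keep for i in range(0, len(mask), m))


def permute(mask: list, perm: list) -> list:
    """Reorder mask so that output position i takes mask[perm[i]]."""
    return [mask[p] for p in perm]


# Toy setting (assumed for illustration): 8 weights, 2:4 sparsity, 4 active.
n, n_keep, m = 8, 2, 4
w = n_keep * (n // m)

unstructured = count_unstructured_masks(n, w)  # all ways to pick 4 of 8
structured = count_n_m_masks(n, n_keep, m)     # 2:4 masks only

mask = [1, 1, 0, 0, 1, 1, 0, 0]      # a valid 2:4 mask
perm = [0, 1, 4, 5, 2, 3, 6, 7]      # hand-picked, not learned
permuted = permute(mask, perm)       # same sparsity, different layout

print(unstructured, structured)
print(permuted, is_n_m(permuted, n_keep, m))
```

In this toy case the N:M layout reaches 36 of the 70 masks with four active weights, and the permuted mask keeps the same number of active weights while falling outside the 2:4 family. This is the sense in which a permutation extends the set of sparsity patterns a structured layer can express.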