LG · Aug 21, 2024

Mixed Sparsity Training: Achieving 4$\times$ FLOP Reduction for Transformer Pretraining

arXiv:2408.11746v1 · h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses the computational bottleneck facing researchers and practitioners who train transformer-based models, though the advance is incremental because it builds on existing dynamic sparse training methods.

The paper tackles the high computational cost of pretraining large language models by proposing Mixed Sparsity Training, which reduces FLOPs by 75% (4× reduction) while maintaining performance on GPT-2.
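The two figures describe the same saving: removing 75% of the FLOPs leaves 25% of the dense cost, and 1 / 0.25 = 4. A minimal Python check of that arithmetic follows; the function name `reduction_factor` is an illustrative choice, not something taken from the paper.

```python
def reduction_factor(fraction_removed: float) -> float:
    """Convert a fractional FLOP saving into a 'k times fewer FLOPs' factor."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    # The remaining cost is (1 - fraction_removed) of the dense baseline.
    return 1.0 / (1.0 - fraction_removed)


print(reduction_factor(0.75))  # 4.0
```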

Large language models (LLMs) have made significant strides in complex tasks, yet their widespread adoption is impeded by substantial computational demands. With hundreds of billions of parameters, transformer-based LLMs necessitate months of pretraining across a high-end GPU cluster. However, this paper reveals a compelling finding: transformers exhibit considerable redundancy in pretraining computations, which motivates our proposed solution, Mixed Sparsity Training (MST), an efficient pretraining method that can reduce floating point operations (FLOPs) by about $75\%$ while maintaining performance. MST integrates dynamic sparse training (DST) with Sparsity Variation (SV) and Hybrid Sparse Attention (HSA) during pretraining, which proceeds in three distinct phases: warm-up, ultra-sparsification, and restoration. The warm-up phase transforms the dense model into a sparse one, and the restoration phase reinstates connections. Throughout these phases, the model is trained with a dynamically evolving sparse topology and an HSA mechanism, which together maintain performance while minimizing training FLOPs. Our experiment on GPT-2 shows a FLOP reduction of $4\times$ without compromising performance.
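The three-phase structure can be pictured as a sparsity schedule over training steps. The sketch below is only an illustration of that idea, not the paper's actual schedule: the piecewise-linear shape, the function name `sparsity_at`, and every default value (phase lengths, peak and final sparsity) are hypothetical assumptions, since the abstract gives no numbers for them.

```python
def sparsity_at(step: int, total_steps: int,
                warmup_frac: float = 0.2, restore_frac: float = 0.2,
                peak_sparsity: float = 0.9, final_sparsity: float = 0.5) -> float:
    """Return the target weight sparsity at a given training step.

    All default values are hypothetical; the abstract does not state them.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    warmup_end = warmup_frac * total_steps
    restore_start = (1.0 - restore_frac) * total_steps

    if step < warmup_end:
        # Warm-up: move from a dense model (sparsity 0) towards the peak.
        return peak_sparsity * step / warmup_end
    if step <= restore_start:
        # Ultra-sparsification: hold the highest sparsity level.
        return peak_sparsity
    # Restoration: reinstate connections by lowering sparsity again.
    progress = (step - restore_start) / (total_steps - restore_start)
    return peak_sparsity + (final_sparsity - peak_sparsity) * progress


# Sample the schedule at a few points of a hypothetical 1000-step run.
for s in (0, 100, 500, 900, 1000):
    print(s, round(sparsity_at(s, 1000), 3))
```

A dynamic sparse training loop would read this value at each topology update to decide how many connections to keep. How the paper actually varies sparsity and combines it with HSA is described only at a high level in the abstract.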

