LG
Sep 15, 2023

Scaling Laws for Sparsely-Connected Foundation Models

arXiv:2309.08520v1
48 citations
h-index: 52
Originality: Incremental advance
AI Analysis

The paper offers both theoretical understanding and practical guidance on using sparsity to improve the computational efficiency of foundation models, though it is an incremental advance within the sparsity optimization field.

The authors identify the first scaling law relating weight sparsity, number of non-zero parameters, and amount of training data in Transformers. They validate it empirically on ViT/JFT-4B and T5/C4, and show that, for a fixed number of non-zero parameters, the optimal sparsity increases with the amount of training data.

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales, on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level which yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we identify that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
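To make the n:m pattern mentioned in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common convention that n of every m consecutive weights are kept non-zero, and it selects them by magnitude, which is an illustrative choice rather than something the abstract specifies. The names `nm_prune` and `sparsity` are hypothetical helpers introduced here.

```python
def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude values in every group of m consecutive weights.

    Assumption: "n:m" means n non-zeros per group of m; the abstract does not
    spell out the convention or the selection criterion.
    """
    if len(weights) % m != 0:
        raise ValueError("length of weights must be a multiple of m")
    pruned = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n entries with the largest absolute value in this group.
        order = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)
        keep = set(order[:n])
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0) / len(weights)


row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.7, -0.2, 0.0]
pruned_row = nm_prune(row, n=2, m=4)
print(pruned_row)            # [0.9, 0.0, 0.0, -1.2, 0.3, 0.7, 0.0, 0.0]
print(sparsity(pruned_row))  # 0.5
```

Under this convention a 2:4 pattern yields at least 50% sparsity. Because the zeros fall at a fixed rate within every small group, hardware can exploit the pattern more easily than it can unstructured sparsity.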

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
