CLLGFeb 8, 2024

Efficient Stagewise Pretraining via Progressive Subnetworks

Tsinghua
arXiv:2402.05913v210 citationsh-index: 41ICLR
Originality Highly original
AI Analysis

This work addresses the problem of inefficient pretraining for large language models, offering a novel method that is competitive with existing approaches.

This paper challenges the prevailing view that stagewise dropping strategies are ineffective for efficient pretraining by developing a progressive subnetwork training framework called RAPTR, which trains random subnetworks and progressively increases their size, achieving up to 33% faster training on BERT and UL2 benchmarks and improving downstream performance on UL2 by 1.5%.

Recent developments in large language models have sparked interest in efficient pretraining methods. Stagewise training approaches to improve efficiency, like gradual stacking and layer dropping (Reddi et al, 2023; Zhang & He, 2020), have recently garnered attention. The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective, especially when compared to stacking-based approaches. This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which only trains subnetworks within the model and progressively increases the size of subnetworks during training, until it trains the full network. We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork (e.g. depth-wise, width-wise) of the network at each step, progressively increasing the size in stages. We show that this approach not only generalizes prior works like layer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing complexity of subnetworks in stages, conceptually diverging from prior works on layer dropping, and (b) stability in loss across stage transitions in presence of key modern architecture components like residual connections and layer norms. Through comprehensive experiments, we demonstrate that RAPTR can significantly speed up training of standard benchmarks like BERT and UL2, up to 33% compared to standard training and, surprisingly, also shows better downstream performance on UL2, improving QA tasks and SuperGLUE by 1.5%; thereby, providing evidence of better inductive bias.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes