CL | AI | LG | Feb 26, 2025

Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

arXiv:2502.19261v2 | 17 citations | h-index: 7 | ICLR
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in efficiently training large-scale MoE models for natural language processing, offering a method that reduces computational cost while maintaining performance. It is incremental, however, since it builds on existing upcycling approaches.

The paper tackles slow training progress and suboptimal long-term performance in Mixture of Experts (MoE) models that are initialized from pre-trained dense models. It proposes Drop-Upcycling, which combines knowledge from the pre-trained dense model with partial re-initialization of the weights to enhance expert specialization. The resulting MoE model, with 5.9B active parameters, achieves performance comparable to a 13B dense model while using about 1/4 of the training FLOPs.

The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.
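
To make the idea of partial re-initialization concrete, here is a minimal sketch of one way it could work. The abstract only says that some parts of the weights are statistically re-initialized. The rest is an assumption made for illustration: re-initializing whole columns along the FFN intermediate dimension, drawing the new values from a normal distribution matched to the mean and standard deviation of the original weights, and the names `drop_upcycle_expert`, `drop_upcycle`, and `ratio`. It is not the authors' implementation.

```python
import random
import statistics


def drop_upcycle_expert(dense_w, ratio, rng):
    """Copy a dense FFN weight matrix for one expert and re-initialize a
    random fraction of its columns (hypothetical helper, see assumptions).

    dense_w: list of rows, shape (d_model, d_ff).
    ratio:   fraction of columns to re-initialize
             (0.0 = plain upcycling, 1.0 = fully random expert).
    """
    d_ff = len(dense_w[0])
    n_drop = int(round(ratio * d_ff))
    # Each expert draws its own column subset, so experts start out different.
    dropped = set(rng.sample(range(d_ff), n_drop))

    # Assumption: new values follow the statistics of the original weights.
    flat = [w for row in dense_w for w in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)

    expert = [
        [rng.gauss(mean, std) if j in dropped else w for j, w in enumerate(row)]
        for row in dense_w
    ]
    return expert, dropped


def drop_upcycle(dense_w, num_experts, ratio, seed=0):
    """Build `num_experts` partially re-initialized copies of `dense_w`."""
    rng = random.Random(seed)
    return [drop_upcycle_expert(dense_w, ratio, rng) for _ in range(num_experts)]


# Toy dense weight matrix: d_model = 4, d_ff = 8.
toy_rng = random.Random(123)
dense = [[toy_rng.gauss(0.0, 0.02) for _ in range(8)] for _ in range(4)]
experts = drop_upcycle(dense, num_experts=4, ratio=0.5)
```

In this sketch, `ratio=0.0` reduces to plain upcycling (every expert is an exact copy of the dense weights), while `ratio=1.0` discards the pre-trained values entirely. Intermediate values keep part of the dense model's knowledge while giving each expert a different starting point.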

