LG | AI | CL | Jan 20

Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models

arXiv:2601.14327v1
Originality: Incremental advance
AI Analysis

This work addresses training-efficiency issues for researchers and practitioners who train large-scale MoE models. It is an incremental improvement over existing expert pruning methods.

The paper tackles the computational bottleneck in pre-training Mixture-of-Experts Large Language Models by introducing a Layer-Adaptive Expert Pruning algorithm, which improves training efficiency by 48.3% and reduces parameters by 33.3% while maintaining performance.

Although Mixture-of-Experts (MoE) Large Language Models (LLMs) deliver superior accuracy with a reduced number of active parameters, their pre-training represents a significant computational bottleneck due to underutilized experts and limited training efficiency. This work introduces a Layer-Adaptive Expert Pruning (LAEP) algorithm designed for the pre-training stage of MoE LLMs. In contrast to previous expert pruning approaches that operate primarily in the post-training phase, the proposed algorithm enhances training efficiency by selectively pruning underutilized experts and reorganizing experts across computing devices according to token distribution statistics. Comprehensive experiments demonstrate that LAEP effectively reduces model size and substantially improves pre-training efficiency. In particular, when pre-training the 1010B Base model from scratch, LAEP achieves a 48.3% improvement in training efficiency alongside a 33.3% parameter reduction, while still delivering excellent performance across multiple domains.
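The abstract names two mechanisms, pruning underutilized experts per layer and reorganizing the remaining experts across devices by token distribution, but it does not state the exact criteria. The Python sketch below is only an illustration under assumed rules: an expert is kept if its share of routed tokens in its layer is at least `min_share` times the uniform share, and the kept experts are placed on devices with a greedy heaviest-first heuristic. The function names, the `min_share` threshold, and the token counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def select_experts_to_keep(token_counts: List[int], min_share: float) -> List[int]:
    """Keep experts whose token share is at least min_share * (1 / num_experts).

    Assumed criterion for illustration; the paper's actual rule may differ.
    """
    total = sum(token_counts)
    if total == 0:
        # No routing statistics yet: keep every expert.
        return list(range(len(token_counts)))
    uniform = 1.0 / len(token_counts)
    return [i for i, c in enumerate(token_counts) if c / total >= min_share * uniform]


def assign_experts_to_devices(
    token_counts: List[int], kept: List[int], num_devices: int
) -> Tuple[Dict[int, int], List[int]]:
    """Greedy placement: heaviest expert first, onto the least-loaded device."""
    loads = [0] * num_devices
    placement: Dict[int, int] = {}
    for expert in sorted(kept, key=lambda i: token_counts[i], reverse=True):
        device = min(range(num_devices), key=lambda d: loads[d])
        placement[expert] = device
        loads[device] += token_counts[expert]
    return placement, loads


# Hypothetical per-layer token counts (tokens routed to each expert).
layer_counts = {
    0: [400, 350, 30, 220],   # one clearly underutilized expert
    1: [250, 260, 240, 250],  # roughly balanced layer
}

# The threshold is relative to each layer's own distribution, so the
# number of experts kept can differ from layer to layer.
kept_per_layer = {
    layer: select_experts_to_keep(counts, min_share=0.5)
    for layer, counts in layer_counts.items()
}

placement0, loads0 = assign_experts_to_devices(
    layer_counts[0], kept_per_layer[0], num_devices=2
)
```

Because the threshold is evaluated against each layer's own routing statistics, the imbalanced layer loses an expert while the balanced layer keeps all of them, which is one simple way to read "layer-adaptive". The greedy placement is a stand-in for whatever reorganization scheme the authors use.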

