LG CL Jan 25, 2025

ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

Shangqian Gao, Ting Hua, Reza Shirkavand, Chi-Heng Lin, Zhen Tang, Zhengao Li, Longge Yuan, Fangyi Li, Zeyu Zhang, Alireza Ganjdanesh, Lou Qian, Xu Jie

arXiv:2501.15316v119.79 citations h-index: 10 Trans. Mach. Learn. Res.

Originality: Incremental advance

AI Analysis

This work addresses deployment challenges for large language models on resource-constrained devices, offering a novel approach to reduce active parameters without permanent deletion, though it is incremental in the context of existing pruning and MoE methods.

The paper tackles the high computational and memory costs of large language models by converting dense models to a mixture-of-experts architecture through dynamic structural pruning, achieving consistent performance improvements over prior pruning techniques across multiple model families without fine-tuning.

Large Language Models (LLMs) have demonstrated remarkable abilities in tackling a wide range of complex tasks. However, their huge computational and memory costs raise significant challenges in deploying these models on resource-constrained devices or efficiently serving them. Prior approaches have attempted to alleviate these problems by permanently removing less important model structures, yet these methods often result in substantial performance degradation due to the permanent deletion of model parameters. In this work, we mitigate this issue by reducing the number of active parameters without permanently removing them. Specifically, we introduce a differentiable dynamic pruning method that pushes dense models to maintain a fixed number of active parameters by converting their MLP layers into a Mixture of Experts (MoE) architecture. Our method, even without fine-tuning, consistently outperforms previous structural pruning techniques across diverse model families, including Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5.
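
The core idea in the abstract, keeping every MLP weight but activating only a fixed subset per token, can be illustrated with a small sketch. This is not the paper's differentiable pruning procedure: the equal-size partition of hidden units, the random top-1 router, and all names below (`make_experts`, `route_top1`, `moe_forward`, `active_params`) are assumptions made for illustration only. It uses plain Python so it runs without dependencies.

```python
import random


def relu(v):
    return v if v > 0.0 else 0.0


def mlp_forward(x, w_in, w_out, active):
    """Two-layer ReLU MLP evaluated only on the hidden units in `active`."""
    d_out = len(w_out[0])
    y = [0.0] * d_out
    for j in active:
        h = relu(sum(xi * wij for xi, wij in zip(x, w_in[j])))
        for k in range(d_out):
            y[k] += h * w_out[j][k]
    return y


def make_experts(hidden, num_experts):
    """Assumed scheme: split hidden units into equal, disjoint groups."""
    size = hidden // num_experts
    return [list(range(e * size, (e + 1) * size)) for e in range(num_experts)]


def route_top1(x, w_router):
    """Hypothetical linear router: pick the expert with the highest score."""
    scores = [sum(xi * wi for xi, wi in zip(x, row)) for row in w_router]
    return max(range(len(scores)), key=scores.__getitem__)


def moe_forward(x, w_in, w_out, w_router, experts):
    """Run only the routed expert; no weights are deleted."""
    chosen = route_top1(x, w_router)
    return mlp_forward(x, w_in, w_out, experts[chosen]), chosen


def active_params(expert, d_in, d_out):
    """Weights touched for one token: one input row and one output row per unit."""
    return len(expert) * (d_in + d_out)


rng = random.Random(0)
D_IN, HIDDEN, D_OUT, NUM_EXPERTS = 4, 8, 4, 4
w_in = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(HIDDEN)]
w_out = [[rng.uniform(-1, 1) for _ in range(D_OUT)] for _ in range(HIDDEN)]
w_router = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(NUM_EXPERTS)]
experts = make_experts(HIDDEN, NUM_EXPERTS)

x = [rng.uniform(-1, 1) for _ in range(D_IN)]
dense = mlp_forward(x, w_in, w_out, range(HIDDEN))                # all hidden units
sparse, chosen = moe_forward(x, w_in, w_out, w_router, experts)  # one expert only
```

In this toy setup each token touches the weights of one expert (a quarter of the MLP), while the remaining weights stay in the model and can be selected for other tokens. Summing the outputs of all experts reproduces the dense output, which is what distinguishes this kind of dynamic selection from permanent structural pruning.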
