LG · CL · Jan 25, 2025

ToMoE: Converting Dense Large Language Models to Mixture-of-Experts through Dynamic Structural Pruning

arXiv:2501.15316v1 · 9 citations · h-index: 10 · Trans. Mach. Learn. Res.
Originality: Incremental advance
AI Analysis

This work addresses deployment challenges for large language models on resource-constrained devices, offering a novel approach to reduce active parameters without permanent deletion, though it is incremental in the context of existing pruning and MoE methods.

The paper tackles the high computational and memory costs of large language models by converting dense models to a mixture-of-experts architecture through dynamic structural pruning, achieving consistent performance improvements over prior pruning techniques across multiple model families without fine-tuning.

Large Language Models (LLMs) have demonstrated remarkable abilities in tackling a wide range of complex tasks. However, their huge computational and memory costs raise significant challenges in deploying these models on resource-constrained devices or serving them efficiently. Prior approaches have attempted to alleviate these problems by permanently removing less important model structures, yet these methods often result in substantial performance degradation because the deleted parameters cannot be recovered. In this work, we aim to mitigate this issue by reducing the number of active parameters without permanently removing them. Specifically, we introduce a differentiable dynamic pruning method that pushes dense models to maintain a fixed number of active parameters by converting their MLP layers into a Mixture of Experts (MoE) architecture. Our method, even without fine-tuning, consistently outperforms previous structural pruning techniques across diverse model families, including Phi-2, LLaMA-2, LLaMA-3, and Qwen-2.5.
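To make the idea of "active parameters without permanent removal" concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a two-layer ReLU MLP, a hypothetical fixed partition of hidden units into experts (unit `j` belongs to expert `j % num_experts`), and a random, untrained linear router; the abstract describes a learned, differentiable procedure instead. All names (`dense_mlp`, `moe_mlp`, `router`, `top_k`) are illustrative.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def dense_mlp(x, w_in, w_out):
    """Dense MLP: y = W_out · relu(W_in · x); every hidden unit is used."""
    h = [max(0.0, a) for a in matvec(w_in, x)]
    return matvec(w_out, h)


def moe_mlp(x, w_in, w_out, router, num_experts, top_k):
    """Same weights as the dense MLP, but only top_k experts are evaluated.

    Assumption: hidden unit j belongs to expert j % num_experts.
    No weight is deleted; unselected units are simply skipped for this input.
    """
    scores = matvec(router, x)  # one routing score per expert
    chosen = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:top_k]
    active = [j for j in range(len(w_in)) if j % num_experts in chosen]
    # Compute only the active hidden units.
    h = {j: max(0.0, sum(w * a for w, a in zip(w_in[j], x))) for j in active}
    y = [sum(w_out[i][j] * h[j] for j in active) for i in range(len(w_out))]
    return y, active


rng = random.Random(0)
d, hidden, num_experts = 4, 8, 4
w_in = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(hidden)]
w_out = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]
router = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(num_experts)]
x = [rng.uniform(-1, 1) for _ in range(d)]

y_dense = dense_mlp(x, w_in, w_out)
y_full, active_full = moe_mlp(x, w_in, w_out, router, num_experts, top_k=num_experts)
y_sparse, active_sparse = moe_mlp(x, w_in, w_out, router, num_experts, top_k=2)
```

With `top_k` equal to `num_experts` the sketch reproduces the dense output, which illustrates that nothing has been removed. With `top_k=2` only half of the hidden units are evaluated for this input, and a different input may select a different half.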

