CL | Apr 10, 2025

Cluster-Driven Expert Pruning for Mixture-of-Experts Large Language Models

arXiv:2504.07807v1 | 4 citations | h-index: 16 | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses a critical deployment problem for users of large MoE models (the abstract cites GPT-4 as an example) by enabling more efficient compression, though it is an incremental improvement over prior pruning approaches.

The paper tackles the large parameter footprint of Mixture-of-Experts (MoE) large language models, which suffer from expert redundancy. It proposes a two-stage pruning framework, C-Prune, that first groups similar experts within each layer and then removes redundant groups across layers. The authors report that it reduces model size while outperforming existing MoE pruning methods.

Mixture-of-Experts (MoE) architectures have emerged as a promising paradigm for scaling large language models (LLMs) with sparse activation of task-specific experts. Despite their computational efficiency during inference, the massive overall parameter footprint of MoE models (e.g., GPT-4) introduces critical challenges for practical deployment. Current pruning approaches often fail to address two inherent characteristics of MoE systems: (1) intra-layer expert homogeneity, where experts within the same MoE layer exhibit functional redundancy, and (2) inter-layer similarity patterns, where deeper layers tend to contain progressively more homogeneous experts. To tackle these issues, we propose Cluster-driven Expert Pruning (C-Prune), a novel two-stage framework for adaptive task-specific compression of MoE LLMs. C-Prune operates in two stages. The first, layer-wise expert clustering, groups functionally similar experts within each MoE layer using parameter similarity metrics. The second, global cluster pruning, eliminates redundant clusters across all layers through a unified importance scoring mechanism that accounts for cross-layer homogeneity. We validate C-Prune through extensive experiments on multiple MoE models and benchmarks. The results demonstrate that C-Prune effectively reduces model size while outperforming existing MoE pruning methods.
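To make the two stages described in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each expert is represented by a flattened weight vector, uses cosine similarity as the parameter similarity metric, clusters greedily against a fixed threshold, and scores a cluster by the summed per-expert usage of its members. The function names, the `threshold` and `usage` inputs, and the scoring rule are all hypothetical; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flat parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_layer(experts, threshold=0.9):
    """Stage 1 (assumed greedy variant): group similar experts in one layer.

    An expert joins the first cluster whose first member is at least
    `threshold`-similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of expert indices.
    """
    clusters = []
    for idx, weights in enumerate(experts):
        for members in clusters:
            if cosine(weights, experts[members[0]]) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def prune_clusters(layers, usage, n_prune, threshold=0.9):
    """Stage 2 (assumed scoring): drop the lowest-scoring clusters globally.

    `usage[l][e]` is a hypothetical importance value for expert e in layer l
    (for example, a routing frequency). A cluster's score is the summed usage
    of its members. Every layer keeps at least one cluster.
    Returns the set of (layer, expert) pairs to remove.
    """
    scored = []      # (score, layer index, member indices) for every cluster
    remaining = []   # number of clusters still kept in each layer
    for l, experts in enumerate(layers):
        clusters = cluster_layer(experts, threshold)
        remaining.append(len(clusters))
        for members in clusters:
            scored.append((sum(usage[l][e] for e in members), l, members))

    scored.sort(key=lambda item: item[0])  # rank clusters across all layers

    removed, pruned = set(), 0
    for _, l, members in scored:
        if pruned == n_prune:
            break
        if remaining[l] <= 1:
            continue  # never empty a layer completely
        remaining[l] -= 1
        pruned += 1
        removed.update((l, e) for e in members)
    return removed
```

For instance, given two toy layers of three two-dimensional experts each, `cluster_layer` groups near-parallel weight vectors together, and `prune_clusters(layers, usage, n_prune=1)` removes the single cluster with the lowest summed usage across both layers. A real pipeline would replace the usage-based score with the paper's importance score, which also accounts for cross-layer homogeneity.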

Code Implementations: 1 repo
