LG · Mar 6

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

arXiv:2603.06003v1 · 2 citations
Predicted impact: top 19% in LG · last 90 days
Originality: Incremental advance
AI Analysis

This work gives practitioners and researchers a way to deploy Sparse Mixture-of-Experts (SMoE) language models more efficiently by pruning experts to ease memory and throughput constraints.

This paper addresses the memory and throughput limitations of Sparse Mixture-of-Experts (SMoE) language models by proposing a non-uniform expert pruning method. EvoESAP, an evolutionary search framework, optimizes the layer-wise sparsity allocation under a fixed global budget. Compared with uniform pruning at the same sparsity, it improves open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while maintaining competitive multiple-choice accuracy.
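The budget-constrained search summarized above can be illustrated with a small sketch. It assumes a simple (1+λ) evolutionary loop whose mutation moves one pruned expert from one layer to another, so the global budget never changes. The function names, the `fitness` callable (a stand-in for a proxy score computed on a pruned model), and the toy sensitivity weights are all hypothetical and are not taken from the paper.

```python
import random

def uniform_allocation(num_layers, total_pruned):
    """Spread the pruning budget as evenly as possible across layers."""
    base, extra = divmod(total_pruned, num_layers)
    return [base + (1 if i < extra else 0) for i in range(num_layers)]

def mutate(alloc, experts_per_layer, rng):
    """Move one pruned expert between two layers; the total stays fixed."""
    child = list(alloc)
    donors = [i for i, n in enumerate(child) if n > 0]
    if not donors:
        return child
    src = rng.choice(donors)
    # Assumption: every layer must keep at least one expert.
    receivers = [i for i, n in enumerate(child)
                 if i != src and n < experts_per_layer - 1]
    if not receivers:
        return child
    dst = rng.choice(receivers)
    child[src] -= 1
    child[dst] += 1
    return child

def evolve(fitness, num_layers, experts_per_layer, total_pruned,
           generations=200, offspring=8, seed=0):
    """(1+lambda) search: keep the best allocation seen so far."""
    rng = random.Random(seed)
    best = uniform_allocation(num_layers, total_pruned)
    best_score = fitness(best)
    for _ in range(generations):
        children = [mutate(best, experts_per_layer, rng)
                    for _ in range(offspring)]
        for child in children:
            score = fitness(child)
            if score > best_score:
                best, best_score = child, score
    return best, best_score

# Toy fitness (hypothetical): pruning hurts more in layers with high weight.
sensitivity = [4.0, 1.0, 1.0, 0.5]

def toy_fitness(alloc):
    return -sum(w * n * n for w, n in zip(sensitivity, alloc))

best, score = evolve(toy_fitness, num_layers=4, experts_per_layer=8,
                     total_pruned=16)
print(best, score)
```

Because every mutation is a one-expert transfer, each candidate satisfies the global budget by construction, and the search never needs a repair or penalty step. In a real setting the toy fitness would be replaced by a score measured on the pruned model.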

Sparse Mixture-of-Experts (SMoE) language models achieve strong capability at low per-token compute, yet deployment remains memory- and throughput-bound because the full expert pool must be stored and served. Post-training expert pruning reduces this cost, but most methods focus on which experts to prune within each layer and default to a uniform layer-wise sparsity allocation, even though the allocation can strongly affect performance. We decouple pruning into within-layer expert ranking and across-layer budget allocation, and introduce the Expected Speculative Acceptance Proxy (ESAP), a speculative-decoding-inspired, teacher-forced metric that measures how well a pruned model matches the full model. ESAP is bounded and stable, enabling cheap comparison of many candidates without costly autoregressive decoding. Building on ESAP, we propose EvoESAP, an evolutionary search framework that optimizes a non-uniform layer-wise sparsity allocation under a fixed global budget while holding the within-layer pruning order fixed, making it a plug-and-play method compatible with within-layer criteria such as Frequency, EAN, SEER, and REAP. Across 7B-30B SMoE LLMs at 25% and 50% sparsity, EvoESAP consistently discovers non-uniform allocations that improve open-ended generation (up to +19.6% on MATH-500 at 50% sparsity) while preserving competitive multiple-choice accuracy compared with uniform pruning at the same sparsity.
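As a rough illustration of the kind of quantity a speculative-acceptance proxy could measure, the sketch below computes the expected acceptance rate from speculative sampling: the sum over tokens of min(p, q), averaged over teacher-forced positions, where p and q are the next-token distributions of the full and pruned models. This is an assumption about the metric's general shape based on the abstract, not the paper's exact definition of ESAP. The function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def acceptance_proxy(full_logits, pruned_logits):
    """Mean over positions of sum_x min(p_full(x), p_pruned(x)).

    Both arguments are lists of per-position logit vectors obtained by
    running each model on the same reference tokens (teacher forcing),
    so no autoregressive decoding is needed. The result lies in [0, 1].
    """
    if not full_logits or len(full_logits) != len(pruned_logits):
        raise ValueError("need the same non-zero number of positions")
    total = 0.0
    for f, g in zip(full_logits, pruned_logits):
        p, q = softmax(f), softmax(g)
        total += sum(min(a, b) for a, b in zip(p, q))
    return total / len(full_logits)

# Toy logits (hypothetical): two positions over a three-token vocabulary.
full = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
pruned = [[1.5, 0.7, -0.8], [0.0, 0.4, 2.5]]

same = acceptance_proxy(full, full)      # identical models: close to 1.0
close = acceptance_proxy(full, pruned)   # similar models: below 1.0
print(same, close)
```

The per-position term equals one minus the total variation distance between the two distributions, which is why a score of this form is bounded between 0 and 1 and can be compared directly across candidates.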

