LG · CL · May 21

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

arXiv:2605.23078 · 77.4 · Has Code
Predicted impact: top 17% in LG · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the memory bottleneck of MoE LLMs for practitioners deploying large models, offering a more effective quantization approach than prior layer-wise methods.

GEMQ proposes a global expert-level mixed-precision quantization method for MoE LLMs that uses linear programming for bit-width allocation and router fine-tuning to adapt to quantization, achieving significant memory reduction and inference acceleration with minimal accuracy loss.

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .
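To make the idea of a global, expert-level bit-width allocation concrete, here is a minimal sketch. It is not the authors' implementation: the abstract describes a linear-programming formulation, whereas this example swaps in a small dynamic program over an integer bit budget so that it runs with the standard library alone. The function name `allocate_bits`, the error table, and the assumption that all experts have equal size are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def allocate_bits(errors: Sequence[Sequence[float]],
                  bit_choices: Sequence[int],
                  avg_bits: float) -> List[int]:
    """Choose one bit-width per expert, minimizing summed error under a global budget.

    errors[e][k] is a hypothetical quantization-error score for expert e when
    quantized to bit_choices[k]. Experts from all layers are pooled into one
    list (the "global" part). Assumes equally sized experts, so the memory
    budget can be expressed as a total number of bits across experts.
    """
    n = len(errors)
    budget = math.floor(avg_bits * n + 1e-9)  # total bits available
    inf = float("inf")

    # best[b] = minimum total error over the experts seen so far using exactly b bits
    best = [inf] * (budget + 1)
    best[0] = 0.0
    picks = []  # picks[e][b] = index into bit_choices used by expert e to reach b

    for e in range(n):
        new_best = [inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            if best[b] == inf:
                continue
            for k, bits in enumerate(bit_choices):
                nb = b + bits
                cost = best[b] + errors[e][k]
                if nb <= budget and cost < new_best[nb]:
                    new_best[nb] = cost
                    pick[nb] = k
        best = new_best
        picks.append(pick)

    end = min(range(budget + 1), key=lambda b: best[b])
    if best[end] == inf:
        raise ValueError("budget is too small for the lowest bit-width")

    # Walk back through the stored choices to recover the allocation.
    alloc = [0] * n
    b = end
    for e in range(n - 1, -1, -1):
        k = picks[e][b]
        alloc[e] = bit_choices[k]
        b -= bit_choices[k]
    return alloc


# Made-up error scores for four experts pooled across layers, at 2, 3 and 4 bits.
example_errors = [
    [9.0, 4.0, 1.0],  # sensitive expert
    [1.0, 0.5, 0.2],  # robust expert
    [6.0, 2.0, 0.5],  # sensitive expert
    [0.8, 0.4, 0.1],  # robust expert
]
print(allocate_bits(example_errors, (2, 3, 4), avg_bits=3.0))  # [4, 2, 4, 2]
```

With an average budget of 3 bits, the sketch gives the two sensitive experts 4 bits and the two robust experts 2 bits, which is the kind of uneven allocation that a uniform 3-bit setting cannot express. A layer-wise scheme would solve this separately per layer, so it could not move bits from a robust layer to a sensitive one.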
