MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
This work addresses the cost of deploying large MoE models, offering incremental improvements in quantization through mixed-precision configurations tailored to the model and the available hardware.
The paper tackles the deployment challenges of Mixture-of-Experts (MoE) models by introducing MxMoE, a mixed-precision quantization framework that optimizes for both accuracy and performance. It reports up to 3.4x speedup over full precision and lower perplexity than existing methods at low bit-widths.
Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
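To make the idea of a joint accuracy and performance search concrete, the following is a minimal sketch and not the MxMoE implementation. It brute-forces a tiny design space, picking one weight bit-width per expert to minimize an estimated loss increase plus a weighted runtime estimate, subject to an average bit-width budget. The `Expert` class, the `search` and `est_time` functions, the roofline-style cost model, and the hardware constants are all assumptions introduced for illustration; the per-bit-width loss values would in practice come from a sensitivity measurement that is not shown here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Expert:
    name: str
    n_params: int        # number of weights in the expert's linear blocks
    tokens: int          # tokens routed to this expert (activation frequency)
    loss_at_bits: dict   # hypothetical sensitivity: bit-width -> loss increase


# Hypothetical hardware numbers, chosen only for illustration.
BANDWIDTH_BYTES_PER_S = 1.0e12
FLOPS_PER_S = 1.0e14


def est_time(expert, bits):
    """Roofline-style estimate for weight-only quantization (an assumption).

    Rarely activated experts are dominated by weight traffic, so fewer bits
    help; heavily activated experts are dominated by compute, which this toy
    model treats as independent of the weight bit-width.
    """
    mem = expert.n_params * bits / 8 / BANDWIDTH_BYTES_PER_S
    compute = 2 * expert.n_params * expert.tokens / FLOPS_PER_S
    return max(mem, compute)


def search(experts, candidate_bits, avg_bit_budget, lam):
    """Exhaustively pick one bit-width per expert under an average-bit budget."""
    total_params = sum(e.n_params for e in experts)
    best_score, best_cfg = None, None
    for cfg in product(candidate_bits, repeat=len(experts)):
        pairs = list(zip(experts, cfg))
        avg_bits = sum(e.n_params * b for e, b in pairs) / total_params
        if avg_bits > avg_bit_budget:
            continue  # configuration exceeds the memory budget
        loss = sum(e.loss_at_bits[b] for e, b in pairs)
        time = sum(est_time(e, b) for e, b in pairs)
        score = loss + lam * time  # lam trades accuracy against runtime
        if best_score is None or score < best_score:
            best_score, best_cfg = score, cfg
    if best_cfg is None:
        raise ValueError("no configuration satisfies the bit budget")
    return {e.name: b for e, b in zip(experts, best_cfg)}


if __name__ == "__main__":
    experts = [
        Expert("sensitive", 1_000_000, 1, {2: 0.9, 3: 0.3, 4: 0.1}),
        Expert("robust", 1_000_000, 1, {2: 0.2, 3: 0.1, 4: 0.05}),
    ]
    print(search(experts, candidate_bits=(2, 3, 4), avg_bit_budget=3.0, lam=0.0))
```

Exhaustive search is used here only because it is easy to verify on a handful of experts; the design space grows exponentially with the number of blocks, so a real system would need a scalable solver. With `lam=0.0` the search reduces to pure sensitivity-driven allocation, which gives the more sensitive expert the higher bit-width.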