LG · CL · Oct 3, 2023

Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness

Microsoft
arXiv:2310.02410v1 · 37 citations · h-index: 23
Originality: Incremental advance
AI Analysis

This work addresses the deployment challenges of large MoE models on language tasks such as machine translation, offering a practical way to reduce memory use and latency.

The paper tackles the memory consumption and memory-bandwidth bottleneck of large Mixture of Experts (MoE) models by proposing Mixture of Quantized Experts (MoQE), a weight-only quantization method that applies ultra-low-bit (down to 2-bit) quantization to expert weights only. It reduces model size by 79.6% relative to the fp16 MoE model and achieves a 1.24X speed-up on A100 GPUs, while matching or exceeding the quality of a dense model trained on the same dataset.

Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks, including machine translation, thanks to their efficient model scaling with expert parallelism. However, this scaling brings a fundamental problem at deployment time: larger memory consumption and a tighter memory-bandwidth bottleneck. In this paper, we propose Mixture of Quantized Experts (MoQE), a simple weight-only quantization method that applies ultra-low-bit quantization, down to 2 bits, only to expert weights in order to mitigate the increased memory and latency of MoE models. We show that low-bit quantization combined with the MoE architecture delivers reliable model performance while reducing memory size significantly, in most cases without any additional training. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In our comprehensive analysis, we show that MoE models with 2-bit expert weights can deliver better model performance than a dense model trained on the same dataset. As a result of low-bit quantization, the model size can be reduced by 79.6% relative to the original half-precision floating-point (fp16) MoE model. Combined with an optimized GPU runtime implementation, MoQE also achieves a 1.24X speed-up on A100 GPUs.
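
To make the idea of weight-only quantization applied only to expert weights concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the abstract does not specify the quantization scheme, so the sketch assumes symmetric round-to-nearest quantization with one scale per weight row. The function names, the "expert" name prefix used to pick out expert weights, and the size estimate (which ignores the storage cost of the scales) are all hypothetical choices made for illustration.

```python
import random


def quantize_rows(weights, bits=2):
    """Symmetric round-to-nearest quantization, one scale per row.

    Assumption: per-row (per-output-channel) scaling; the abstract does not
    state which scheme the authors use.
    """
    qmax = 2 ** (bits - 1) - 1   # largest integer code, e.g. 1 for 2-bit
    qmin = -(2 ** (bits - 1))    # smallest integer code, e.g. -2 for 2-bit
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer code, then clip to the representable range.
        q_rows.append([min(qmax, max(qmin, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Map integer codes back to approximate floating-point weights."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def quantize_experts_only(model, bits=2):
    """Quantize only expert weights; leave every other weight untouched.

    `model` is a hypothetical dict mapping parameter names to weight matrices
    (lists of rows). Names starting with "expert" are treated as expert weights.
    """
    out = {}
    for name, w in model.items():
        out[name] = quantize_rows(w, bits) if name.startswith("expert") else w
    return out


def compressed_fraction(expert_fraction, bits, base_bits=16):
    """Model size relative to fp16 when only expert weights drop to `bits`.

    Back-of-envelope estimate: ignores the per-row scales.
    """
    return expert_fraction * bits / base_bits + (1.0 - expert_fraction)
```

For example, calling `quantize_experts_only({"attention.w": w, "expert0.w": w}, bits=2)` returns the attention matrix unchanged and replaces the expert matrix with a pair of integer codes and per-row scales. The size estimate also shows why restricting quantization to experts can still shrink the model substantially: the larger the share of parameters held in expert layers, the closer the overall size gets to `bits / 16` of the fp16 model.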

