LG · DC · Sep 9, 2025

MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?

arXiv:2509.07727v1 · 3 citations · h-index: 5
Originality: Incremental advance
AI Analysis

This work addresses a critical problem for deploying large MoE models in resource-constrained environments, offering an incremental optimization to existing offloading techniques.

The paper tackles the challenge of efficiently serving Mixture of Experts (MoE) models under GPU memory constraints by compressing non-activated experts to reduce data transfer overhead. It finds that compression errors in shallow layers cause minimal accuracy degradation, errors in middle layers significantly impair accuracy, and errors in deep layers can sometimes improve accuracy.

With the widespread application of Mixture of Experts (MoE) reasoning models in the LLM field, efficiently serving MoE models under limited GPU memory has emerged as a significant challenge. Offloading non-activated experts to main memory is an efficient way to address this problem, but it introduces the cost of transferring experts between GPU memory and main memory. This calls for an efficient approach to compressing experts and an analysis of how the compression error affects inference accuracy. To bridge this gap, we propose employing error-bounded lossy compression algorithms (such as SZ3 and CuSZp) to compress non-activated experts, thereby reducing data transfer overhead during MoE inference. We conduct extensive experiments across various benchmarks and present a comprehensive analysis of how compression-induced errors in different experts affect overall inference accuracy. The results indicate that experts in the shallow layers, which are primarily responsible for the attention mechanism and the transformation of input tokens into vector representations, exhibit minimal degradation in inference accuracy when subjected to bounded errors. In contrast, errors in the middle-layer experts, which are central to model reasoning, significantly impair inference accuracy. Interestingly, introducing bounded errors in the deep-layer experts, which are mainly responsible for instruction following and output integration, can sometimes improve inference accuracy.
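To make the phrase "error-bounded lossy compression" concrete, the following is a minimal sketch of the contract such a compressor offers: every reconstructed value differs from the original by at most a user-chosen absolute bound. It is not the algorithm used by SZ3 or CuSZp, and it calls neither library. It is a toy uniform quantizer followed by zlib, and the names `compress`, `decompress`, `expert`, and `eb`, as well as the random weights, are assumptions made up for illustration.

```python
import random
import struct
import zlib


def compress(weights, error_bound):
    """Toy error-bounded compressor (illustrative only, not SZ3/CuSZp).

    Each value is mapped to an integer bin of width 2 * error_bound, so the
    bin centre is at most error_bound away from the original value. The
    integer codes are then packed and deflated with zlib.
    """
    step = 2.0 * error_bound
    codes = [round(w / step) for w in weights]
    payload = struct.pack(f"<{len(codes)}i", *codes)
    return zlib.compress(payload)


def decompress(blob, error_bound):
    """Invert compress(): inflate, unpack the codes, return the bin centres."""
    step = 2.0 * error_bound
    payload = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(payload) // 4}i", payload)
    return [c * step for c in codes]


# Hypothetical stand-in for one expert's weights (not real model data).
rng = random.Random(0)
expert = [rng.gauss(0.0, 0.02) for _ in range(4096)]

eb = 1e-3  # assumed absolute error bound
blob = compress(expert, eb)
restored = decompress(blob, eb)

max_err = max(abs(a - b) for a, b in zip(expert, restored))
raw_bytes = 4 * len(expert)  # size if the weights were stored as float32

print(f"max abs error: {max_err:.2e} (bound {eb:.0e})")
print(f"raw float32 bytes: {raw_bytes}, compressed bytes: {len(blob)}")
```

In an offloading setup of the kind the abstract describes, the compressed blob would be what moves between main memory and GPU memory, and the error bound would be the knob whose per-layer effect on accuracy the paper studies. A larger bound gives coarser bins and typically a smaller blob, at the cost of a larger perturbation to the expert's weights.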
