CL, DC · Oct 16, 2024

EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai

arXiv:2410.12247v26.110 citations · h-index: 5

Originality: Highly original

AI Analysis

This work addresses inference efficiency challenges for large language models using MoE architectures, offering a domain-specific optimization that is incremental but provides strong performance gains.

The paper tackles the computational and communication bottlenecks in Mixture-of-Experts (MoE) model inference by introducing EPS-MoE, an expert pipeline scheduler that dynamically selects kernel implementations and overlaps computation with communication. It reports up to a 52.4% improvement in prefill throughput and accelerates the DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.

The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), providing a better balance between model performance and computational efficiency. However, the General Matrix Multiply (GEMM) operations and large parameter counts introduce challenges related to computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy such as EP, DP, or TP, or a straightforward combination of them, to MoE usually achieves sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with communication, leading to a substantial increase in throughput. Our experimental results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods. Specifically, our method accelerates the highly optimized DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.
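The idea of overlapping expert computation with communication can be illustrated with a toy timing model. The sketch below is not the paper's scheduler. It assumes the work is split into chunks, that there is one communication stream and one compute stream, and that a chunk's compute can start only once its data has arrived. The function names and the per-chunk times are hypothetical and chosen only for illustration.

```python
def serial_time(comm, comp):
    """Total time when every chunk's communication and compute run back to back."""
    return sum(comm) + sum(comp)


def pipelined_time(comm, comp):
    """Finish time when compute on one chunk overlaps communication of later chunks.

    Assumption: one communication stream and one compute stream, each handling
    chunks in order; compute on a chunk waits for that chunk's data.
    """
    comm_done = 0.0  # time at which the current chunk's data has arrived
    comp_done = 0.0  # time at which the compute stream becomes free
    for c, g in zip(comm, comp):
        comm_done += c
        # Compute starts once both the data is present and the previous chunk is done.
        comp_done = max(comp_done, comm_done) + g
    return comp_done


if __name__ == "__main__":
    # Hypothetical per-chunk costs in arbitrary time units.
    comm = [2.0] * 4
    comp = [3.0] * 4
    print(serial_time(comm, comp))     # 20.0
    print(pipelined_time(comm, comp))  # 14.0
```

In this toy setting only the first chunk's communication is exposed, and the rest is hidden behind compute. How much of that benefit a real system recovers depends on chunk sizes, kernel efficiency at small loads, and hardware contention, none of which this model captures.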
