LG PF | Aug 24, 2025

MoE-Inference-Bench: Performance Evaluation of Mixture of Expert Large Language and Vision Models

Krishna Teja Chitty-Venkata, Sylvia Howland, Golara Azar, Daria Soboleva, Natalia Vassilieva, Siddhisanket Raskar, Murali Emani, Venkatram Vishwanath

arXiv:2508.17467v1 | 7 citations | h-index: 19 | SC25-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis

Originality: Synthesis-oriented

AI Analysis

This work addresses the efficient deployment of MoE models for researchers and practitioners. Its contribution is incremental: it benchmarks existing optimization techniques rather than introducing new methods.

The paper tackles inference-time challenges of Mixture of Experts (MoE) models, such as load imbalance across experts and routing overhead. It presents MoE-Inference-Bench, an evaluation of hardware acceleration techniques on Nvidia H100 GPUs, and reports how throughput varies with configuration choices such as batch size and sequence length.

Mixture of Experts (MoE) models have enabled the scaling of Large Language Models (LLMs) and Vision Language Models (VLMs) by achieving massive parameter counts while maintaining computational efficiency. However, MoEs introduce several inference-time challenges, including load imbalance across experts and the additional routing computational overhead. To address these challenges and fully harness the benefits of MoE, a systematic evaluation of hardware acceleration techniques is essential. We present MoE-Inference-Bench, a comprehensive study to evaluate MoE performance across diverse scenarios. We analyze the impact of batch size, sequence length, and critical MoE hyperparameters such as FFN dimensions and number of experts on throughput. We evaluate several optimization techniques on Nvidia H100 GPUs, including pruning, Fused MoE operations, speculative decoding, quantization, and various parallelization strategies. Our evaluation includes MoEs from the Mixtral, DeepSeek, OLMoE and Qwen families. The results reveal performance differences across configurations and provide insights for the efficient deployment of MoEs.
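The load imbalance mentioned in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the paper's benchmark code: it assumes a standard top-k router, uses random numbers as stand-in router logits, and the names `route_top_k`, `expert_load`, and `imbalance` are hypothetical helpers introduced here for illustration only.

```python
import random
from collections import Counter


def route_top_k(logits, k):
    """Return the indices of the k largest router logits for one token."""
    return sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]


def expert_load(num_tokens, num_experts, k, bias=None, seed=0):
    """Count how many token assignments each expert receives.

    Router logits are random stand-ins (an assumption, not real model
    outputs), optionally shifted by a per-expert bias to mimic a router
    that favors some experts.
    """
    rng = random.Random(seed)
    bias = bias or [0.0] * num_experts
    counts = Counter()
    for _ in range(num_tokens):
        logits = [rng.gauss(0.0, 1.0) + bias[e] for e in range(num_experts)]
        counts.update(route_top_k(logits, k))
    return [counts[e] for e in range(num_experts)]


def imbalance(load):
    """Ratio of the busiest expert's load to the mean load (1.0 = balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean


# Expert 0 gets a positive bias, so it should attract more tokens than the rest.
load = expert_load(num_tokens=1024, num_experts=8, k=2, bias=[1.0] + [0.0] * 7)
print(load, round(imbalance(load), 2))
```

An imbalance ratio well above 1.0 means the busiest expert handles far more tokens than average, which in a real deployment can leave the devices hosting the other experts underused during a batch.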
