LG AR Jun 9, 2025

MoE-GPS: Guidelines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing

arXiv:2506.07366v1
2 citations
h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses load balancing for efficient MoE inference in multi-GPU systems, representing an incremental improvement over existing methods.

The paper tackles load imbalance in multi-GPU Mixture-of-Experts (MoE) networks by proposing MoE-GPS, a framework that guides the selection of a prediction strategy for dynamic expert duplication. On Mixtral 8x7B with the MMLU dataset, the strategy it recommends, Distribution-Only Prediction, improves end-to-end inference performance by more than 23% compared with Token-to-Expert Prediction.

In a multi-GPU Mixture-of-Experts (MoE) network, experts are distributed across different GPUs, which creates load imbalance because each expert processes a different number of tokens. Recent works improve MoE inference load balance by dynamically duplicating popular experts onto more GPUs to process the excess tokens, which requires predicting the token distribution before routing. In this paper, we discuss the trade-offs among prediction strategy, accuracy, overhead, and end-to-end system performance. We propose MoE-GPS, a framework that guides the selection of the optimal predictor design under various system configurations by quantifying its impact on system-level model runtime. Specifically, we advocate for Distribution-Only Prediction, a strategy that predicts only the overall token distribution and therefore significantly reduces overhead compared to the traditional Token-to-Expert Prediction. On Mixtral 8x7B with the MMLU dataset, MoE-GPS suggests Distribution-Only Prediction, which improves end-to-end inference performance by more than 23% compared with Token-to-Expert Prediction.
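
To make the idea of duplication driven by a predicted distribution concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the greedy policy, the function names `plan_replicas` and `max_load_per_replica`, and the token counts are all hypothetical, chosen only for illustration. The sketch assumes that a predictor has already produced a per-expert token count (the only output a Distribution-Only predictor needs to supply) and that a fixed number of spare replica slots is available across the GPUs.

```python
import heapq


def plan_replicas(predicted_tokens, extra_slots):
    """Greedily hand spare replica slots to the most heavily loaded experts.

    predicted_tokens: predicted token count per expert (hypothetical input;
        only the overall distribution is needed, not per-token routing).
    extra_slots: number of additional expert copies that can be placed.
    Returns the number of replicas assigned to each expert.
    """
    replicas = [1] * len(predicted_tokens)
    # heapq is a min-heap, so store the negated per-replica load.
    heap = [(-float(t), e) for e, t in enumerate(predicted_tokens)]
    heapq.heapify(heap)
    for _ in range(extra_slots):
        _, expert = heapq.heappop(heap)
        replicas[expert] += 1
        per_replica = predicted_tokens[expert] / replicas[expert]
        heapq.heappush(heap, (-per_replica, expert))
    return replicas


def max_load_per_replica(tokens, replicas):
    """Worst-case tokens handled by a single expert copy (a proxy for imbalance)."""
    return max(t / r for t, r in zip(tokens, replicas))


# Illustrative, made-up distribution: 8 experts, one of them very popular.
tokens = [400, 100, 100, 100, 100, 100, 50, 50]
plan = plan_replicas(tokens, extra_slots=2)
before = max_load_per_replica(tokens, [1] * len(tokens))
after = max_load_per_replica(tokens, plan)
print(plan, before, round(after, 1))
```

In this toy setting both spare slots go to the popular expert, and the heaviest per-copy load falls from 400 tokens to about 133. A Token-to-Expert predictor would additionally have to say which expert each individual token goes to, and that extra work is the overhead the abstract weighs against accuracy.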

