DC AR LG · Jun 5, 2025

FlashMoE: Fast Distributed MoE in a Single Kernel

Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh

arXiv:2506.04667v3 · 7 citations · h-index: 4 · Has Code

Originality: Highly original

AI Analysis

This work addresses GPU utilization and latency bottlenecks in distributed Mixture-of-Experts (MoE) layers, and reports substantial gains over existing implementations on a single 8-GPU node.

The paper tackles low GPU utilization and high latency in distributed Mixture-of-Experts (MoE) models with FlashMoE, a fully GPU-resident operator that fuses computation and communication into a single kernel. Compared to state-of-the-art baselines, it achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency.

The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, which improves payload efficiency by eliminating bloated or redundant network payloads in sparsely activated layers. When evaluated on a node with 8 H100 GPUs, using MoE models with up to 128 experts and 16K-token sequences, FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency compared to state-of-the-art baselines, despite using FP32, whereas the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
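Two claims in the abstract can be made concrete with a small back-of-the-envelope model: per-token compute depends on the number of experts each token activates rather than on the total expert count, and a dispatch that pads every expert buffer to a fixed capacity moves more rows than one that sends only the routed tokens. The sketch below is not FlashMoE code. The uniform random gate, the fixed-capacity padding scheme, and all function names (`route_top_k`, `dispatch_rows`, `ffn_flops_per_token`) are assumptions chosen for illustration.

```python
import random


def route_top_k(num_tokens, num_experts, k, seed=0):
    """Pick k distinct experts per token uniformly at random.

    Hypothetical stand-in for a learned gate; real gates are not uniform.
    """
    rng = random.Random(seed)
    return [rng.sample(range(num_experts), k) for _ in range(num_tokens)]


def dispatch_rows(assignments, num_experts, capacity):
    """Return (padded_rows, routed_rows) moved during the dispatch phase.

    padded_rows: every expert buffer is sent at full capacity, as in an
                 assumed fixed-size, padded all-to-all.
    routed_rows: only the token rows actually routed are sent; tokens
                 beyond an expert's capacity are dropped.
    """
    counts = [0] * num_experts
    for experts in assignments:
        for e in experts:
            counts[e] += 1
    padded_rows = num_experts * capacity
    routed_rows = sum(min(c, capacity) for c in counts)
    return padded_rows, routed_rows


def ffn_flops_per_token(hidden, ffn, k):
    """FLOPs per token for k two-layer expert FFNs (2 FLOPs per multiply-add).

    The total number of experts does not appear: adding experts grows the
    parameter count but not the per-token compute.
    """
    return k * 2 * (2 * hidden * ffn)


if __name__ == "__main__":
    tokens, experts, k = 1024, 8, 2
    assignments = route_top_k(tokens, experts, k)
    padded, routed = dispatch_rows(assignments, experts, capacity=tokens)
    print(f"padded rows: {padded}, routed rows: {routed}")
    print(f"FLOPs/token at k={k}: {ffn_flops_per_token(2048, 8192, k)}")
```

With capacity set to the token count, no token can be dropped, so the routed row count equals tokens times k, while the padded count grows with the number of experts. How closely this matches the payload savings reported in the paper depends on the baseline's actual padding policy, which this sketch only approximates.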
