Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference
This work addresses the challenge of minimizing network traffic during MoE LLM inference on multi-server clusters, an incremental improvement in deployment efficiency.
The paper tackles the efficient deployment of Mixture-of-Experts (MoE) LLMs for inference by proposing an integer linear program (ILP) that optimizes expert placement based on network topology, reducing network traffic relative to competing approaches for models such as DeepSeekMoE 16B and DeepSeek-R1 671B.
Efficiently deploying a pre-trained LLM on a multi-server cluster is a critical step in providing fast responses to user queries. The recent success of Mixture-of-Experts (MoE) LLMs raises the question of how to deploy them efficiently given their underlying structure. During inference in MoE LLMs, only a small subset of experts is selected to process a given token. Moreover, in practice, the load across experts is highly imbalanced. For efficient deployment, the model must be distributed across a large number of servers using a model placement algorithm. To improve cluster utilization, this algorithm has to take the network topology into account. This work focuses on efficient topology-aware placement of pre-trained MoE LLMs at the inference stage. We propose an integer linear program (ILP) that determines the optimal placement of experts by minimizing the expected number of transmissions. Owing to its internal structure, this optimization problem can be solved with a standard ILP solver. We demonstrate that the ILP-based placement strategy yields lower network traffic than competing approaches for both small-scale (DeepSeekMoE 16B) and large-scale (DeepSeek-R1 671B) models.
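
To make the placement objective concrete, the following toy sketch may help. It is not the paper's ILP formulation: the cost model (expert activation probability multiplied by the hop distance of the hosting server), the per-server capacity limit, and all names (`expected_hops`, `best_placement`, `load`, `hops`, `capacity`) are assumptions chosen for illustration. Exhaustive search stands in for an ILP solver, which is only feasible at this tiny scale.

```python
from itertools import product


def expected_hops(placement, load, hops):
    """Expected transmissions per token under an assumed cost model:
    each expert e is used with probability load[e] and costs
    hops[s] transmissions when hosted on server s."""
    return sum(load[e] * hops[s] for e, s in enumerate(placement))


def best_placement(load, hops, capacity):
    """Enumerate every expert-to-server assignment that respects the
    (hypothetical) per-server capacity and keep the cheapest one.
    An ILP solver would search the same feasible set far more efficiently."""
    n_servers = len(hops)
    best, best_cost = None, float("inf")
    for placement in product(range(n_servers), repeat=len(load)):
        # Skip assignments that put too many experts on one server.
        if any(placement.count(s) > capacity for s in range(n_servers)):
            continue
        cost = expected_hops(placement, load, hops)
        if cost < best_cost:
            best, best_cost = placement, cost
    return best, best_cost


# Illustrative numbers only: four experts with imbalanced load,
# three servers at 0, 1, and 2 hops from where tokens originate.
load = [0.5, 0.3, 0.15, 0.05]
hops = [0, 1, 2]
capacity = 2

placement, cost = best_placement(load, hops, capacity)
```

Under these assumed numbers, the search places the two most heavily loaded experts on the zero-hop server, which is the intuition behind combining load imbalance with topology awareness.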