NI AI · Apr 30

Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving

Junsun Choi, Sam Son, Sunjin Choi, Hansung Kim, Yakun Sophia Shao, Scott Shenker, Sylvia Ratnasamy, Borivoje Nikolic

arXiv:2605.00254 · 93.3

AI Analysis

For cloud providers deploying MoE LLMs, this work challenges the necessity of expensive high-bandwidth scale-up networks, offering a cost-effective alternative.

This paper analyzes network topologies for MoE LLM serving and finds that switchless topologies (e.g., 3D full-mesh) improve cost-effectiveness by 20.6-56.2% over scale-up networks. It also finds that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%.

Mixture-of-experts (MoE) architectures have turned LLM serving into a cluster-scale workload in which communication consumes a considerable portion of LLM serving runtime. This has prompted industry to invest heavily in expensive high-bandwidth scale-up networks. We question whether such costly infrastructure is strictly necessary. We present the first systematic cross-layer analysis of network cost-effectiveness for MoE LLM serving, comparing four representative XPU (e.g., GPU/TPU) topologies (scale-up, scale-out, 3D torus, and 3D full-mesh). We find that lower-cost switchless topologies are more cost-effective than the scale-up topology across all serving scenarios explored, improving cost-effectiveness by 20.6-56.2%. In particular, the 3D full-mesh topology is Pareto-optimal in terms of the performance-cost tradeoff. We also find that current scale-up link bandwidths are over-provisioned: reducing the link bandwidth improves throughput per cost by up to 27%. A forward-looking analysis of upcoming GPU generations indicates that the cost-performance advantage of switchless networks will likely persist.
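The abstract relies on two notions, throughput per cost and Pareto-optimality in the performance-cost tradeoff. The sketch below illustrates how such a comparison could be computed. It is not the authors' methodology: the `Design` type, the helper names, and every number are hypothetical placeholders, and the designs are deliberately labeled A to D rather than mapped to the paper's topologies.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    name: str
    cost: float        # relative network cost (hypothetical units)
    throughput: float  # relative serving throughput (hypothetical units)


def throughput_per_cost(d: Design) -> float:
    """Cost-effectiveness metric: throughput divided by cost."""
    return d.throughput / d.cost


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep designs that no other design dominates.

    A design is dominated if another one is no more expensive and
    no slower, and strictly better in at least one of the two.
    """
    front = []
    for d in designs:
        dominated = any(
            o.cost <= d.cost
            and o.throughput >= d.throughput
            and (o.cost < d.cost or o.throughput > d.throughput)
            for o in designs
        )
        if not dominated:
            front.append(d)
    return front


# Placeholder numbers for illustration only, NOT results from the paper.
designs = [
    Design("A", cost=1.0, throughput=1.0),
    Design("B", cost=2.0, throughput=1.5),
    Design("C", cost=2.0, throughput=1.2),
    Design("D", cost=3.0, throughput=1.4),
]

for d in pareto_front(designs):
    print(d.name, round(throughput_per_cost(d), 3))
```

With these placeholder values, C and D are dominated by B (same or lower cost, higher throughput), so only A and B remain on the front. A design can be Pareto-optimal without having the highest throughput per cost, which is why the two criteria are reported separately.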
