Multi-Plane HyperX: A Low-Latency and Cost-Effective Network for Large-Scale AI and HPC Systems
For system architects designing low-latency, cost-effective networks for large-scale AI and HPC clusters, this paper proposes a novel topology that outperforms existing state-of-the-art network topologies.
This work introduces multi-plane HyperX, a network topology for large-scale AI and HPC systems, and shows that it achieves a smaller network diameter and better cost-effectiveness than multi-plane Fat-Tree, Dragonfly, and Dragonfly+.
Multi-plane architectures have become increasingly prevalent in the Fat-Tree networks of AI data centers. These designs exploit multiple ports on a single network interface card (NIC) or multiple NICs within a scale-up domain: each port or NIC is allocated to an independent network plane, so the overall system is provisioned with multiple network planes. However, no prior literature has explored the application of multi-plane techniques to direct networks such as HyperX. This paper investigates the multi-plane HyperX network and demonstrates that, compared to state-of-the-art network topologies such as multi-plane Fat-Tree, Dragonfly, and Dragonfly+, it achieves a significantly smaller network diameter and superior cost-effectiveness.
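The multi-plane idea and the diameter claim can be illustrated with a minimal sketch. It is not taken from the paper. It assumes a regular HyperX in which switches are fully connected along each dimension, and a simple round-robin mapping of NIC ports to planes. The function names (`plane_of_port`, `hyperx_hops`, `hyperx_diameter`) and the example shapes are hypothetical and chosen only for illustration.

```python
from itertools import product


def plane_of_port(port_index: int, num_planes: int) -> int:
    """Assign a NIC port to a network plane (assumed round-robin mapping)."""
    return port_index % num_planes


def hyperx_hops(src: tuple, dst: tuple) -> int:
    """Minimal switch-to-switch hops inside one HyperX plane.

    Assumes a regular HyperX: switches are fully connected along each
    dimension, so every differing coordinate costs exactly one hop.
    """
    return sum(1 for a, b in zip(src, dst) if a != b)


def hyperx_diameter(shape: tuple) -> int:
    """Largest minimal hop count between any two switches in one plane.

    The topology is symmetric, so measuring from the all-zero switch
    is enough.
    """
    switches = list(product(*(range(size) for size in shape)))
    origin = switches[0]
    return max(hyperx_hops(origin, s) for s in switches)


if __name__ == "__main__":
    # Hypothetical example: a NIC with 4 ports spread over 4 planes,
    # each plane being an independent 2D HyperX of 4 x 4 switches.
    planes = [plane_of_port(p, 4) for p in range(4)]
    print("port -> plane:", planes)
    print("2D plane diameter:", hyperx_diameter((4, 4)))
```

Under these assumptions, the switch-to-switch diameter of a plane equals its number of dimensions, regardless of how many switches each dimension holds. Because the planes are independent, a packet injected on a given port stays within that port's plane from source to destination.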