Ramón Beivide

2papers

2 Papers

10.1DCMay 27
Resource Allocation in HyperX Networks

Alejandro Cano, Cristóbal Camarero, Carmen Martínez et al.

As high-performance computing systems scale in size and complexity, efficient resource management is essential to minimize communication overhead. The HyperX is a richly connected, low-diameter network that offers a scalable and cost-effective alternative to traditional topologies. However, resource allocation in HyperX remains underexplored, and strategies designed for networks like Torus, Fat-tree, or Dragonfly do not directly transfer. In this work, we propose and formalize several resource allocation strategies for HyperX networks, categorized into linear, geometric, and stochastic functions. We characterize these strategies theoretically by analyzing their topological properties, including dilation, convexity, and partition bandwidth.Furthermore, we conduct an exhaustive experimental evaluation using synthetic traffic and application communication kernels to assess the impact of these strategies on performance under different routing algorithms. Our results indicate that partition bandwidth and switch locality are decisive factors in mitigating interferences. Notably, the Diagonal allocation strategy, which is not convex, consistently outperforms traditional approaches in most scenarios. Finally, we provide a set of lessons learned to guide the implementation of resource allocation policies in HPC systems based on HyperX networks.

22.2NIMay 26
Extreme-Scale Interconnection Networks

Alejandro Cano, Cristina Brinza, Cristóbal Camarero et al.

Extreme-scale data centers are the backbone of next-generation computing, enabling breakthroughs in science, artificial intelligence, and global innovation through unprecedented processing power and scalability. This work examines leaf-spine network topologies that offer extreme scalability--connecting a vast number of endpoints--while delivering strong performance at low cost. It takes as a starting point two alternatives to the widely used Fat-Tree topology: the Orthogonal Fat-Tree and the Random Folded Clos. The resulting Multipass Random Leaf-Spine (MRLS) networks inherit their advantages and surpass Fat-Trees in both throughput and flexibility. To fully leverage the topological properties of these networks, various non-minimal routing strategies are considered. An exhaustive evaluation using an interconnection network simulator provides insight into the trade-offs and scalability of these topologies under realistic conditions, positioning them as a promising solution for extreme-scale systems. The MRLS achieves a 50% speedup against a Fat-Tree for an All2All collective comprising 100k endpoints, and 100% against Dragonfly networks for the same collective.