Jesus Escudero-Sahuquillo

2.1NIApr 17

Characterization of Real Communication Patterns and Congestion Dynamics in HPC Interconnection Networks

Miguel Sánchez de La Rosa, Gabriel Gomez-Lopez, Alejandro Baviera et al.

The interconnection network is a key component of Supercomputers and Data centers, and its design must cope with the increasing communication demands of current applications and services; otherwise, it may become a system bottleneck. The most challenging network design issues are the topology, routing algorithm, flow control, and power efficiency. However, even the most efficient interconnection networks may suffer severe performance degradation due to congestion, especially under specific network traffic patterns generated by communication operations in high-performance computing~(HPC), deep learning training, or online data-intensive services. In this context, characterizing and modeling these communication operations and the network traffic patterns they generate is a fundamental challenge for studying their impact on network performance. This paper presents a methodology, based primarily on the VEF Traces framework, to characterize, model, and simulate the communication patterns of representative computing- and data-intensive applications. More precisely, we have extended the VEF traces framework with tools that enable us to characterize network congestion, either directly from VEF traces or via simulations. We have analyzed a set of VEF traces obtained from runs of NEST, GROMACS, LAMMPS, and PATMOS on several Supercomputers. In these studies, we identify potential congestion scenarios that arise in realistic network configurations when certain collective operations are performed.

2.1NIApr 21

On the Power Saving in High-Speed Ethernet-based Networks for Supercomputers and Data Centers

Miguel Sánchez de la Rosa, Francisco J. andújar, Jesus Escudero-Sahuquillo et al.

The increase in computation and storage has led to a significant growth in the scale of systems powering applications and services, raising concerns about sustainability and operational costs. In this paper, we explore power-saving techniques in high-performance computing (HPC) and datacenter networks, and their relation with performance degradation. From this premise, we propose leveraging Energy Efficient Ethernet (EEE) protocol, with the flexibility to extend to conventional Ethernet or upcoming Ethernet-derived interconnect versions of BXI and Omnipath. We analyze the PerfBound power-saving mechanism, identifying possible improvements and modeling it into a simulation framework. Through different experiments, we examine its impact on performance and determine the most appropriate interconnect. We also study traffic patterns generated by selected HPC and machine learning applications to evaluate the behavior of power-saving techniques. From these experiments, we provide an analysis of how applications affect system and network energy consumption. Based on this, we disclose the weakness of dynamic power-down mechanisms and propose an approach that improves energy reduction with minimal or no performance penalty. To the best of our knowledge, this work presents the first thorough analysis of PerfBound and an enhancement to the technique, while also targeting emerging post-exascale networks.

Jesus Escudero-Sahuquillo

2 Papers