Characterizing GPU Resilience and Impact on AI/HPC Systems
This work addresses GPU reliability for AI/HPC system operators and provides empirical data to guide hardware and software design, though its contribution is incremental because it builds on prior resilience studies.
This study analyzes GPU resilience in a large-scale AI/HPC system using 2.5 years of operational data. It finds that H100 GPU memory is less resilient than A100 GPU memory, with a 3.2x lower per-GPU mean time between errors (MTBE) for memory errors, and projects that 5% overprovisioning is needed to handle GPU failures.
This study characterizes GPU resilience in Delta HPC, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. Delta HPC is operated by the National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign. We used 2.5 years of operational GPU error data (11.7 million GPU hours). Our major findings are: (i) H100 GPU memory resilience is worse than A100 GPU memory resilience, with a 3.2x lower per-GPU mean time between errors (MTBE) for memory errors; (ii) the GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity; (iii) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components; (iv) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level; and (v) projecting the impact of GPU node availability to larger scales, we find that significant overprovisioning of 5% is necessary to handle GPU failures.
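To make the two headline quantities concrete, the arithmetic behind an MTBE comparison and an overprovisioning figure can be sketched as follows. This is a minimal illustration, not the study's methodology: it assumes MTBE is computed as operational GPU hours divided by the number of observed errors, and it reads "5% overprovisioning" as provisioning 5% more nodes than a job requires. The function names and all input numbers are hypothetical and are not taken from the Delta data.

```python
def mtbe_hours(gpu_hours: float, error_count: int) -> float:
    """Mean time between errors: GPU hours of operation per observed error.

    Assumption: a simple hours / errors ratio; the study may normalize differently.
    """
    if error_count <= 0:
        raise ValueError("error_count must be positive to compute an MTBE")
    return gpu_hours / error_count


def provisioned_nodes(required_nodes: int, overprovision_percent: int) -> int:
    """Nodes to provision so that spares cover the given percentage of the requirement.

    Assumption: overprovisioning means adding spare nodes on top of the
    required count, rounded up to a whole node (integer ceiling division).
    """
    spare = -(-required_nodes * overprovision_percent // 100)
    return required_nodes + spare


# Hypothetical inputs chosen only to reproduce a 3.2x MTBE gap.
a100_mtbe = mtbe_hours(gpu_hours=3200.0, error_count=10)  # 320 hours per error
h100_mtbe = mtbe_hours(gpu_hours=1000.0, error_count=10)  # 100 hours per error
mtbe_ratio = a100_mtbe / h100_mtbe  # how many times lower the H100 MTBE is

# A hypothetical 1,000-node job with 5% spares.
total_nodes = provisioned_nodes(required_nodes=1000, overprovision_percent=5)
```

A lower MTBE means errors arrive more often, so a ratio above 1 here corresponds to the H100 figure being that many times lower than the A100 figure.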