The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
For GPU reliability engineers and LLM training operators, this work provides empirical data to improve SDC modeling and resilience evaluation. Its contribution is incremental: it applies known fault injection methods to a new GPU architecture.
The paper characterizes silent data corruption (SDC) in GPUs through large-scale gate-level fault injection. It finds that NaN/INF outcomes are rare (1.01% of SDC outcomes), that single-bit flips make up less than 40% of bit-flip events, and that corruption addresses show periodicity. These findings offer guidance for realistic fault modeling.
Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault injection campaign on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We characterized GPU SDC in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN, +INF, and -INF together account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of production-class GPU architectures.
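To make the three reported statistics concrete, the following is a minimal sketch of how corruption type, flipped-bit count, and address periodicity could be computed from a fault-free ("golden") output and a faulty output. It is not the paper's tooling. The function names (`classify`, `dominant_stride`, `summarize`), the assumption that outputs are float32 values, and the use of the most common gap between corrupted indices as a stand-in for periodicity are all illustrative assumptions.

```python
import math
import struct
from collections import Counter
from typing import List, Optional, Tuple


def float_bits(x: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def classify(golden: float, faulty: float) -> Tuple[str, int]:
    """Label one output element and count the bits that differ from golden.

    Assumption: outputs are float32 values; labels are hypothetical names.
    """
    flipped = bin(float_bits(golden) ^ float_bits(faulty)).count("1")
    if flipped == 0:
        return "masked", 0  # bit-identical to the golden output
    if math.isnan(faulty):
        return "nan", flipped
    if math.isinf(faulty):
        return ("+inf" if faulty > 0 else "-inf"), flipped
    return "value", flipped  # finite but wrong value


def dominant_stride(indices: List[int]) -> Optional[int]:
    """Most common gap between consecutive corrupted element indices.

    A crude periodicity proxy (assumption), not the paper's analysis.
    """
    ordered = sorted(set(indices))
    gaps = Counter(b - a for a, b in zip(ordered, ordered[1:]))
    return gaps.most_common(1)[0][0] if gaps else None


def summarize(golden: List[float], faulty: List[float]):
    """Aggregate corruption types, flipped-bit histogram, and stride."""
    kinds: Counter = Counter()
    bit_hist: Counter = Counter()
    corrupted: List[int] = []
    for i, (g, f) in enumerate(zip(golden, faulty)):
        kind, flipped = classify(g, f)
        if kind == "masked":
            continue
        kinds[kind] += 1
        bit_hist[flipped] += 1
        corrupted.append(i)
    return kinds, bit_hist, dominant_stride(corrupted)


if __name__ == "__main__":
    golden = [1.0] * 128
    faulty = list(golden)
    for i in (3, 35, 67, 99):  # corrupt one element every 32 positions
        faulty[i] = 1.75       # differs from 1.0 in two mantissa bits
    print(summarize(golden, faulty))
```

A software-based injector that draws from distributions like these, rather than always flipping one random bit, is the kind of distribution-aware model the statistics above argue for. The sketch only shows how such distributions might be measured.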