47.1 AR May 5
The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
Chung-Hsuan Tung, Yanxiang Huang, Nirmal Saxena et al.
Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault-injection campaign on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We extracted GPU SDC characteristics in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN/+INF/-INF account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of production-class GPU architectures.
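The bit-flip and corruption-type categories in this abstract can be made concrete with a small software fault-injection sketch. This is not the authors' tooling: it assumes float32 values, and the helper names (`flip_bits`, `classify`, `count_flipped_bits`) are hypothetical, chosen only for illustration.

```python
import math
import struct


def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x stored as float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def flip_bits(x: float, positions) -> float:
    """Flip the given bit positions (0 = LSB, 31 = sign) of a float32 value."""
    mask = 0
    for p in positions:
        mask |= 1 << p
    return bits_to_float(float_to_bits(x) ^ mask)


def classify(golden: float, observed: float) -> str:
    """Bucket an outcome into the corruption types named in the abstract."""
    if math.isnan(observed):
        return "NaN"
    if math.isinf(observed):
        return "+INF" if observed > 0 else "-INF"
    if float_to_bits(golden) == float_to_bits(observed):
        return "masked"  # no visible corruption
    return "finite"      # silently wrong but still a finite number


def count_flipped_bits(golden: float, observed: float) -> int:
    """Hamming distance between the two float32 bit patterns."""
    return bin(float_to_bits(golden) ^ float_to_bits(observed)).count("1")
```

A distribution-aware injector would draw the set of flipped positions from measured statistics rather than always flipping one bit; given that single-bit flips make up less than 40% of the reported events, a single-bit-only model would miss most of them.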
41.3 AR Apr 12
LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
Abhishek Tyagi, Saurabh Hukerikar, Nirmal Saxena et al.
Large-scale LLM training is increasingly susceptible to hardware defects stemming from manufacturing escapes and silicon aging. These defects manifest as Silent Data Corruption (SDC) that perturbs gradients and parameters throughout the training process. We present LLM-PRISM, a methodology to characterize LLM pre-training resilience to hardware faults. LLM-PRISM couples RTL-level GPU fault simulation with a stochastic injection engine embedded in Megatron-LM. Through 7,664 training runs across FP16, BF16, and FP8 regimes, we analyze how fault type, rate, and numeric format govern resilience. We find that while LLMs resist low-frequency faults, impact is highly non-uniform; critical datapaths and specific precision formats can induce catastrophic divergence even at moderate fault rates. This study provides the first hardware-grounded, pre-training characterization of SDC resilience.
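To illustrate how fault rate and numeric format interact, here is a minimal sketch of a rate-driven injector. It is not LLM-PRISM's engine and makes no use of Megatron-LM: it assumes one uniformly random bit flip per hit, covers only FP16 and BF16 (the Python standard library has no FP8 type), and models BF16 as the upper 16 bits of a float32. All function names are hypothetical.

```python
import random
import struct


def to_bits(x: float, fmt: str) -> int:
    """Encode x as a 16-bit pattern in the chosen format."""
    if fmt == "fp16":
        return struct.unpack("<H", struct.pack("<e", x))[0]
    if fmt == "bf16":
        # Assumption: BF16 modeled by truncating float32 to its top 16 bits.
        return struct.unpack("<I", struct.pack("<f", x))[0] >> 16
    raise ValueError(f"unsupported format: {fmt}")


def from_bits(b: int, fmt: str) -> float:
    """Decode a 16-bit pattern back to a Python float."""
    if fmt == "fp16":
        return struct.unpack("<e", struct.pack("<H", b))[0]
    if fmt == "bf16":
        return struct.unpack("<f", struct.pack("<I", b << 16))[0]
    raise ValueError(f"unsupported format: {fmt}")


def inject(values, rate: float, fmt: str, rng: random.Random):
    """Flip one random bit in each value with probability `rate`.

    Returns the (possibly corrupted) values and the number of injections.
    """
    out, hits = [], 0
    for v in values:
        bits = to_bits(v, fmt)
        if rng.random() < rate:
            bits ^= 1 << rng.randrange(16)
            hits += 1
        out.append(from_bits(bits, fmt))
    return out, hits
```

The same bit position lands in a different field in each format: FP16 has 5 exponent bits and BF16 has 8, so a uniformly chosen flip hits the exponent more often in BF16. That is one plausible reason format could matter, though the abstract does not say which formats are most fragile.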
8.8 NI Mar 20
RISE: Real-time Image Processing for Spectral Energy Detection and Localization
Chung-Hsuan Tung, Zhenzhou Qi, Tingjun Chen
Energy detection is widely used for spectrum sensing, but accurately localizing the time and frequency occupation of signals in real time for efficient spectrum sharing remains challenging. To address this challenge, we present RISE, a software-based spectrum sensing system designed for real-time signal detection and localization. RISE treats time-frequency spectrum plots as images and applies adaptive thresholding, morphological operations, and connected component labeling with a multi-threaded architecture. We evaluate RISE using both synthetic data and controlled over-the-air (OTA) experiments across diverse signal types. Results show that RISE satisfies real-time latency constraints while achieving a probability of detection of 80.42% at an intersection-over-union (IoU) threshold of 0.4. RISE sustains a raw I/Q input rate of 3.2 Gbps for 100 MHz bandwidth sensing with time and frequency resolutions of 10.24 µs and 97.6 kHz, respectively. Compared to Searchlight, a representative energy-based method, RISE achieves 20.51x lower latency and 22.31% higher IoU. Compared to machine learning baselines, RISE improves IoU by 56.02% over DeepRadar while meeting the real-time deadline, which a GPU-accelerated U-Net exceeds by 213.38x.
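The image-processing idea in this abstract can be sketched in a few lines. This is a simplified single-threaded illustration, not RISE itself: it assumes a median-plus-offset threshold as a stand-in for adaptive thresholding, omits the morphological step, uses 4-connectivity, and reports boxes as `(row_start, col_start, row_end, col_end)` with exclusive ends. All names are hypothetical.

```python
from collections import deque
from statistics import median


def threshold(spec, offset_db):
    """Mark cells more than offset_db above the median power (noise-floor proxy)."""
    floor = median(v for row in spec for v in row)
    return [[v > floor + offset_db for v in row] for row in spec]


def label_boxes(mask):
    """Connected-component labeling (4-connectivity); returns one box per component."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            r0, r1, c0, c1 = r, r, c, c
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1 + 1, c1 + 1))
    return boxes


def iou(a, b):
    """Intersection-over-union of two boxes in time-frequency cells."""
    h = min(a[2], b[2]) - max(a[0], b[0])
    w = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(h, 0) * max(w, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

Under the evaluation criterion quoted above, a detected box would count as a hit when its IoU with a ground-truth box is at least 0.4.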
72.9 SP Apr 2
Real-Time and Scalable Zak-OTFS Receiver Processing on GPUs
Junyao Zheng, Chung-Hsuan Tung, Yuncheng Yao et al.
Orthogonal time frequency space (OTFS) modulation offers superior robustness to high-mobility channels compared to conventional orthogonal frequency-division multiplexing (OFDM) waveforms. However, its explicit delay-Doppler (DD) domain representation incurs substantial signal processing complexity, especially with increased DD-domain grid sizes. To address this challenge, we present a scalable, real-time Zak-OTFS receiver architecture on GPUs through hardware-algorithm co-design that exploits DD-domain channel sparsity. Our design leverages compact matrix operations for key processing stages, a branchless iterative equalizer, and a structured sparse representation of the DD-domain channel matrix to significantly reduce computational and memory overhead. These optimizations enable low-latency processing that consistently meets the 99.9th-percentile real-time processing deadline. The proposed system achieves up to 906.52 Mbps throughput with a DD grid size of (16384,32) using 16QAM modulation over 245.76 MHz bandwidth. Extensive evaluations under a Vehicular-A channel model demonstrate strong scalability and robust performance across CPU (Intel Xeon) and multiple GPU platforms (NVIDIA Jetson Orin, RTX 6000 Ada, A100, and H200), highlighting the effectiveness of compute-aware Zak-OTFS receiver design for next-generation (NextG) high-mobility communication systems.
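The quoted throughput can be put in context with a short back-of-the-envelope calculation. The sketch below rests on assumptions the abstract does not state: that the grid (16384,32) means 16384 delay bins by 32 Doppler bins, that one symbol occupies each grid cell, and that the symbol rate equals the bandwidth. The function names are hypothetical.

```python
def frame_duration_s(delay_bins: int, doppler_bins: int, bandwidth_hz: float) -> float:
    """Time to send one DD-domain frame, assuming symbol rate == bandwidth."""
    return delay_bins * doppler_bins / bandwidth_hz


def peak_rate_bps(bits_per_symbol: int, bandwidth_hz: float) -> float:
    """Uncoded upper bound if every grid cell carried a data symbol."""
    return bits_per_symbol * bandwidth_hz


def fraction_of_peak(reported_bps: float, peak_bps: float) -> float:
    """Reported throughput as a fraction of the uncoded upper bound."""
    return reported_bps / peak_bps


bandwidth = 245.76e6                   # Hz, from the abstract
peak = peak_rate_bps(4, bandwidth)     # 16QAM carries 4 bits per symbol
frame = frame_duration_s(16384, 32, bandwidth)
share = fraction_of_peak(906.52e6, peak)
```

Under these assumptions the uncoded bound is 983.04 Mbps, one frame lasts about 2.13 ms, and the reported 906.52 Mbps is roughly 92% of the bound. The abstract does not say what accounts for the remainder.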