DC · May 28

Silent Data Corruption Protection through Efficient Task Replication

arXiv:2605.29506 · 30.7 · h-index: 9
Predicted impact: top 52% in DC (last 90 days) · Originality: Incremental advance
AI Analysis

The paper addresses the challenge of protecting against silent data corruption (SDC) in dynamic task-parallel environments on clusters, which is critical for the reliability of large-scale supercomputing.

The paper introduces a replication scheme for asynchronous many-task runtimes that detects and corrects silent data corruption in nested fork-join programs. Its experiments suggest that the time needed to identify and reprocess affected tasks is negligible.

The trend of increasing cluster sizes of supercomputers leads to a growing susceptibility to Silent Data Corruption (SDC) that can invalidate program results. A common strategy for SDC protection is replication, where the computation is repeated, and the correct result is determined as the one that is the same in at least two different computations. Applying replication to Asynchronous Many-Task (AMT) runtimes on clusters is challenging due to dynamic task spawning and work stealing, which complicate the identification of replicated tasks. To address the challenge, this paper introduces a novel replication scheme that detects and corrects SDCs for nested fork-join programs. Briefly stated, our approach replicates the computation and records the task tree. Upon a mismatch in the final result, it traverses the tree top-down to identify all corrupted tasks that could have impacted the final result. Recovery is then performed by recomputing these tasks, while the results of correct child tasks are reused. We demonstrate our implementation within a variant of the Itoyori cluster AMT runtime. Our experimental results suggest that the time to identify and reprocess the affected tasks is negligible. The paper concludes by discussing the adaptability of our scheme to tasks that cooperate through futures.
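The recovery idea in the abstract (replicate, record the task tree, walk it top-down on a mismatch, recompute only the corrupted tasks) can be illustrated with a small single-process sketch. This is not the paper's Itoyori implementation. The `Task` class, `run`, `recover`, and the fault-injection parameter are hypothetical names made up for illustration. The sketch assumes that both replicas spawn identical task trees and that two replicas with equal results for a task are both correct.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Task:
    """One node of the recorded task tree (hypothetical structure)."""
    lo: int
    hi: int
    result: int = 0
    children: List["Task"] = field(default_factory=list)


def run(lo: int, hi: int, data: List[int],
        fault: Optional[Tuple[int, int]] = None) -> Task:
    """Nested fork-join range sum that records its task tree.

    `fault` names one task (by its range) whose result is corrupted,
    simulating a silent data corruption in that replica.
    """
    task = Task(lo, hi)
    if hi - lo <= 2:
        task.result = sum(data[lo:hi])            # leaf task
    else:
        mid = (lo + hi) // 2
        task.children = [run(lo, mid, data, fault),
                         run(mid, hi, data, fault)]
        task.result = sum(c.result for c in task.children)  # join
    if fault == (lo, hi):
        task.result += 1000                       # simulated SDC
    return task


def recover(a: Task, b: Task, data: List[int],
            recomputed: List[Tuple[int, int]]) -> int:
    """Top-down comparison of two replica trees.

    Subtrees whose results agree are reused without being visited.
    Tasks whose results differ are recomputed from corrected child results.
    """
    if a.result == b.result:
        return a.result                           # replicas agree: reuse
    if not a.children:
        recomputed.append((a.lo, a.hi))
        return sum(data[a.lo:a.hi])               # recompute corrupted leaf
    child_results = [recover(ca, cb, data, recomputed)
                     for ca, cb in zip(a.children, b.children)]
    recomputed.append((a.lo, a.hi))
    return sum(child_results)                     # redo only the join step


data = list(range(16))
replica_a = run(0, 16, data)                      # clean replica
replica_b = run(0, 16, data, fault=(4, 6))        # replica with one SDC
recomputed: List[Tuple[int, int]] = []
corrected = recover(replica_a, replica_b, data, recomputed)
print(corrected, recomputed)
```

In this sketch, only the tasks on the path from the root to the corrupted leaf are re-executed, and every sibling subtree on which the replicas agree is reused as a whole. That is the property the abstract attributes to the scheme when it says that the results of correct child tasks are reused.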

