Lukas Reitz

h-index: 4

5 papers

50 citations

5 Papers

7.0 DC May 28

Silent Data Corruption Protection through Efficient Task Replication

Mia Reitz, Claudia Fohry

The trend of increasing cluster sizes of supercomputers leads to a growing susceptibility to Silent Data Corruption (SDC) that can invalidate program results. A common strategy for SDC protection is replication, where the computation is repeated, and the correct result is determined as the one that is the same in at least two different computations. Applying replication to Asynchronous Many-Task (AMT) runtimes on clusters is challenging due to dynamic task spawning and work stealing, which complicate the identification of replicated tasks. To address the challenge, this paper introduces a novel replication scheme that detects and corrects SDCs for nested fork-join programs. Briefly stated, our approach replicates the computation and records the task tree. Upon a mismatch in the final result, it traverses the tree top-down to identify all corrupted tasks that could have impacted the final result. Recovery is then performed by recomputing these tasks, while the results of correct child tasks are reused. We demonstrate our implementation within a variant of the Itoyori cluster AMT runtime. Our experimental results suggest that the time to identify and reprocess the affected tasks is negligible. The paper concludes by discussing the adaptability of our scheme to tasks that cooperate through futures.
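The top-down identification step described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Task` class, the function name, and the assumption that the original and replica task trees have the same shape and store one comparable result per task are all hypothetical. A subtree whose results agree is reused as is; a task whose results disagree is queued for recomputation after its mismatching children.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    # Hypothetical record of one node in the recorded task tree.
    name: str
    result: int
    children: List["Task"] = field(default_factory=list)


def find_tasks_to_recompute(original: Task, replica: Task) -> List[str]:
    """Walk both trees top-down and list tasks whose results disagree.

    Children are listed before their parent, so the list can be
    processed front to back when recomputing.
    """
    if original.result == replica.result:
        return []  # results agree: the whole subtree is reused
    suspects: List[str] = []
    for a, b in zip(original.children, replica.children):
        suspects += find_tasks_to_recompute(a, b)
    return suspects + [original.name]


def build(r1_value: int) -> Task:
    # Small fork-join tree in which each parent sums its children.
    r1 = Task("r1", r1_value)
    r2 = Task("r2", 3)
    right = Task("right", r1.result + r2.result, [r1, r2])
    left = Task("left", 5)
    return Task("root", left.result + right.result, [left, right])


original = build(2)
replica = build(7)  # emulated corruption in task r1 of the replica
to_recompute = find_tasks_to_recompute(original, replica)
print(to_recompute)
```

In this example only `r1`, `right`, and `root` are flagged; `left` and `r2` agree in both runs and would be reused.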

9.4 DC Jul 1

Promise-Future Synchronization for Cluster Asynchronous Many-Task Runtimes via MPI One-Sided Communication

Mia Reitz

Asynchronous Many-Task (AMT) runtimes use futures as placeholders for values produced by other tasks. In the ItoyoriFBC AMT runtime, the existing future-only model binds each future to its producer at creation time and requires the number of tasks that read each future to be fixed at compile time. This prevents directly expressing algorithms that create dependencies dynamically. We extend ItoyoriFBC with an implementation of a promise-future model that lifts these limitations. Thereby, our ItoyoriFBC variant supports dynamic algorithms such as Hierarchical LU factorization (HLU). We experimentally evaluated our implementation using HLU on up to 16 nodes and observed near-ideal scaling with a 15.6x speedup.
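The difference between the two models can be illustrated with a small single-process sketch. It uses Python's `concurrent.futures.Future` only as a stand-in for a promise-future pair; it is not the ItoyoriFBC API, and the function names and values are made up. The future is created without a producer, the number of readers is chosen at run time, and any task holding the object may fulfil it later.

```python
import threading
from concurrent.futures import Future


def producer(promise: Future, value: int) -> None:
    # Whoever holds the promise side may fulfil it.
    promise.set_result(value)


def reader(fut: Future, out: list, idx: int) -> None:
    # result() blocks until the value has been set.
    out[idx] = fut.result() * (idx + 1)


fut = Future()   # created without being bound to a producer
n_readers = 3    # decided at run time, not fixed in advance
out = [None] * n_readers

readers = [threading.Thread(target=reader, args=(fut, out, i))
           for i in range(n_readers)]
for t in readers:
    t.start()

# The producer is chosen only after the readers already exist.
p = threading.Thread(target=producer, args=(fut, 14))
p.start()

for t in readers:
    t.join()
p.join()
print(out)
```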

3.6 DC Jun 29

Protecting Futures against Silent Data Corruption -- Efficient Task Replication for Dynamic Data Dependencies

Rüdiger Nather, Claudia Fohry, Mia Reitz

As the size of computational problems grows, so does the likelihood of Silent Data Corruptions (SDCs). A common defense is replication, where the computation is repeated and correct results are determined by majority voting. Asynchronous Many-Task (AMT) runtimes are generally well suited for this approach, since the inputs and outputs of task replicas can be compared, and the tasks can be recomputed if necessary. Most existing SDC protection schemes assume static tasks and dependencies. Dynamic settings are more challenging, especially in clusters, since the tasks/data must be tracked for the comparisons. This paper considers a particularly dynamic setting with task spawning at runtime, task communication through C++11-like promises/futures, conditional touches, and cluster-wide load balancing via work-first work stealing. We propose an approach that closely couples original and replica computations by cross-validating all outgoing effects when interacting with the runtime system. The approach selectively recomputes only the affected tasks. We implemented the approach in the ItoyoriFBC runtime system and conducted preliminary experiments with Fibonacci and emulated $\mathcal{H}$-matrix LU decomposition benchmarks. Results show that failure-free running times increase by a factor of less than two despite full replication, which is mainly due to improved opportunities for load balancing resulting from the higher number of tasks. The overhead for failure correction was about 0.5% of the overall running time per SDC.
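The generic majority-voting defense mentioned at the start of this abstract can be sketched in a few lines. This illustrates the common strategy only, not the cross-validation scheme the paper proposes; the function name is hypothetical, and returning `None` to signal "no majority, recompute" is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by more than half of the replicas.

    Returns None when no value has a strict majority, in which case
    the task would have to be recomputed.
    """
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


print(majority_vote([42, 42, 41]))  # two of three replicas agree
print(majority_vote([1, 2]))        # two replicas disagree
```

With only two replicas a mismatch can be detected but not resolved by voting, which is why a recomputation is needed in that case.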

7.8 DC Jun 11

Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Mia Reitz, Dorian Chenet, Jonas Posner

Asynchronous Many-Task (AMT) is a parallel programming model used in High Performance Computing (HPC). An AMT runtime can distribute fine-grained tasks across processing units called workers, through work stealing: when a worker has no tasks left to process, it tries to steal tasks from other workers. Workers are not restricted to a single compute node but can also be distributed across multiple nodes of an HPC cluster. Existing AMT runtimes assume a fully connected network with low, uniform latency and perform global work stealing, selecting another worker at random from all workers in the system. Space Edge Computing (SEC) uses constellations of satellites in Low Earth Orbit (LEO) as distributed compute clusters. Unlike HPC clusters, LEO satellites communicate through inter-satellite links that form a sparse mesh topology. Reaching a distant satellite requires multiple hops, each adding latency. As a step toward adapting AMT to SEC, this paper proposes a neighbor-only work stealing strategy in which workers steal exclusively from directly connected neighbors, avoiding multi-hop communication. An analytical model shows that restricting stealing this way yields a per-attempt latency advantage that grows with constellation size. Preliminary experiments on an HPC cluster with an emulated mesh over uniform low-latency links isolate the effect of victim selection: the neighbor-only strategy performs within ~2.2% of global stealing on both balanced and irregular workloads, indicating that restricting the victim set does not harm load balancing in this setting. Taken together, the experiments suggest that neighbor-only stealing can be on a par with global stealing, and the model suggests that neighbor-only stealing becomes preferable at scale.
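The intuition behind the per-attempt latency argument can be sketched numerically. The snippet below is not the paper's analytical model: it assumes a non-wrapping `rows x cols` mesh, shortest-path routing (hop count equals Manhattan distance), and a global victim drawn uniformly from all other workers. Under these assumptions a neighbor-only steal always costs one hop, while the mean hop count of a global steal grows with mesh size.

```python
from itertools import product


def neighbors(r, c, rows, cols):
    # Directly connected workers in a mesh without wrap-around links.
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in cand if 0 <= i < rows and 0 <= j < cols]


def mean_hops_global(rows, cols):
    # Average hop count from a thief to a uniformly random other worker.
    nodes = list(product(range(rows), range(cols)))
    total = sum(abs(a[0] - b[0]) + abs(a[1] - b[1])
                for a in nodes for b in nodes if a != b)
    return total / (len(nodes) * (len(nodes) - 1))


for n in (2, 3, 4, 8):
    print(n, round(mean_hops_global(n, n), 3))
```

If the real inter-satellite topology wraps around (a torus), the absolute numbers change, but the mean global distance still grows with the constellation size while a neighbor steal stays at one hop.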

7.1 DC Mar 13

Exploring Performance-Productivity Trade-offs in AMT Runtimes: A Task Bench Study of Itoyori, ItoyoriFBC, HPX, and MPI

Torben R. Lahnor, Mia Reitz, Jonas Posner et al.

Asynchronous Many-Task (AMT) runtimes offer a productive alternative to the Message Passing Interface (MPI). However, the diverse AMT landscape makes fair comparisons challenging. Task Bench, proposed by Slaughter et al., addresses this challenge through a parameterized framework for evaluating parallel programming systems. This work integrates two recent cluster AMTs, Itoyori and ItoyoriFBC, into Task Bench for comprehensive evaluation against MPI and HPX. Itoyori employs a Partitioned Global Address Space (PGAS) model with RDMA-based work stealing, while ItoyoriFBC extends it with future-based synchronization. We evaluate these systems in terms of both performance and programmer productivity. Performance is assessed across various configurations, including compute-bound kernels, weak scaling, and both imbalanced and communication-intensive patterns. Performance is quantified using application efficiency, i.e., the percentage of maximum performance achieved, and the Minimum Effective Task Granularity (METG), i.e., the smallest task duration before runtime overheads dominate. Programmer productivity is quantified using Lines of Code (LOC) and the Number of Library Constructs (NLC). Our results reveal distinct trade-offs. MPI achieves the highest efficiency for regular, communication-light workloads but requires verbose, low-level code. HPX maintains stable efficiency under load imbalance across varying node counts, yet ranks last in productivity metrics, demonstrating that AMTs do not inherently guarantee improved productivity over MPI. Itoyori achieves the highest efficiency in communication-intensive configurations while leading in programmer productivity. ItoyoriFBC exhibits slightly lower efficiency than Itoyori, though its future-based synchronization offers potential for expressing irregular workloads.
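The two performance metrics named in this abstract can be illustrated with a small sketch. The function names, the 50% default threshold, and all measurement values below are assumptions made for illustration; they are not results from the paper, and Task Bench's own tooling may compute METG differently (for example by interpolating between measurements).

```python
def efficiency(achieved, peak):
    # Application efficiency: fraction of the maximum performance achieved.
    return achieved / peak


def metg(measurements, threshold=0.5):
    """Smallest task duration whose efficiency still meets the threshold.

    measurements: iterable of (task_duration_seconds, efficiency) pairs.
    Returns None if no measurement reaches the threshold.
    """
    ok = [d for d, eff in measurements if eff >= threshold]
    return min(ok) if ok else None


# Hypothetical sweep: efficiency drops as tasks get shorter.
sweep = [(1e-3, 0.95), (1e-4, 0.80), (1e-5, 0.45), (1e-6, 0.05)]
print(metg(sweep))
```

A lower METG means the runtime tolerates finer-grained tasks before its overheads dominate.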