NA | DC | NA | Jul 28, 2016

Is the Multigrid Method Fault Tolerant? The Multilevel Case

arXiv:1607.08502
6 citations
Originality: Incremental advance
AI Analysis

For high-performance computing users relying on multigrid solvers, this work identifies a critical vulnerability and offers practical protection strategies, though the analysis is an incremental extension of prior two-grid results.

The paper analyzes the fault resilience of the multigrid method for solving linear systems, showing that it is not fault-tolerant unless the prolongation operation is protected. It provides strategies for fault detection and mitigation, and derives optimal parameter choices, demonstrating near-ideal performance in fault-prone environments.

Computing at the exascale level is expected to be affected by a significantly higher rate of faults, due to increased component counts as well as power considerations. Therefore, present-day numerical algorithms need to be reexamined to determine whether they are fault resilient, and which critical operations need to be safeguarded in order to obtain performance that is close to that of the ideal fault-free method. In a previous paper, a framework for the analysis of random stationary linear iterations was presented and applied to the two-grid method. The present work is concerned with the multigrid algorithm for the solution of linear systems of equations, which is widely used on high-performance computing systems. It is shown that the Fault-Prone Multigrid Method is not resilient unless the prolongation operation is protected. Strategies for fault detection and mitigation, as well as protection of the prolongation operation, are presented and tested, and a guideline for an optimal choice of parameters is devised.
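
To make the idea of a fault-prone multigrid cycle with a protected prolongation concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a 1D Poisson model problem, a weighted Jacobi smoother, and a fault model in which each entry of an operation's result is independently zeroed with probability `q`. The names `apply_faults`, `v_cycle`, `run`, and the `protected` flag are hypothetical and chosen only for illustration. As a simplification, the residual computation itself is not exposed to faults.

```python
import random

def apply_faults(v, q, rng):
    """Assumed fault model: zero each entry independently with probability q."""
    return [0.0 if rng.random() < q else x for x in v]

def residual(u, f, h):
    """r = f - A u for the 1D Laplacian A = tridiag(-1, 2, -1) / h^2."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r

def smooth(u, f, h, q, rng, omega=2.0 / 3.0):
    """One weighted Jacobi sweep; the computed update is exposed to faults."""
    r = residual(u, f, h)
    update = apply_faults([omega * h * h / 2.0 * ri for ri in r], q, rng)
    return [ui + di for ui, di in zip(u, update)]

def restrict(r, q, rng):
    """Full-weighting restriction to the coarse grid, exposed to faults."""
    nc = (len(r) - 1) // 2
    rc = [0.25 * r[2 * j] + 0.5 * r[2 * j + 1] + 0.25 * r[2 * j + 2]
          for j in range(nc)]
    return apply_faults(rc, q, rng)

def prolong(ec, q, rng, protected):
    """Linear interpolation to the fine grid; fault-free when protected."""
    nc = len(ec)
    e = [0.0] * (2 * nc + 1)
    for j in range(nc):
        e[2 * j] += 0.5 * ec[j]
        e[2 * j + 1] += ec[j]
        e[2 * j + 2] += 0.5 * ec[j]
    return e if protected else apply_faults(e, q, rng)

def v_cycle(u, f, h, q, rng, protected):
    """One V(1,1) cycle; the coarsest grid (one unknown) is solved exactly."""
    if len(u) == 1:
        return [f[0] * h * h / 2.0]
    u = smooth(u, f, h, q, rng)
    rc = restrict(residual(u, f, h), q, rng)
    ec = v_cycle([0.0] * len(rc), rc, 2.0 * h, q, rng, protected)
    u = [ui + ei for ui, ei in zip(u, prolong(ec, q, rng, protected))]
    return smooth(u, f, h, q, rng)

def run(q, protected, levels=6, cycles=10, seed=0):
    """Return the max-norm residual reduction after a number of V-cycles."""
    rng = random.Random(seed)
    n = 2 ** levels - 1          # interior grid points
    h = 1.0 / (n + 1)
    f = [1.0] * n
    u = [0.0] * n
    r0 = max(abs(x) for x in residual(u, f, h))
    for _ in range(cycles):
        u = v_cycle(u, f, h, q, rng, protected)
    return max(abs(x) for x in residual(u, f, h)) / r0
```

Calling `run(0.0, True)` gives the fault-free baseline. Comparing `run(q, False)` with `run(q, True)` for a small `q` and several seeds is one way to explore how protecting the prolongation affects the residual reduction under this assumed fault model. Results vary with the seed, so no particular numbers are claimed here.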

