Sep 16, 2018

Low synchronization GMRES algorithms

arXiv:1809.05805 (6 citations)
AI Analysis

For high-performance computing practitioners, this work reduces synchronization bottlenecks in Krylov solvers, enabling faster linear system solutions on exascale architectures.

The paper presents low-synchronization variants of GMRES that require only one or two global reductions per iteration, achieving up to 2x speedup on GPU-based systems while maintaining stability with O(ε)κ(A) accuracy.

Communication-avoiding and pipelined variants of Krylov solvers are critical for the scalability of linear system solvers on future exascale architectures. We present low synchronization variants of iterated classical (CGS) and modified Gram-Schmidt (MGS) algorithms that require one and two global reduction communication steps. Derivations of low synchronization iterated CGS algorithms are based on previous work by Ruhe. Our main contribution is to introduce a backward normalization lag into the compact $WY$ form of MGS resulting in a $\mathcal{O}(\varepsilon)\kappa(A)$ stable GMRES algorithm that requires only one global synchronization per iteration. The reduction operations are overlapped with computations and pipelined to optimize performance. Further improvements in performance are achieved by accelerating GMRES BLAS-2 operations on GPUs.
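
To make the synchronization count in the abstract concrete, the following is a minimal sketch, not the paper's algorithm, that tallies how many global reductions textbook MGS and single-pass CGS would need when each inner product or batch of inner products costs one reduction. The counting model and the names `mgs_step`, `cgs_step`, and `orthonormalize` are illustrative assumptions. The normalization lag, compact $WY$ form, and pipelining described above are not implemented here.

```python
import math

def dot(x, y):
    # Serial stand-in: in a distributed solver this would be a local
    # dot product followed by a global reduction (assumption of the sketch).
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    # Return y + alpha * x as a new list.
    return [yi + alpha * xi for xi, yi in zip(x, y)]

def mgs_step(Q, w):
    # Modified Gram-Schmidt: each projection uses the already-updated w,
    # so the inner products cannot be batched (one reduction each).
    syncs = 0
    for q in Q:
        h = dot(q, w)
        syncs += 1
        w = axpy(-h, q, w)
    norm = math.sqrt(dot(w, w))
    syncs += 1  # reduction for the norm
    return [wi / norm for wi in w], syncs

def cgs_step(Q, w):
    # Classical Gram-Schmidt: all inner products use the same w,
    # so they can be fused into a single batched reduction.
    syncs = 0
    if Q:
        h = [dot(q, w) for q in Q]
        syncs += 1
        for hj, q in zip(h, Q):
            w = axpy(-hj, q, w)
    norm = math.sqrt(dot(w, w))
    syncs += 1  # reduction for the norm
    return [wi / norm for wi in w], syncs

def orthonormalize(vectors, step):
    # Orthonormalize the vectors in order; return the basis and the
    # total number of (modelled) global reductions.
    Q, total = [], 0
    for v in vectors:
        q, syncs = step(Q, list(v))
        Q.append(q)
        total += syncs
    return Q, total

# Small, well-conditioned example set (hypothetical data).
V = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 1.0],
]

Q_mgs, syncs_mgs = orthonormalize(V, mgs_step)
Q_cgs, syncs_cgs = orthonormalize(V, cgs_step)
print("MGS reductions:", syncs_mgs)
print("CGS reductions:", syncs_cgs)
```

In this model the MGS count grows with the number of basis vectors already built, while CGS stays at two reductions per step. Single-pass CGS is known to lose orthogonality on ill-conditioned problems, which is why the abstract speaks of iterated CGS and of restructuring MGS rather than simply substituting one for the other.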

