NAJan 7, 2025
Communication-reduced Conjugate Gradient Variants for GPU-accelerated ClustersMassimo Bernaschi, Mauro G. Carrozzo, Alessandro Celestini et al.
Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in datadriven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
84.0NAMar 27
Scalable s-step Preconditioned Conjugate Gradient with Chebyshev Basis and Gauss-Seidel Gram SolvePasqua D'Ambra, Massimo Bernaschi, Mauro G. Carrozzo et al.
We present a variant of the s-step Preconditioned Conjugate Gradient (PCG) method that combines a Chebyshev-stabilized Krylov basis with a Forward Gauss-Seidel (FGS) iteration for the solution of the reduced Gram systems. In s-step Conjugate Gradient, multiple search directions are generated per outer iteration, reducing global synchronization costs but requiring the solution of small dense Gram systems whose conditioning is critical for stability. We analyze the structure of the Chebyshev Gram matrix and show that its moment-based representation is associated with favorable conditioning properties for moderate step sizes. Building on inexact Krylov theory and on the classical equivalence between FGS and Modified Gram-Schmidt (MGS), we provide a structural analysis and theoretical rationale supporting the use of a small number of FGS sweeps, while preserving the convergence behavior observed in practical regimes. Large-scale experiments on modern NVIDIA GPU architectures demonstrate that the proposed Chebyshev-stabilized, Gauss-Seidel-enhanced s-step PCG achieves convergence comparable to classical CG while reducing synchronization overhead, making it a stable and scalable alternative for current and next-generation accelerator systems.