25.0NAApr 30
Parallel matching-based AMG preconditioners for elliptic equations discretized by IgAPasqua D'Ambra, Fabio Durastante, Salvatore Filippone
Isogeometric analysis (IgA) offers enhanced approximation capabilities for the discretization of elliptic boundary-value problems, yet it results in large, sparse, and increasingly ill-conditioned linear systems due to higher interconnectivity among degrees of freedom. In particular, the discretization with tensor-product B-splines or NURBS of degree $p$ on a mesh with $n$ elements per parametric direction leads to symmetric positive-definite systems of the form $K\mathbf{u} = \mathbf{F}$, where the matrix bandwidth and condition number scale unfavorably with both $p$ and spatial dimension $d$. To address the computational challenges posed by such systems, especially in three-dimensional or high-order scenarios, Krylov subspace methods with specialized preconditioners become essential. This paper investigates the efficacy of algebraic multigrid (AMG) preconditioners tailored for IgA-based discretizations, with a focus on performance in modern high-performance computing (HPC) environments. Leveraging the Parallel Sparse Computation Toolkit (PSCToolkit), we explore distributed-memory and GPU-accelerated strategies for solving large-scale problems. The study assesses algorithmic efficiency and scalability across a range of benchmark tests. The results demonstrate that AMG preconditioners can achieve robust and scalable performance, confirming their potential as practical solvers for large IgA systems in engineering and scientific applications.
23.2DCApr 15
On the energy efficiency of sparse matrix computations on multi-GPU clustersMassimo Bernaschi, Alessandro Celestini, Pasqua D'Ambra et al.
We investigate the energy efficiency of a library designed for parallel computations with sparse matrices. The library leverages high-performance, energy-efficient Graphics Processing Unit (GPU) accelerators to enable large-scale scientific applications. Our primary development objective was to maximize parallel performance and scalability in solving sparse linear systems whose dimensions far exceed the memory capacity of a single node. To this end, we devised methods that expose a high degree of parallelism while optimizing algorithmic implementations for efficient multi-GPU usage. Previous work has already demonstrated the library's performance efficiency on large-scale systems comprising thousands of NVIDIA GPUs, achieving improvements over state-of-the-art solutions. In this paper, we extend those results by providing energy profiles that address the growing sustainability requirements of modern HPC platforms. We present our methodology and tools for accurate runtime energy measurements of the library's core components and discuss the findings. Our results confirm that optimizing GPU computations and minimizing data movement across memory and computing nodes reduces both time-to-solution and energy consumption. Moreover, we show that the library delivers substantial advantages over comparable software frameworks on standard benchmarks.
NAJan 7, 2025
Communication-reduced Conjugate Gradient Variants for GPU-accelerated ClustersMassimo Bernaschi, Mauro G. Carrozzo, Alessandro Celestini et al.
Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in datadriven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.
84.0NAMar 27
Scalable s-step Preconditioned Conjugate Gradient with Chebyshev Basis and Gauss-Seidel Gram SolvePasqua D'Ambra, Massimo Bernaschi, Mauro G. Carrozzo et al.
We present a variant of the s-step Preconditioned Conjugate Gradient (PCG) method that combines a Chebyshev-stabilized Krylov basis with a Forward Gauss-Seidel (FGS) iteration for the solution of the reduced Gram systems. In s-step Conjugate Gradient, multiple search directions are generated per outer iteration, reducing global synchronization costs but requiring the solution of small dense Gram systems whose conditioning is critical for stability. We analyze the structure of the Chebyshev Gram matrix and show that its moment-based representation is associated with favorable conditioning properties for moderate step sizes. Building on inexact Krylov theory and on the classical equivalence between FGS and Modified Gram-Schmidt (MGS), we provide a structural analysis and theoretical rationale supporting the use of a small number of FGS sweeps, while preserving the convergence behavior observed in practical regimes. Large-scale experiments on modern NVIDIA GPU architectures demonstrate that the proposed Chebyshev-stabilized, Gauss-Seidel-enhanced s-step PCG achieves convergence comparable to classical CG while reducing synchronization overhead, making it a stable and scalable alternative for current and next-generation accelerator systems.
NAOct 9, 2018Code
AMG based on compatible weighted matching for GPUsMassimo Bernaschi, Pasqua D'Ambra, Dario Pasquini
We describe main issues and design principles of an efficient implementation, tailored to recent generations of Nvidia Graphics Processing Units (GPUs), of an Algebraic Multigrid (AMG) preconditioner previously proposed by one of the authors and already available in the open-source package BootCMatch: Bootstrap algebraic multigrid based on Compatible weighted Matching for standard CPU. The AMG method relies on a new approach for coarsening sparse symmetric positive definite (spd) matrices, named "coarsening based on compatible weighted matching". It exploits maximum weight matching in the adjacency graph of the sparse matrix, driven by the principle of compatible relaxation, providing a suitable aggregation of unknowns which goes beyond the limits of the usual heuristics applied in the current methods. We adopt an approximate solution of the maximum weight matching problem, based on a recently proposed parallel algorithm, referred as the Suitor algorithm, and show that it allow us to obtain good quality coarse matrices for our AMG on GPUs. We exploit inherent parallelism of modern GPUs in all the kernels involving sparse matrix computations both for the setup of the preconditioner and for its application in a Krylov solver, outperforming preconditioners available in Nvidia AmgX library. We report results about a large set of linear systems arising from discretization of scalar and vector partial differential equations (PDEs).
LGSep 20, 2021
Extending Bootstrap AMG for Clustering of Attributed GraphsPasqua D'Ambra, Panayot S. Vassilevski, Luisa Cutillo
In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means algorithm, using a new vector-valued distance in the embedding space. Main novelty of our method, which can be classified as an early fusion method, i.e., a method in which additional information on vertices are fused to the structure information before applying clustering, is the interpretation of attributes as new realizations of graph vertices, which can be dealt with as coordinate vectors in a related Euclidean space. This allows us to extend a scalable generalized spectral clustering procedure which substitutes graph Laplacian eigenvectors with some vectors, named algebraically smooth vectors, obtained by a linear-time complexity Algebraic MultiGrid (AMG) method. We discuss the performance of our proposed clustering method by comparison with recent literature approaches and public available results. Extensive experiments on different types of synthetic datasets and real-world attributed graphs show that our new algorithm, embedding attributes information in the clustering, outperforms structure-only-based methods, when the attributed network has an ambiguous structure. Furthermore, our new method largely outperforms the method which originally proposed the graph augmentation, showing that our embedding strategy and vector-valued distance are very effective in taking advantages from the augmented-graph representation.