NA · Sep 5, 2023
Algebraic Temporal Blocking for Sparse Iterative Solvers on Multi-Core CPUs
Christie Alappat, Jonas Thies, Georg Hager et al.
Sparse linear iterative solvers are essential for many large-scale simulations. Much of the runtime of these solvers is often spent in the implicit evaluation of matrix polynomials via a sequence of sparse matrix-vector products. A variety of approaches has been proposed to make these polynomial evaluations explicit (i.e., fix the coefficients), e.g., polynomial preconditioners or s-step Krylov methods. Furthermore, it is nowadays popular practice to approximate triangular solves by a matrix polynomial to increase parallelism. Such algorithms allow the polynomial to be evaluated using a so-called matrix power kernel (MPK), which computes the product between a power of a sparse matrix A and a dense vector x, or a related operation. Recently we have shown that, using the level-based formulation of sparse matrix-vector multiplications in the Recursive Algebraic Coloring Engine (RACE) framework, we can perform temporal cache blocking of MPK to increase its performance. In this work, we demonstrate the application of this cache-blocking optimization in sparse iterative solvers. By integrating the RACE library into the Trilinos framework, we demonstrate the speedups achieved in (preconditioned) s-step GMRES, polynomial preconditioners, and algebraic multigrid (AMG). For MPK-dominated algorithms we achieve speedups of up to 3x on modern multi-core compute nodes. For algorithms with moderate contributions from subspace orthogonalization, the gain reduces significantly, which is often caused by the insufficient quality of the orthogonalization routines. Finally, we showcase the application of RACE-accelerated solvers in a real-world wind turbine simulation (Nalu-Wind) and highlight the new opportunities and perspectives opened up by RACE as a cache-blocking technique for MPK-enabled sparse solvers.
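As a rough illustration of what a matrix power kernel computes, the following minimal sketch applies back-to-back sparse matrix-vector products to a small matrix in CSR format. It is a naive baseline only: it does not implement the level-based temporal cache blocking of RACE, and the function names (`csr_spmv`, `matrix_power_kernel`) and the 3x3 example matrix are assumptions made for this example, not part of RACE or Trilinos.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Return y = A @ x for a sparse matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def matrix_power_kernel(row_ptr, col_idx, values, x, power):
    """Return [x, A x, A^2 x, ..., A^power x] using one SpMV per power.

    Naive version: every SpMV streams the whole matrix from memory again,
    which is the traffic that temporal cache blocking tries to reduce.
    """
    vectors = [list(x)]
    for _ in range(power):
        vectors.append(csr_spmv(row_ptr, col_idx, values, vectors[-1]))
    return vectors


# Hypothetical example: 3x3 tridiagonal matrix [[2,-1,0],[-1,2,-1],[0,-1,2]].
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
x = [1.0, 1.0, 1.0]

mpk_result = matrix_power_kernel(row_ptr, col_idx, values, x, 2)
print(mpk_result[2])  # A^2 x
```

Because each power depends only on the previous vector, a blocked implementation can reuse matrix rows still in cache across several powers; the sketch above deliberately leaves that optimization out.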
NA · Jun 9, 2010
A robust two-level incomplete factorization for (Navier-)Stokes saddle point matrices
Fred Wubs, Jonas Thies
We present a new hybrid direct/iterative approach to the solution of a special class of saddle point matrices arising from the discretization of the steady incompressible Navier-Stokes equations on an Arakawa C-grid. The two-level method introduced here has the following properties: (i) it is very robust, even close to the point where the solution becomes unstable; (ii) a single parameter controls fill and convergence, making the method straightforward to use; (iii) the convergence rate is independent of the number of unknowns; (iv) it can be implemented on distributed memory machines in a natural way; (v) the matrix on the second level has the same structure and numerical properties as the original problem, so the method can be applied recursively; (vi) the iteration takes place in the divergence-free space, so the method qualifies as a 'constraint preconditioner'; (vii) the approach can also be applied to Poisson problems. This work is also relevant for problems in which similar saddle point matrices occur, for instance when simulating electrical networks, where one has to satisfy Kirchhoff's conservation law for currents.
MS · Mar 21
Implementation of QR factorization of tall and very skinny matrices on current GPUs
Jonas Thies, Melven Röhrig-Zöllner
We consider the problem of computing a QR (or QZ) decomposition of a real, dense, tall and very skinny matrix. That is, the number of columns is tiny compared to the number of rows, rendering most computations completely or partially memory-bandwidth limited. The paper focuses on recent NVIDIA GPGPUs still supporting 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic algorithms: methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), based on Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs). We compare a straightforward implementation of Gramian-based methods and a more sophisticated TSQR implementation in terms of performance achieved, time-to-solution, and implementation complexity. By performance modelling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory/compute-bound) regime, and that TSQR is competitive in terms of time-to-solution, but at the cost of an investment in low-level code optimization.
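To make the Gram-matrix approach concrete, here is a minimal CPU sketch of Cholesky-QR2 in plain Python: form the small Gram matrix, take its Cholesky factor, solve for Q, then repeat once to improve orthogonality. It is an illustration under stated assumptions, not the GPU implementation described in the abstract; all function names and the 4x2 test matrix are hypothetical, and the input is assumed to have full column rank and to be reasonably well conditioned.

```python
import math


def gram(A):
    """Return the small n-by-n Gram matrix G = A^T A of a tall m-by-n matrix."""
    n = len(A[0])
    G = [[0.0] * n for _ in range(n)]
    for row in A:  # a single pass over the tall matrix
        for i in range(n):
            for j in range(n):
                G[i][j] += row[i] * row[j]
    return G


def cholesky_upper(G):
    """Return upper-triangular R with R^T R = G (G assumed positive definite)."""
    n = len(G)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(G[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            s = sum(R[k][i] * R[k][j] for k in range(i))
            R[i][j] = (G[i][j] - s) / R[i][i]
    return R


def apply_inverse_upper(A, R):
    """Return Q = A R^{-1}, solving q R = a for each row a of A."""
    n = len(R)
    Q = []
    for row in A:
        q = [0.0] * n
        for j in range(n):
            s = sum(q[k] * R[k][j] for k in range(j))
            q[j] = (row[j] - s) / R[j][j]
        Q.append(q)
    return Q


def cholesky_qr2(A):
    """Two passes of Cholesky-QR; returns (Q, R) with A = Q R and R = R2 R1."""
    R1 = cholesky_upper(gram(A))
    Q1 = apply_inverse_upper(A, R1)
    R2 = cholesky_upper(gram(Q1))  # second pass repairs lost orthogonality
    Q = apply_inverse_upper(Q1, R2)
    n = len(R1)
    R = [[sum(R2[i][k] * R1[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    return Q, R


# Hypothetical tall-skinny example: 4 rows, 2 columns.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
Q, R = cholesky_qr2(A)
QtQ = gram(Q)  # should be close to the 2x2 identity
```

Each pass reads the tall matrix only a few times and does all factorization work on the tiny n-by-n Gram matrix, which is why such methods are simple to implement in a memory-bound setting. Skipping the final `apply_inverse_upper` call and returning only `R` would correspond to the "Q-less" idea mentioned above.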
SE · Dec 10, 2021
(R)SE challenges in HPC
Jonas Thies, Melven Röhrig-Zöllner, Achim Basermann
We discuss some specific software engineering challenges in the field of high-performance computing, and argue that the slow adoption of SE tools and techniques is at least in part caused by the fact that these do not address the HPC challenges 'out-of-the-box'. By giving some examples of solutions for designing, testing and benchmarking HPC software, we intend to bring software engineering and HPC closer together.