MS | NA | NA | Mar 21

Implementation of QR factorization of tall and very skinny matrices on current GPUs

arXiv:2603.20889 | 24.3 | h-index: 12
AI Analysis

This paper addresses a memory-bound computational bottleneck for researchers and engineers working with large-scale linear algebra on GPUs. The contribution is incremental, since it builds on existing algorithms.

The paper tackles the problem of computing QR factorization for tall and very skinny matrices on GPUs, where memory bandwidth limits performance, and finds that specialized methods like TSQR are competitive in time-to-solution but require significant low-level optimization.

We consider the problem of computing a QR (or QZ) decomposition of a real, dense, tall and very skinny matrix. That is, the number of columns is tiny compared to the number of rows, rendering most computations completely or partially memory-bandwidth limited. The paper focuses on recent NVIDIA GPGPUs that still support 64-bit floating-point arithmetic, but the findings carry over to AMD GPUs as well. We discuss two basic classes of algorithms: methods based on the normal equations (Gram matrix), in particular Cholesky-QR2 and SVQB, and the "tall-skinny QR" (TSQR), based on Householder transformations in a tree-reduction scheme. We propose two primary optimization techniques: avoiding the write-back of the Q factor ("Q-less QR"), and exploiting fast local memory (shared memory on GPUs). We compare a straightforward implementation of Gramian-based methods and a more sophisticated TSQR implementation in terms of performance achieved, time-to-solution, and implementation complexity. Through performance modelling and numerical experiments with our own code and a vendor-optimized library routine, we demonstrate the crucial need for specialized methods and implementations in this memory-bound to transitional (memory/compute-bound) regime, and that TSQR is competitive in terms of time-to-solution, but at the cost of an investment in low-level code optimization.
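
To make the two algorithm families named in the abstract concrete, here is a minimal CPU sketch in NumPy. It is not the authors' GPU code and says nothing about their shared-memory optimizations. The function names (`cholesky_qr`, `cholesky_qr2`, `tsqr_r`) and the `block_rows` parameter are hypothetical, chosen only for illustration. The TSQR variant returns only the triangular factor, in the spirit of the "Q-less QR" idea, and uses a simple pairwise reduction as one possible tree shape.

```python
import numpy as np


def cholesky_qr(A):
    """One Cholesky-QR pass: factor the Gram matrix G = A^T A = R^T R."""
    G = A.T @ A                       # small n x n Gram matrix
    R = np.linalg.cholesky(G).T       # upper-triangular factor of G
    # Q = A R^{-1}, computed as the transpose of the solution of R^T X = A^T.
    # A general solver is used here for brevity; a triangular solve would
    # be the natural choice in an optimized implementation.
    Q = np.linalg.solve(R.T, A.T).T
    return Q, R


def cholesky_qr2(A):
    """Cholesky-QR applied twice; the second pass improves orthogonality of Q."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                 # A = Q (R2 R1)


def tsqr_r(A, block_rows=1024):
    """Q-less TSQR sketch: return only R, using a pairwise tree reduction.

    block_rows is an illustrative assumption, not a tuned value.
    """
    m = A.shape[0]
    # Leaf level: Householder QR of each row block, keeping only R.
    blocks = [np.linalg.qr(A[i:i + block_rows], mode="r")
              for i in range(0, m, block_rows)]
    # Reduction: stack neighbouring R factors and factor again until one remains.
    while len(blocks) > 1:
        blocks = [np.linalg.qr(np.vstack(blocks[i:i + 2]), mode="r")
                  for i in range(0, len(blocks), 2)]
    return blocks[0]
```

The R factors from the two routes may differ in the signs of their rows, since Cholesky yields a positive diagonal while Householder QR does not guarantee one. Comparing `R.T @ R` against `A.T @ A` is a sign-independent check. Note also that the Gram-matrix route squares the condition number of `A`, so the Cholesky step can fail for ill-conditioned inputs.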
