NAAug 19, 2008
Communication-optimal parallel and sequential QR and LU factorizationsJames Demmel, Laura Grigori, Mark Hoemmen et al.
We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. We prove optimality by extending known lower bounds on communication bandwidth for sequential and parallel matrix multiplication to provide latency lower bounds, and show these bounds apply to the LU and QR decompositions. We not only show that our QR algorithms attain these lower bounds (up to polylogarithmic factors), but that existing LAPACK and ScaLAPACK algorithms perform asymptotically more communication. We also point out recent LU algorithms in the literature that attain at least some of these lower bounds.
NAJul 24, 2007
Parallel Tiled QR Factorization for Multicore ArchitecturesAlfredo Buttari, Julien Langou, Jakub Kurzak et al.
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.
MSFeb 22, 2010
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore ArchitecturesEmmanuel Agullo, Henricus Bouwmeester, Jack Dongarra et al.
The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this extended abstract, we attack the problem of the computation of the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However we demonstrate that non trivial compiler techniques (array renaming, loop reversal and pipelining) need then to be applied to further increase the parallelism of our application. We present preliminary experimental results.
NASep 14, 2008
Implementing Communication-Optimal Parallel and Sequential QR FactorizationsJames Demmel, Laura Grigori, Mark Hoemmen et al.
We present parallel and sequential dense QR factorization algorithms for tall and skinny matrices and general rectangular matrices that both minimize communication, and are as stable as Householder QR. The sequential and parallel algorithms for tall and skinny matrices lead to significant speedups in practice over some of the existing algorithms, including LAPACK and ScaLAPACK, for example up to 6.7x over ScaLAPACK. The parallel algorithm for general rectangular matrices is estimated to show significant speedups over ScaLAPACK, up to 22x over ScaLAPACK.
NASep 18, 2008
The Problem with the Linpack Benchmark Matrix GeneratorJack Dongarra, Julien Langou
We characterize the matrix sizes for which the Linpack Benchmark matrix generator constructs a matrix with identical columns.
NAApr 13, 2018
Fast Parallel Randomized QR with Column Pivoting Algorithms for Reliable Low-rank Matrix ApproximationsJianwei Xiao, Ming Gu, Julien Langou
Factorizing large matrices by QR with column pivoting (QRCP) is substantially more expensive than QR without pivoting, owing to communication costs required for pivoting decisions. In contrast, randomized QRCP (RQRCP) algorithms have proven themselves empirically to be highly competitive with high-performance implementations of QR in processing time, on uniprocessor and shared memory machines, and as reliable as QRCP in pivot quality. We show that RQRCP algorithms can be as reliable as QRCP with failure probabilities exponentially decaying in oversampling size. We also analyze efficiency differences among different RQRCP algorithms. More importantly, we develop distributed memory implementations of RQRCP that are significantly better than QRCP implementations in ScaLAPACK. As a further development, we introduce the concept of and develop algorithms for computing spectrum-revealing QR factorizations for low-rank matrix approximations, and demonstrate their effectiveness against leading low-rank approximation methods in both theoretical and numerical reliability and efficiency.
NAAug 29, 2008
Communication-optimal parallel and sequential QR and LU factorizations: theory and practiceJames Demmel, Laura Grigori, Mark Hoemmen et al.
We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m-by-n matrices in a one-dimensional (1-D) block cyclic row layout, and is optimized for m >> n. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization.
NAJun 19, 2008
The cycle-convergence of restarted GMRES for normal matrices is sublinearEugene Vecharynski, Julien Langou
We prove that the cycle-convergence of the restarted GMRES applied to a system of linear equations with a normal coefficient matrix is sublinear.
NAOct 3, 2007
Computing the Conditioning of the Components of a Linear Least Squares SolutionMarc Baboulin, Jack Dongarra, Serge Gratton et al.
In this paper, we address the accuracy of the results for the overdetermined full rank linear least squares problem. We recall theoretical results obtained in Arioli, Baboulin and Gratton, SIMAX 29(2):413--433, 2007, on conditioning of the least squares solution and the components of the solution when the matrix perturbations are measured in Frobenius or spectral norms. Then we define computable estimates for these condition numbers and we interpret them in terms of statistical quantities. In particular, we show that, in the classical linear statistical model, the ratio of the variance of one component of the solution by the variance of the right-hand side is exactly the condition number of this solution component when perturbations on the right-hand side are considered. We also provide fragment codes using LAPACK routines to compute the variance-covariance matrix and the least squares conditioning and we give the corresponding computational cost. Finally we present a small historical numerical example that was used by Laplace in Theorie Analytique des Probabilites, 1820, for computing the mass of Jupiter and experiments from the space industry with real physical data.
NAFeb 23, 2010
Computing the R of the QR factorization of tall and skinny matrices using MPI_ReduceJulien Langou
A QR factorization of a tall and skinny matrix with n columns can be represented as a reduction. The operation used along the reduction tree has in input two n-by-n upper triangular matrices and in output an n-by-n upper triangular matrix which is defined as the R factor of the two input matrices stacked the one on top of the other. This operation is binary, associative, and commutative. We can therefore leverage the MPI library capabilities by using user-defined MPI operations and MPI_Reduce to perform this reduction. The resulting code is compact and portable. In this context, the user relies on the MPI library to select a reduction tree appropriate for the underlying architecture.
NASep 27, 2024
Probabilistic Analysis of Least Squares, Orthogonal Projection, and QR Factorization Algorithms Subject to Gaussian NoiseAli Lotfi, Julien Langou, Mohammad Meysami
In this paper, we extend the work of Liesen et al. (2002), which analyzes how the condition number of an orthonormal matrix Q changes when a column is added ([Q, c]), particularly focusing on the perpendicularity of c to the span of Q. Their result, presented in Theorem 2.3 of Liesen et al. (2002), assumes exact arithmetic and orthonormality of Q, which is a strong assumption when applying these results to numerical methods such as QR factorization algorithms. In our work, we address this gap by deriving bounds on the condition number increase for a matrix B without assuming perfect orthonormality, even when a column is not perfectly orthogonal to the span of B. This framework allows us to analyze QR factorization methods where orthogonalization is imperfect and subject to Gaussian noise. We also provide results on the performance of orthogonal projection and least squares under Gaussian noise, further supporting the development of this theory.
NASep 16, 2018
Low synchronization GMRES algorithmsKasia Swirydowicz, Julien Langou, Shreyas Ananthan et al.
Communication-avoiding and pipelined variants of Krylov solvers are critical for the scalability of linear system solvers on future exascale architectures. We present low synchronization variants of iterated classical (CGS) and modified Gram-Schmidt (MGS) algorithms that require one and two global reduction communication steps. Derivations of low synchronization iterated CGS algorithms are based on previous work by Ruhe. Our main contribution is to introduce a backward normalization lag into the compact $WY$ form of MGS resulting in a ${\cal O}(\eps)κ(A)$ stable GMRES algorithm that requires only one global synchronization per iteration. The reduction operations are overlapped with computations and pipelined to optimize performance. Further improvements in performance are achieved by accelerating GMRES BLAS-2 operations on GPUs.
DCDec 14, 2009
QR Factorization of Tall and Skinny Matrices in a Grid Computing EnvironmentEmmanuel Agullo, Camille Coti, Jack Dongarra et al.
Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).
NAJul 27, 2009
Translation and modern interpretation of Laplace's Théorie Analytique des Probabilités, pages 505-512, 516-520Julien Langou
The text of Laplace, \textit{Sur l'application du calcul des probabilités à la philosophie naturelle,} (Théorie Analytique des Probabilités. Troisième Édition. Premier Supplément), 1820, is quoted in the context of the Gram-Schmidt algorithm. We provide an English translation of Laplace's manuscript (originally in French) and interpret the algorithms of Laplace in a contemporary context. The two algorithms given by Laplace computes the mean and the variance of two components of the solution of a linear statistical model. The first algorithm can be interpreted as {\em reverse square-root-free modified Gram-Schmidt by row} algorithm on the regression matrix. The second algorithm can be interpreted as the {\em reverse square-root-free Cholesky} algorithm.
NAJul 21, 2009
Any decreasing cycle-convergence curve is possible for restarted GMRESEugene Vecharynski, Julien Langou
Given a matrix order $n$, a restart parameter $m$ ($m < n$), a decreasing positive sequence $f(0) > f(1) > ... > f(q) \geq 0$, where $q < n/m$, it is shown that there exits an $n$-by-$n$ matrix $A$ and a vector $r_0$ with $\|r_0\|=f(0)$ such that $\|r_k\|=f(k)$, $k=1,...,q$, where $r_k$ is the residual at cycle $k$ of restarted GMRES with restart parameter $m$ applied to the linear system $Ax=b$, with initial residual $r_0=b-Ax_0$. Moreover, the matrix $A$ can be chosen to have any desired eigenvalues. We can also construct arbitrary cases of stagnation; namely, when $f(0) > f(1) > ... > f(i) = f(i+1) \geq 0 $ for any $i <q $. The restart parameter can be fixed or variable.
NAAug 13, 2008
A note on the error analysis of classical Gram-SchmidtAlicja Smoktunowicz, Jesse L. Barlow, Julien Langou
An error analysis result is given for classical Gram--Schmidt factorization of a full rank matrix $A$ into $A=QR$ where $Q$ is left orthogonal (has orthonormal columns) and $R$ is upper triangular. The work presented here shows that the computed $R$ satisfies $\normal{R}=\normal{A}+E$ where $E$ is an appropriately small backward error, but only if the diagonals of $R$ are computed in a manner similar to Cholesky factorization of the normal equations matrix. A similar result is stated in [Giraud at al, Numer. Math. 101(1):87--100,2005]. However, for that result to hold, the diagonals of $R$ must be computed in the manner recommended in this work.