NA · Oct 5, 2008
Condition Numbers of Gaussian Random Matrices
Zizhong Chen, Jack Dongarra
Let $G_{m \times n}$ be an $m \times n$ real random matrix whose elements are independent and identically distributed standard normal random variables, and let $\kappa_2(G_{m \times n})$ be the 2-norm condition number of $G_{m \times n}$. We prove that, for any $m \geq 2$, $n \geq 2$ and $x \geq |n-m|+1$, $\kappa_2(G_{m \times n})$ satisfies $\frac{1}{\sqrt{2\pi}} \left(\frac{c}{x}\right)^{|n-m|+1} < P\left(\frac{\kappa_2(G_{m \times n})}{n/(|n-m|+1)} > x\right) < \frac{1}{\sqrt{2\pi}} \left(\frac{C}{x}\right)^{|n-m|+1}$, where $0.245 \leq c \leq 2.000$ and $5.013 \leq C \leq 6.414$ are universal positive constants independent of $m$, $n$ and $x$. Moreover, for any $m \geq 2$ and $n \geq 2$, $E(\log \kappa_2(G_{m \times n})) < \log \frac{n}{|n-m|+1} + 2.258$. A similar pair of results for complex Gaussian random matrices is also established.
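The bound on $E(\log \kappa_2)$ can be checked empirically. The following is a minimal Monte Carlo sketch, not taken from the paper: the function names, trial count, and seed are illustrative assumptions, and the condition number is taken as the ratio of the largest to the smallest of the $\min(m, n)$ singular values.

```python
import numpy as np

def log_cond_bound(m, n):
    """Upper bound on E[log kappa_2(G)] as quoted in the abstract."""
    return np.log(n / (abs(n - m) + 1)) + 2.258

def mean_log_cond(m, n, trials=1000, seed=0):
    """Monte Carlo estimate of E[log kappa_2(G)] for an m x n Gaussian matrix.

    `trials` and `seed` are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    logs = np.empty(trials)
    for t in range(trials):
        G = rng.standard_normal((m, n))
        s = np.linalg.svd(G, compute_uv=False)  # singular values, descending
        logs[t] = np.log(s[0] / s[-1])          # log of sigma_max / sigma_min
    return logs.mean()

if __name__ == "__main__":
    for m, n in [(2, 2), (10, 10), (20, 10), (10, 40)]:
        est = mean_log_cond(m, n)
        print(f"m={m:3d} n={n:3d}  estimate={est:.3f}  bound={log_cond_bound(m, n):.3f}")
```

A sample mean carries sampling error, so an estimate below the bound is consistent with the theorem rather than a proof of it.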
NA · Jul 24, 2007
Parallel Tiled QR Factorization for Multicore Architectures
Alfredo Buttari, Julien Langou, Jakub Kurzak et al.
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features of these new processors. Fine-grained parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can only be exploited at the level of the BLAS operations.
MS · Feb 22, 2010
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra et al.
The algorithms in the current sequential numerical linear algebra libraries (e.g., LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this extended abstract, we attack the problem of the computation of the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However, we demonstrate that nontrivial compiler techniques (array renaming, loop reversal and pipelining) must then be applied to further increase the parallelism of our application. We present preliminary experimental results.
NA · Jul 7, 2012
A hybrid Hermitian general eigenvalue solver
Raffaele Solcà, Thomas C. Schulthess, Azzam Haidar et al.
The adoption of hybrid GPU-CPU nodes in traditional supercomputing platforms opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium-sized Hermitian generalized eigenvalue problems must be solved many times. The small size of the problems limits the scalability on a distributed memory system, hence they can benefit from the massive computational performance concentrated on a single-node hybrid GPU-CPU system. However, new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multi/many-core CPUs as well, are required. Addressing these demands, we implemented a novel Hermitian general eigensolver algorithm. This algorithm is based on a standard eigenvalue solver, and existing algorithms can be used. The resulting eigensolvers are state-of-the-art in HPC, significantly outperforming existing libraries. We analyze their performance impact on applications of interest, when different fractions of eigenvectors are needed by the host electronic structure code.
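One common way to build a generalized Hermitian eigensolver on top of a standard one is to reduce $Ax = \lambda Bx$ to standard form with a Cholesky factorization $B = LL^H$. The sketch below illustrates that textbook reduction on the CPU with NumPy; it is an assumption about the general approach, not the authors' hybrid GPU-CPU implementation, and the function name `hermitian_gen_eig` is hypothetical.

```python
import numpy as np

def hermitian_gen_eig(A, B):
    """Solve A x = lambda B x for Hermitian A and Hermitian positive definite B.

    Textbook reduction to a standard problem (illustrative sketch only):
      1. B = L L^H                (Cholesky)
      2. C = L^{-1} A L^{-H}      (standard Hermitian matrix)
      3. C y = lambda y           (standard eigensolver)
      4. x = L^{-H} y             (back-transformation)
    """
    L = np.linalg.cholesky(B)                 # lower triangular factor
    Y = np.linalg.solve(L, A)                 # L^{-1} A
    C = np.linalg.solve(L, Y.conj().T)        # L^{-1} A L^{-H}, since A is Hermitian
    C = (C + C.conj().T) / 2                  # remove rounding asymmetry
    w, V = np.linalg.eigh(C)                  # standard Hermitian eigenproblem
    X = np.linalg.solve(L.conj().T, V)        # eigenvectors of the original problem
    return w, X

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 5
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    N = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    A = M + M.conj().T                        # Hermitian
    B = N @ N.conj().T + n * np.eye(n)        # Hermitian positive definite
    w, X = hermitian_gen_eig(A, B)
    print("residual:", np.linalg.norm(A @ X - (B @ X) * w))
```

The general `np.linalg.solve` calls ignore the triangular structure of `L`; a production code would use triangular solves instead.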
NA · Sep 18, 2008
The Problem with the Linpack Benchmark Matrix Generator
Jack Dongarra, Julien Langou
We characterize the matrix sizes for which the Linpack Benchmark matrix generator constructs a matrix with identical columns.
NA · Oct 3, 2007
Computing the Conditioning of the Components of a Linear Least Squares Solution
Marc Baboulin, Jack Dongarra, Serge Gratton et al.
In this paper, we address the accuracy of the results for the overdetermined full rank linear least squares problem. We recall theoretical results obtained in Arioli, Baboulin and Gratton, SIMAX 29(2):413--433, 2007, on conditioning of the least squares solution and the components of the solution when the matrix perturbations are measured in Frobenius or spectral norms. Then we define computable estimates for these condition numbers and we interpret them in terms of statistical quantities. In particular, we show that, in the classical linear statistical model, the ratio of the variance of one component of the solution to the variance of the right-hand side is exactly the condition number of this solution component when perturbations on the right-hand side are considered. We also provide code fragments using LAPACK routines to compute the variance-covariance matrix and the least squares conditioning, and we give the corresponding computational cost. Finally, we present a small historical numerical example that was used by Laplace in Théorie Analytique des Probabilités, 1820, for computing the mass of Jupiter, and experiments from the space industry with real physical data.
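The statistical quantities mentioned above can be illustrated with a short NumPy sketch. It assumes the classical model $b = Ax + e$ with $\mathrm{Cov}(e) = \sigma^2 I$, in which the covariance of the least squares solution is $\sigma^2 (A^T A)^{-1}$ and the $i$-th diagonal entry of $(A^T A)^{-1}$ equals $\|e_i^T A^{\dagger}\|_2^2$. This is not the paper's LAPACK code; the function name `ls_covariance` and the QR-based formulation are illustrative choices.

```python
import numpy as np

def ls_covariance(A, b):
    """Least squares solution, noise variance estimate, and (A^T A)^{-1}.

    Illustrative sketch for a full column rank A with more rows than columns.
    Uses A = Q R, so that (A^T A)^{-1} = R^{-1} R^{-T}.
    """
    m, n = A.shape
    Q, R = np.linalg.qr(A)                   # reduced QR, R is n x n
    x = np.linalg.solve(R, Q.T @ b)          # least squares solution
    r = b - A @ x                            # residual
    sigma2 = (r @ r) / (m - n)               # unbiased estimate of sigma^2
    R_inv = np.linalg.inv(R)
    unscaled_cov = R_inv @ R_inv.T           # (A^T A)^{-1}
    return x, sigma2, unscaled_cov

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 3))
    b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
    x, sigma2, unscaled_cov = ls_covariance(A, b)
    print("solution:", x)
    print("estimated variances of the components:", sigma2 * np.diag(unscaled_cov))
```

The diagonal of `unscaled_cov` gives the ratio of each component's variance to $\sigma^2$, so a large entry flags a component that is sensitive to noise in the right-hand side.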
NA · Mar 27
Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi et al.
Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), pp. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer slices, the product of these slices is computed exactly, and AB is approximated by accumulating these integer products in floating-point arithmetic. This technique is particularly well suited to mixed-precision matrix multiply-accumulate units with integer support, such as the NVIDIA tensor cores or the AMD matrix cores. The number of slices allows for performance-accuracy tradeoffs: more slices yield better accuracy but require more multiplications, which in turn reduce performance. We propose an inexpensive way to estimate the minimum number of multiplications needed to achieve a prescribed level of accuracy. Our error analysis shows that the algorithm may become inaccurate (or inefficient) if rows of A or columns of B are badly scaled. We perform a range of numerical experiments, both in simulation and on the latest NVIDIA GPUs, that confirm the analysis and illustrate strengths and weaknesses of the algorithm.
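The slicing idea can be simulated in NumPy. The sketch below is a simplified illustration, not the authors' algorithm or a GPU implementation: the function names, the 7-bit slice width, the per-row power-of-two scaling, and the choice to multiply every pair of slices are all assumptions made for the example, and `int64` stands in for the hardware integer accumulators.

```python
import numpy as np

def split_rows(A, num_slices, bits):
    """Split each row of A into integer slices after power-of-two row scaling.

    Returns `scale` (m x 1) and a list of int64 matrices S_1..S_k such that
    A is approximately scale * sum_s S_s * 2**(-bits * s), with |S_s| < 2**bits.
    """
    max_abs = np.max(np.abs(A), axis=1, keepdims=True)
    _, exp = np.frexp(max_abs)               # max_abs = mant * 2**exp, 0.5 <= mant < 1
    scale = np.ldexp(1.0, exp)               # power of two bounding each row
    R = A / scale                            # entries in (-1, 1), exact scaling
    slices = []
    for _ in range(num_slices):
        R = R * 2.0 ** bits                  # shift the next `bits` bits up
        S = np.trunc(R)                      # integer part is the next slice
        slices.append(S.astype(np.int64))
        R = R - S                            # exact remainder in (-1, 1)
    return scale, slices

def int_sliced_matmul(A, B, num_slices=4, bits=7):
    """Approximate A @ B by accumulating exact integer slice products."""
    scale_a, slices_a = split_rows(A, num_slices, bits)      # rows of A
    scale_b, slices_b = split_rows(B.T, num_slices, bits)    # columns of B
    C = np.zeros((A.shape[0], B.shape[1]))
    for s, A_s in enumerate(slices_a):
        for t, B_t in enumerate(slices_b):
            P = A_s @ B_t.T                                  # exact integer product
            C += np.ldexp(P.astype(np.float64), -bits * (s + t + 2))
    return scale_a * C * scale_b.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 60))
    B = rng.standard_normal((60, 30))
    for k in (2, 4, 8):
        err = np.max(np.abs(int_sliced_matmul(A, B, num_slices=k) - A @ B))
        print(f"slices={k}  max abs error={err:.3e}")
```

Because each row is scaled by its largest entry, small entries in a row with one large entry lose leading bits, which is one way to see why badly scaled rows of A or columns of B hurt accuracy for a fixed number of slices.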
LG · Nov 23, 2020
Integrating Deep Learning in Domain Sciences at Exascale
Rick Archibald, Edmond Chow, Eduardo D'Azevedo et al.
This paper presents some of the current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications on large-scale HPC systems efficiently, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the necessities of an HPC deep learning framework and how those needs can be provided (e.g., as in MagmaDNN) through a deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, CuBLAS, CuDNN, MKL, and HIP. Advancements are also illustrated through the use of algorithmic enhancements in reduced- and mixed-precision, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.
DC · Dec 14, 2009
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Emmanuel Agullo, Camille Coti, Jack Dongarra et al.
Previous studies have reported that common dense linear algebra operations do not achieve speedup by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to combine a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).
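The communication pattern behind a tall-and-skinny QR can be sketched on a single machine. In the example below, each row block stands in for the data held at one site: the blocks are factored independently, and only their small R factors are combined. This is a one-level, flat-reduction simplification written for illustration; the function name `tsqr` is hypothetical, and the sketch does not reproduce the paper's reduction tree or its use of QCG-OMPI and ScaLAPACK.

```python
import numpy as np

def tsqr(A, num_blocks):
    """One-level tall-skinny QR: local QR per row block, then QR of the stacked R's.

    Assumes every row block has at least as many rows as A has columns.
    Returns Q (m x n, orthonormal columns) and R (n x n, upper triangular).
    """
    n = A.shape[1]
    blocks = np.array_split(A, num_blocks, axis=0)
    local = [np.linalg.qr(blk) for blk in blocks]        # independent local factorizations
    R_stack = np.vstack([R_loc for _, R_loc in local])   # only n x n factors are exchanged
    Q2, R = np.linalg.qr(R_stack)                        # combine step
    # Apply the combine step's Q to each local Q to recover the global Q.
    Q = np.vstack([Q_loc @ Q2[i * n:(i + 1) * n]
                   for i, (Q_loc, _) in enumerate(local)])
    return Q, R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 8))
    Q, R = tsqr(A, num_blocks=4)
    print("reconstruction error:", np.linalg.norm(Q @ R - A))
    print("orthogonality error:", np.linalg.norm(Q.T @ Q - np.eye(8)))
```

The only data that crosses block boundaries is the stack of n x n triangular factors, which is why this family of algorithms suits settings where communication between sites is expensive.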