45.5NAMay 27
Factorized Krylov subspace methods for solving large Sylvester equationsYuki Satake, Takeshi Fukaya, Tomohiro Sogabe et al.
Krylov subspace methods, such as the Conjugate Gradient (CG) and BiCGSTAB methods, are widely used in scientific computing for solving linear systems. In this study, we propose a new framework for solving large Sylvester equations in a low-rank format by reconstructing matrix-oriented Krylov subspace methods. The framework realizes efficient algorithms that are mathematically equivalent to the matrix-oriented Krylov subspace methods by exploiting the mathematical properties of the Sylvester operator and the low-rank structure of the right-hand side. Specifically, by leveraging these properties, approximate solutions can be expressed in a low-rank factorized form, enabling efficient computation and reduced memory requirements. The effectiveness of our algorithms is demonstrated through numerical experiments.
COMP-PHDec 19, 2018Code
EigenKernel - A middleware for parallel generalized eigenvalue solvers to attain high scalability and usabilityKazuyuki Tanaka, Hiroto Imachi, Tomoya Fukumoto et al.
An open-source middleware EigenKernel was developed for use with parallel generalized eigenvalue solvers or large-scale electronic state calculation to attain high scalability and usability. The middleware enables the users to choose the optimal solver, among the three parallel eigenvalue libraries of ScaLAPACK, ELPA, EigenExa and hybrid solvers constructed from them, according to the problem specification and the target architecture. The benchmark was carried out on the Oakforest-PACS supercomputer and reveals that ELPA, EigenExa and their hybrid solvers show better performance, when compared with pure ScaLAPACK solvers. The benchmark on the K computer is also used for discussion. In addition, a preliminary research for the performance prediction was investigated, so as to predict the elapsed time T as the function of the number of used nodes P (T=T(P)). The prediction is based on Bayesian inference using the Markov Chain Monte Carlo (MCMC) method and the test calculation indicates that the method is applicable not only to performance interpolation but also to extrapolation. Such a middleware is of crucial importance for application-algorithm-architecture co-design among the current, next-generation (exascale), and future-generation (post-Moore era) supercomputers.
NASep 28, 2018
Shifted CholeskyQR for computing the QR factorization of ill-conditioned matricesTakeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa et al.
The Cholesky QR algorithm is an efficient communication-minimizing algorithm for computing the QR factorization of a tall-skinny matrix. Unfortunately it has the inherent numerical instability and breakdown when the matrix is ill-conditioned. A recent work establishes that the instability can be cured by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix runs to completion, which means it does not always work for matrices $X$ with $κ_2(X)\gtrsim {\bf u}^{-\frac{1}{2}}$ where ${\bf u}$ is the unit roundoff. In this work we extend the applicability to $κ_2(X)=\mathcal{O}({\bf u}^{-1})$ by introducing a shift to the computed Gram matrix so as to guarantee the Cholesky factorization $R^TR= A^TA+sI$ succeeds numerically. We show that the computed $AR^{-1}$ has reduced condition number $\leq {\bf u}^{-\frac{1}{2}}$, for which CholeskyQR2 safely computes the QR factorization, yielding a computed $Q$ of orthogonality $\|Q^TQ-I\|_2$ and residual $\|A-QR\|_F/\|A\|_F$ both $\mathcal{O}({\bf u})$. Thus we obtain the required QR factorization by essentially running Cholesky QR thrice. We extensively analyze the resulting algorithm shiftedCholeskyQR to reveal its excellent numerical stability. shiftedCholeskyQR is also highly parallelizable, and applicable and effective also when working in an oblique inner product space. We illustrate our findings through experiments, in which we achieve significant (up to x40) speedup over alternative methods.