NAMar 30, 2017
Efficient implementations of the modified Gram-Schmidt orthogonalization with a non-standard inner productAkira Imakura, Yusaku Yamamoto
The modified Gram-Schmidt (MGS) orthogonalization is one of the most well-used algorithms for computing the thin QR factorization. MGS can be straightforwardly extended to a non-standard inner product with respect to a symmetric positive definite matrix $A$. For the thin QR factorization of an $m \times n$ matrix with the non-standard inner product, a naive implementation of MGS requires $2n$ matrix-vector multiplications (MV) with respect to $A$. In this paper, we propose $n$-MV implementations: a high accuracy (HA) type and a high performance (HP) type, of MGS. We also provide error bounds of the HA-type implementation. Numerical experiments and analysis indicate that the proposed implementations have competitive advantages over the naive implementation in terms of both computational cost and accuracy.
NAFeb 1, 2017
On the optimality and sharpness of Laguerre's lower bound on the smallest eigenvalue of a symmetric positive definite matrixYusaku Yamamoto
Lower bounds on the smallest eigenvalue of a symmetric positive definite matrices $A\in\mathbb{R}^{m\times m}$ play an important role in condition number estimation and in iterative methods for singular value computation. In particular, the bounds based on ${\rm Tr}(A^{-1})$ and ${\rm Tr}(A^{-2})$ attract attention recently because they can be computed in $O(m)$ work when $A$ is tridiagonal. In this paper, we focus on these bounds and investigate their properties in detail. First, we consider the problem of finding the optimal bound that can be computed solely from ${\rm Tr}(A^{-1})$ and ${\rm Tr}(A^{-2})$ and show that so called Laguerre's lower bound is the optimal one in terms of sharpness. Next, we study the gap between the Laguerre bound and the smallest eigenvalue. We characterize the situation in which the gap becomes largest in terms of the eigenvalue distribution of $A$ and show that the gap becomes smallest when ${\rm Tr}(A^{-2})/\{{\rm Tr}(A^{-1})\}^2$ approaches 1 or $\frac{1}{m}$. These results will be useful, for example, in designing efficient shift strategies for singular value computation algorithms.
15.0NAApr 28
On solving nonlinear simultaneous equations arising from the double-exponential Sinc-collocation method for initial value problemsYusaku Yamamoto, Ken'ichiro Tanaka
The double-exponential Sinc-collocation method is known as a super-accurate method for solving initial value problems of ordinary differential equations, for which the error decreases almost exponentially as a function of the number of sample points in the temporal direction, $N$. However, this method requires solving nonlinear simultaneous equations in $nN$ variables when the problem dimension is $n$. Recently, Ogata pointed out that Gauss-Seidel type fixed-point iteration works surprisingly well for solving these equations, typically reducing the error by one or two orders of magnitude at each iteration. In this paper, we analyze the convergence of this iteration and give a sufficient condition for its global convergence. We also provide an upper bound on its convergence factor, which explains the efficiency of this iteration. Some numerical examples that illustrate the validity of our analysis are also provided.
COMP-PHDec 19, 2018Code
EigenKernel - A middleware for parallel generalized eigenvalue solvers to attain high scalability and usabilityKazuyuki Tanaka, Hiroto Imachi, Tomoya Fukumoto et al.
An open-source middleware EigenKernel was developed for use with parallel generalized eigenvalue solvers or large-scale electronic state calculation to attain high scalability and usability. The middleware enables the users to choose the optimal solver, among the three parallel eigenvalue libraries of ScaLAPACK, ELPA, EigenExa and hybrid solvers constructed from them, according to the problem specification and the target architecture. The benchmark was carried out on the Oakforest-PACS supercomputer and reveals that ELPA, EigenExa and their hybrid solvers show better performance, when compared with pure ScaLAPACK solvers. The benchmark on the K computer is also used for discussion. In addition, a preliminary research for the performance prediction was investigated, so as to predict the elapsed time T as the function of the number of used nodes P (T=T(P)). The prediction is based on Bayesian inference using the Markov Chain Monte Carlo (MCMC) method and the test calculation indicates that the method is applicable not only to performance interpolation but also to extrapolation. Such a middleware is of crucial importance for application-algorithm-architecture co-design among the current, next-generation (exascale), and future-generation (post-Moore era) supercomputers.
NASep 28, 2018
Shifted CholeskyQR for computing the QR factorization of ill-conditioned matricesTakeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa et al.
The Cholesky QR algorithm is an efficient communication-minimizing algorithm for computing the QR factorization of a tall-skinny matrix. Unfortunately it has the inherent numerical instability and breakdown when the matrix is ill-conditioned. A recent work establishes that the instability can be cured by repeating the algorithm twice (called CholeskyQR2). However, the applicability of CholeskyQR2 is still limited by the requirement that the Cholesky factorization of the Gram matrix runs to completion, which means it does not always work for matrices $X$ with $κ_2(X)\gtrsim {\bf u}^{-\frac{1}{2}}$ where ${\bf u}$ is the unit roundoff. In this work we extend the applicability to $κ_2(X)=\mathcal{O}({\bf u}^{-1})$ by introducing a shift to the computed Gram matrix so as to guarantee the Cholesky factorization $R^TR= A^TA+sI$ succeeds numerically. We show that the computed $AR^{-1}$ has reduced condition number $\leq {\bf u}^{-\frac{1}{2}}$, for which CholeskyQR2 safely computes the QR factorization, yielding a computed $Q$ of orthogonality $\|Q^TQ-I\|_2$ and residual $\|A-QR\|_F/\|A\|_F$ both $\mathcal{O}({\bf u})$. Thus we obtain the required QR factorization by essentially running Cholesky QR thrice. We extensively analyze the resulting algorithm shiftedCholeskyQR to reveal its excellent numerical stability. shiftedCholeskyQR is also highly parallelizable, and applicable and effective also when working in an oblique inner product space. We illustrate our findings through experiments, in which we achieve significant (up to x40) speedup over alternative methods.