MS Dec 22, 2016
An efficient hybrid tridiagonal divide-and-conquer algorithm on distributed memory architectures
Shengguo Li, Francois-Henry Rouet, Jie Liu et al.
In this paper, an efficient divide-and-conquer (DC) algorithm is proposed for symmetric tridiagonal matrices, based on ScaLAPACK and hierarchically semiseparable (HSS) matrices. HSS matrices are an important type of rank-structured matrix. Most of the DC algorithm's time is spent computing the eigenvectors via matrix-matrix multiplications (MMM). In our parallel hybrid DC (PHDC) algorithm, MMM is accelerated using HSS matrix techniques when the intermediate matrix is large. All HSS algorithms are performed via the STRUMPACK package. PHDC has been tested on many different matrices. Compared with the DC implementation in MKL, PHDC can be faster for some matrices with few deflations when using hundreds of processes. However, the gains decrease as the number of processes increases. The comparisons of PHDC with ELPA (the Eigenvalue soLvers for Petascale Applications library) are similar. PHDC is usually slower than MKL and ELPA when using 300 or more processes on the Tianhe-2 supercomputer.
DC Mar 5, 2020
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach
Peng Zhang, Jianbin Fang, Canqun Yang et al.
This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a performance model to estimate the resulting performance of the target application under a given resource partition and task granularity configuration. The model is used as a utility to quickly search for a good configuration at runtime. Instead of hand-crafting an analytical model that requires expert insights into low-level hardware details, we employ machine learning techniques to automatically learn it. We achieve this by first learning a predictive model offline using training programs. The learnt model can then be used to predict the performance of any unseen program at runtime. We apply our approach to 39 representative parallel applications and evaluate it on two representative heterogeneous many-core platforms: a CPU-XeonPhi platform and a CPU-GPU platform. Compared to the single-stream version, our approach achieves, on average, a 1.6x and 1.1x speedup on the XeonPhi and the GPU platform, respectively. These results translate to over 93% of the performance delivered by a theoretically perfect predictor.
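The model-guided search described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `best_config`, `toy_model`, and the candidate values are hypothetical, and the toy scoring function merely stands in for the learned performance model, which the abstract does not specify.

```python
from itertools import product
from typing import Callable, Iterable, Tuple

# A configuration is (number of resource partitions, number of tasks).
Config = Tuple[int, int]


def best_config(
    predict: Callable[[Config], float],
    partitions: Iterable[int],
    task_counts: Iterable[int],
) -> Config:
    """Return the candidate configuration with the highest predicted performance."""
    return max(product(partitions, task_counts), key=predict)


def toy_model(config: Config) -> float:
    """Hypothetical stand-in for a learned model; peaks at 4 partitions, 16 tasks."""
    partitions, tasks = config
    return -abs(partitions - 4) - abs(tasks - 16) / 4.0


chosen = best_config(toy_model, [1, 2, 4, 8], [4, 8, 16, 32])
print(chosen)
```

Because the predictor is cheap to evaluate, an exhaustive scan over a small candidate grid is feasible at runtime; a real system would replace `toy_model` with the trained predictor.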
OC Nov 5, 2018
Non-ergodic Convergence Analysis of Heavy-Ball Algorithms
Tao Sun, Penghang Yin, Dongsheng Li et al.
In this paper, we revisit the convergence of the Heavy-ball method and present improved convergence complexity results in the convex setting. We provide the first non-ergodic O(1/k) rate result for the Heavy-ball algorithm with constant step size for coercive objective functions. For objective functions satisfying a relaxed strongly convex condition, linear convergence is established under weaker assumptions on the step size and inertial parameter than those made in the existing literature. We extend our results to the multi-block version of the algorithm with both cyclic and stochastic update rules. In addition, our results can also be extended to decentralized optimization, where the ergodic analysis is not applicable.
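For readers unfamiliar with the method, the standard Heavy-ball update with constant step size is x_{k+1} = x_k - step * grad f(x_k) + momentum * (x_k - x_{k-1}). The sketch below applies it to a small convex quadratic. The function names, the test problem, and the parameter values are assumptions chosen for illustration and are not taken from the paper.

```python
from typing import Callable, List

Vector = List[float]


def heavy_ball(
    grad: Callable[[Vector], Vector],
    x0: Vector,
    step: float,
    momentum: float,
    iterations: int,
) -> Vector:
    """Run the Heavy-ball iteration with constant step size and inertial parameter."""
    x_prev = list(x0)  # x_{k-1}; equal to x_0 at the start, so the first step is plain gradient descent
    x = list(x0)       # x_k
    for _ in range(iterations):
        g = grad(x)
        x_next = [
            xi - step * gi + momentum * (xi - xpi)
            for xi, gi, xpi in zip(x, g, x_prev)
        ]
        x_prev, x = x, x_next
    return x


def quad_grad(x: Vector) -> Vector:
    """Gradient of the illustrative objective f(x) = 0.5*(x0^2 + 10*x1^2) - x0 - x1."""
    return [x[0] - 1.0, 10.0 * x[1] - 1.0]


# The minimizer of the illustrative objective is (1.0, 0.1).
solution = heavy_ball(quad_grad, [0.0, 0.0], step=0.1, momentum=0.5, iterations=200)
print(solution)
```

The returned point is the last iterate rather than an average of iterates, which is the quantity a non-ergodic rate refers to.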
DC Mar 19, 2014
MPISE: Symbolic Execution of MPI Programs
Xianjin Fu, Zhenbang Chen, Yufeng Zhang et al.
The Message Passing Interface (MPI) plays an important role in parallel computing. Many parallel applications are implemented as MPI programs. Existing bug-detection methods for MPI programs fall short of providing both input coverage and non-determinism coverage, leading to missed bugs. In this paper, we employ symbolic execution to ensure input coverage, and propose an on-the-fly scheduling algorithm that reduces the interleavings explored for non-determinism coverage while ensuring soundness and completeness. We have implemented our approach as a tool, called MPISE, which can automatically detect deadlocks and runtime bugs in MPI programs. Experimental results on benchmark programs and real-world MPI programs indicate that MPISE finds bugs effectively and efficiently. In addition, our tool provides diagnostic information and a replay mechanism to help in understanding bugs.