Massimiliano Fasi

5papers

6citations

Novelty38%

AI Score49

Ranked #45,865 of 201,326 authors (top 23%)#75 in NA (top 13%)

5 Papers

1.7NAMay 22

Computing matrix functions associated with a Hermitian--definite pencil

Dario A. Bini, Massimiliano Fasi, Bruno Iannazzo

We consider the numerical evaluation of the quantity $Af(A^{-1}B)$, where $A$ is Hermitian positive definite, $B$ is Hermitian, and $f$ is a function defined on the spectrum of $A^{-1}B$. This problem is related to the Hermitian-definite matrix pencil $B-λA$. We study the conditioning of the problem, and we introduce several algorithms that combine the Schur decomposition with either the matrix square root or the Cholesky factorization. We study the numerical behavior of these algorithms in floating-point arithmetic, assess their computational costs, and compare their numerical performance. Our analysis suggests that the algorithms based on the Cholesky factorization will be more accurate and efficient than those based on the matrix square root. This is confirmed by our numerical experiments.

27.0MSApr 9Code

An Efficient Batch Solver for the Singular Value Decomposition on GPUs

Ahmad Abdelfattah, Massimiliano Fasi

The singular value decomposition (SVD) is a powerful tool in modern numerical linear algebra, which underpins computational methods such as principal component analysis (PCA), low-rank approximations, and randomized algorithms. Many practical scenarios require solving numerous small SVD problems, a regime generally referred to as "batch SVD". Existing programming models can handle this efficiently on parallel CPU architectures, but high-performance solutions for GPUs remain immature. A GPU-oriented batch SVD solver is introduced. This solver exploits the one-sided Jacobi algorithm to exploit fine-grained parallelism, and a number of algorithmic and design optimizations achieve unmatched performance. Starting from a baseline solver, a sequence of optimizations is applied to obtain incremental performance gains. Numerical experiments show that the new solver is robust across problems with different numerical properties, matrix shapes, and arithmetic precisions. Performance benchmarks on both NVIDIA and AMD systems show significant performance speedups over vendor solutions as well as existing open-source solvers.

23.5NAMar 27

Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic

Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi et al.

Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), p. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer slices, the product of these slices is computed exactly, and AB is approximated by accumulating these integer products in floating-point arithmetic. This technique is particularly well suited to mixed-precision matrix multiply-accumulate units with integer support, such as the NVIDIA tensor cores or the AMD matrix cores. The number of slices allows for performance-accuracy tradeoffs: more slices yield better accuracy but require more multiplications, which in turn reduce performance. We propose an inexpensive way to estimate the minimum number of multiplications needed to achieve a prescribed level of accuracy. Our error analysis shows that the algorithm may become inaccurate (or inefficient) if rows of A or columns of B are badly scaled. We perform a range of numerical experiments, both in simulation and on the latest NVIDIA GPUs, that confirm the analysis and illustrate strengths and weaknesses of the algorithm.

74.8NAMar 26

Mixed-precision algorithms for solving the Sylvester matrix equation

Andrii Dmytryshyn, Massimiliano Fasi, Nicholas J. Higham et al.

We consider the solution of the Sylvester equation $AX+XB=C$ in mixed precision. We derive a new iterative refinement scheme to solve perturbed quasi-triangular Sylvester equations; our rounding error analysis provides sufficient conditions for convergence and a bound on the attainable relative residual. We leverage this iterative scheme to solve the general Sylvester equation. The new algorithms compute the Schur decomposition of the coefficient matrices $A$ and $B$ in lower than working precision, use the low-precision Schur factors to obtain an approximate solution to the perturbed quasi-triangular equation, and iteratively refine it to obtain a working-precision solution. In order to solve the original equation to working precision, the unitary Schur factors of the coefficient matrices must be unitary to working precision, but this is not the case if the Schur decomposition is computed in low precision. We propose two effective approaches to address this: one is based on re-orthonormalization in working precision, and the other on explicit inversion of the almost-unitary factors. The two mixed-precision algorithms thus obtained are tested on various Sylvester and Lyapunov equations from the literature. Our numerical experiments show that, for both types of equations, the new algorithms are at least as accurate as existing ones. Our cost analysis, on the other hand, suggests that they would typically be faster than mono-precision alternatives if implemented on hardware that natively supports low precision.

48.3NAMar 25

Probabilistic Error Analysis of Limited-Precision Stochastic Rounding: Horner's Algorithm and Pairwise Summation

El-Mehdi El Arar, Massimiliano Fasi, Silviu-Ioan Filip et al.

Stochastic rounding (SR) is a probabilistic rounding mode that mitigates errors in large-scale numerical computations, especially when prone to stagnation effects. Beyond numerical analysis, SR has shown significant benefits in practical applications such as deep learning and climate modelling. The definition of classical SR requires that results of arithmetic operations are known with infinite precision. This is often not possible, and when it is, the resulting hardware implementation can become prohibitively expensive in terms of energy, area, and latency. A more practical alternative is limited-precision SR, which only requires that the outputs of arithmetic operations are available in higher, finite, precision. We extend previous work on limited-precision SR presented in [El Arar et al., SIAM J. Sci. Comput. 47(5) (2025), B1227-B1249], which developed a framework to evaluate the trade-off between accuracy and hardware resource cost in SR implementations. Within this framework, we study the Horner algorithm and pairwise summation, providing both theoretical insights and practical experiments in these settings when using limited-precision SR.