25.5 NA Mar 27
Analysis of Floating-Point Matrix Multiplication Computed via Integer Arithmetic
Ahmad Abdelfattah, Jack Dongarra, Massimiliano Fasi et al.
Ootomo, Ozaki, and Yokota [Int. J. High Perform. Comput. Appl., 38 (2024), pp. 297-313] have proposed a strategy to recast a floating-point matrix multiplication in terms of integer matrix products. The factors A and B are split into integer slices, the products of these slices are computed exactly, and AB is approximated by accumulating these integer products in floating-point arithmetic. This technique is particularly well suited to mixed-precision matrix multiply-accumulate units with integer support, such as the NVIDIA tensor cores or the AMD matrix cores. The number of slices allows for performance-accuracy tradeoffs: more slices yield better accuracy but require more multiplications, which in turn reduces performance. We propose an inexpensive way to estimate the minimum number of multiplications needed to achieve a prescribed level of accuracy. Our error analysis shows that the algorithm may become inaccurate (or inefficient) if rows of A or columns of B are badly scaled. We perform a range of numerical experiments, both in simulation and on the latest NVIDIA GPUs, that confirm the analysis and illustrate strengths and weaknesses of the algorithm.
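The slicing idea described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`split_rows`, `int_products`, `sliced_matmul`), the 7-bit slice width, the power-of-two row and column scaling, and the choice to multiply every pair of slices are all assumptions made for illustration.

```python
import math

def split_rows(M, num_slices, bits):
    """Scale each row by a power of two, then peel off integer slices."""
    exps, slices = [], [[] for _ in range(num_slices)]
    for row in M:
        biggest = max(abs(x) for x in row)
        # frexp gives biggest = m * 2**e with 0.5 <= m < 1, so |x| / 2**e < 1.
        e = math.frexp(biggest)[1] if biggest else 0
        exps.append(e)
        rem = [math.ldexp(x, -e) for x in row]
        for k in range(num_slices):
            shifted = [math.ldexp(r, bits) for r in rem]
            ints = [int(v) for v in shifted]              # truncate toward zero
            rem = [v - q for v, q in zip(shifted, ints)]  # remainder, computed exactly
            slices[k].append(ints)
    return exps, slices

def int_products(X, Yt):
    """Exact integer product X * Yt^T (Python integers do not overflow)."""
    return [[sum(x * y for x, y in zip(rx, ry)) for ry in Yt] for rx in X]

def sliced_matmul(A, B, num_slices=4, bits=7):
    """Approximate A @ B from exact integer slice products (illustrative only)."""
    ea, sa = split_rows(A, num_slices, bits)
    Bt = [list(col) for col in zip(*B)]      # slice B column by column
    eb, sb = split_rows(Bt, num_slices, bits)
    m, n = len(A), len(Bt)
    C = [[0.0] * n for _ in range(m)]
    for k in range(num_slices):
        for l in range(num_slices):
            P = int_products(sa[k], sb[l])
            w = -bits * (k + l + 2)          # slice k carries weight 2**(-bits*(k+1))
            for i in range(m):
                for j in range(n):
                    # Floating-point accumulation of the exact integer products.
                    C[i][j] += math.ldexp(float(P[i][j]), w)
    # Undo the row scaling of A and the column scaling of B.
    return [[math.ldexp(C[i][j], ea[i] + eb[j]) for j in range(n)] for i in range(m)]
```

Calling `sliced_matmul(A, B, num_slices=s)` with a larger `s` retains more bits of each scaled entry at the cost of more integer products (`s * s` in this sketch). Because each row of A shares a single scaling exponent, entries much smaller than the largest entry in their row lose bits to truncation, which is consistent with the abstract's warning about badly scaled rows and columns.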
42.7 MS Apr 4
Accurate Models of NVIDIA Tensor Cores
Faizan A. Khattak, Mantas Mikaitis
Matrix multiplication is a fundamental operation in both training and inference of neural networks. To accelerate it, Graphics Processing Units (GPUs) implement matrix multiplication in hardware. Because of their higher throughput compared with software-based matrix multiplication, these matrix multipliers are increasingly used outside of AI to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, with different vendors offering different numerical features. This leads to non-reproducible results across different generations of GPU architectures, at the matrix multiply-accumulate instruction level. To study numerical characteristics of matrix multipliers -- such as rounding behaviour, accumulator width, normalization points, extra carry bits, and others -- test vectors are typically constructed. Yet, these vectors may or may not distinguish between different hardware models, and due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models for emulating the inner product behaviour of low- and mixed-precision matrix multipliers in the V100, A100, H100 and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.
47.6 NA Mar 25
Probabilistic Error Analysis of Limited-Precision Stochastic Rounding: Horner's Algorithm and Pairwise Summation
El-Mehdi El Arar, Massimiliano Fasi, Silviu-Ioan Filip et al.
Stochastic rounding (SR) is a probabilistic rounding mode that mitigates errors in large-scale numerical computations, especially those prone to stagnation effects. Beyond numerical analysis, SR has shown significant benefits in practical applications such as deep learning and climate modelling. The definition of classical SR requires that results of arithmetic operations are known with infinite precision. This is often not possible, and when it is, the resulting hardware implementation can become prohibitively expensive in terms of energy, area, and latency. A more practical alternative is limited-precision SR, which only requires that the outputs of arithmetic operations are available in higher, but finite, precision. We extend previous work on limited-precision SR presented in [El Arar et al., SIAM J. Sci. Comput. 47(5) (2025), B1227-B1249], which developed a framework to evaluate the trade-off between accuracy and hardware resource cost in SR implementations. Within this framework, we study the Horner algorithm and pairwise summation, providing both theoretical insights and practical experiments in these settings when using limited-precision SR.
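To make the two algorithms named in this abstract concrete, the following is a minimal Python sketch of stochastic rounding to `p` significant bits, applied after every operation in Horner's rule and in pairwise summation. The helper names (`sr_round`, `horner_sr`, `pairwise_sum_sr`) are hypothetical. Each intermediate result is held in binary64 before being rounded, so the random decision sees only finitely many extra bits; this is loosely in the spirit of limited-precision SR and is an assumption of the sketch, not the model analysed in the cited framework.

```python
import math
import random

def sr_round(x, p, rng):
    """Stochastically round the binary64 value x to p significant bits (p <= 53)."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)           # x = m * 2**e with 0.5 <= |m| < 1
    scaled = math.ldexp(m, p)      # |scaled| lies in [2**(p-1), 2**p)
    lo = math.floor(scaled)        # lower neighbour on the p-bit grid
    frac = scaled - lo             # distance to lo, in [0, 1)
    if rng.random() < frac:        # round up with probability frac
        lo += 1
    return math.ldexp(float(lo), e - p)

def horner_sr(coeffs, x, p, rng):
    """Evaluate sum(coeffs[i] * x**i) by Horner's rule, rounding after each operation."""
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = sr_round(acc * x, p, rng)
        acc = sr_round(acc + c, p, rng)
    return acc

def pairwise_sum_sr(values, p, rng):
    """Sum values by recursive halving, rounding each partial sum with SR."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    left = pairwise_sum_sr(values[:mid], p, rng)
    right = pairwise_sum_sr(values[mid:], p, rng)
    return sr_round(left + right, p, rng)
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes runs repeatable. Values that already fit in `p` bits are returned unchanged by `sr_round`; any other value is rounded to one of its two neighbours on the `p`-bit grid with probabilities that make the result correct in expectation.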