DC LG, Jan 14

A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication

Yufan Xia, Marco De La Pierre, Amanda S. Barnard, Giuseppe Maria Junior Barca

arXiv:2601.09114v11.22 citations | h-index: 18 | IPDPS

Originality: Incremental advance

AI Analysis

This work addresses runtime optimization of matrix multiplication in scientific computing. It is a proof-of-concept that improves multi-threaded GEMM performance in high-performance computing applications by tuning the thread count rather than the GEMM kernel itself.

The paper tackles the problem of minimising multi-threaded GEMM runtime on modern multi-core shared-memory systems. A machine learning model automatically selects the number of threads for each GEMM task, giving a 25 to 40% speedup over traditional BLAS GEMM implementations on two HPC node architectures (two-socket Intel Cascade Lake and two-socket AMD Zen 3) for GEMM tasks using up to 100 MB of memory.

GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well optimised with techniques such as blocking and autotuning. However, due to the complexity of modern multi-core shared-memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model, trained on collected runtime data, to select on the fly the optimal number of threads for a given GEMM task. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS for GEMM tasks with memory usage within 100 MB.
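The thread-selection idea in the abstract can be illustrated with a minimal sketch. It is not the ADSALA implementation: the abstract does not name the model, so a 1-nearest-neighbour lookup stands in for it here, and the function names (`predict_runtime`, `select_threads`), the candidate thread counts, and the sample timings are all hypothetical. The sketch predicts a runtime for each candidate thread count from previously collected data and picks the count with the smallest prediction.

```python
import math

# Hypothetical training samples: (m, k, n, threads, runtime_seconds).
# The timings are invented for illustration; they are not results from the paper.
SAMPLES = [
    (64, 64, 64, 1, 0.00010),
    (64, 64, 64, 8, 0.00030),
    (64, 64, 64, 32, 0.00090),
    (2048, 2048, 2048, 1, 1.60),
    (2048, 2048, 2048, 8, 0.24),
    (2048, 2048, 2048, 32, 0.09),
]


def features(m, k, n, threads):
    """Map a GEMM task and a thread count to log-scaled features."""
    return (math.log2(m), math.log2(k), math.log2(n), math.log2(threads))


def predict_runtime(m, k, n, threads, samples=SAMPLES):
    """Stand-in model: return the runtime of the nearest training sample."""
    query = features(m, k, n, threads)

    def distance(sample):
        return math.dist(query, features(*sample[:4]))

    return min(samples, key=distance)[4]


def select_threads(m, k, n, candidates=(1, 8, 32), samples=SAMPLES):
    """Pick the candidate thread count with the lowest predicted runtime."""
    return min(candidates, key=lambda t: predict_runtime(m, k, n, t, samples))


if __name__ == "__main__":
    for shape in [(64, 64, 64), (100, 100, 100), (2048, 2048, 2048)]:
        print(shape, "->", select_threads(*shape), "threads")
```

In a real setting the selected count would be handed to the BLAS library's own thread control before the GEMM call (for OpenMP-based builds, for example, the `OMP_NUM_THREADS` setting), and the lookup would be replaced by a trained regression model. The point of the sketch is only the structure: predict per candidate, then take the argmin.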
