FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
For deep learning practitioners, this framework makes theoretically faster matrix multiplication practical across diverse hardware, enabling performance gains in LLM training and inference.
FalconGEMM is a cross-platform framework that automates deployment and optimization of lower-complexity matrix multiplication algorithms, achieving a 7.59%-17.85% speedup over GEMM libraries such as cuBLAS and 12.41%-55.61% over LCMA competitors such as AlphaTensor on LLM workloads across GPU and CPU architectures.
Peak-breaking matrix multiplication is a promising technique for improving deep learning (DL) performance, especially in LLM training and inference. We present FalconGEMM, a cross-platform framework that automates the deployment, optimization, and selection of Lower-Complexity Matrix Multiplication Algorithms (LCMAs) across diverse hardware. FalconGEMM has three key innovations: (1) a Deployment Module that uses code generation to enable portable execution across hardware and input configurations; (2) an Execution Module with Group-Parallel Optimizations that maximizes on-chip data reuse, utilizes parallel resources, and reduces bandwidth overhead; and (3) a Decision Module with a lightweight analytical performance model that selects the optimal strategy based on matrix shapes and hardware profiles. We evaluate FalconGEMM extensively on LLM workloads across GPU (H20, A100) and CPU (ARM, x86) architectures with multiple data types. FalconGEMM delivers peak-breaking performance, outperforming GEMM libraries (e.g., cuBLAS, CUTLASS, and Intel MKL) by 7.59%-17.85% and LCMA competitors such as AlphaTensor by 12.41%-55.61%. Our framework makes the theoretical promise of LCMAs practical for production deployment across the heterogeneous landscape of modern hardware.
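To make the idea of an LCMA concrete, the sketch below shows one level of Strassen's algorithm, the classic example of a lower-complexity scheme: it multiplies two matrices split into 2x2 blocks using 7 block products instead of the usual 8. This is an illustration only, not FalconGEMM's generated code; the abstract does not say which LCMAs the framework deploys, and every function name here (`matmul`, `split`, `strassen_one_level`) is hypothetical.

```python
# Illustrative sketch only: one level of Strassen's algorithm, a classic LCMA.
# Not FalconGEMM's implementation; all names below are hypothetical.

def matmul(a, b):
    """Naive product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def sub(a, b):
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split(m):
    """Split an even-sized square matrix into its four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])

def strassen_one_level(a, b, base=matmul):
    """Multiply even-sized square matrices with 7 block products, not 8.

    `base` is the routine used for each block product; a real system
    would plug in a tuned GEMM kernel here (assumption for illustration).
    """
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # The seven Strassen block products.
    m1 = base(add(a11, a22), add(b11, b22))
    m2 = base(add(a21, a22), b11)
    m3 = base(a11, sub(b12, b22))
    m4 = base(a22, sub(b21, b11))
    m5 = base(add(a11, a12), b22)
    m6 = base(sub(a21, a11), add(b11, b12))
    m7 = base(sub(a12, a22), add(b21, b22))
    # Recombine the products into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
assert strassen_one_level(a, b) == matmul(a, b)
```

The saved block product comes at the price of 18 block additions and subtractions (10 before the products, 8 after), which add memory traffic. Whether the trade pays off depends on matrix shape and on the hardware's balance of compute and bandwidth, which is why a selection step such as the Decision Module described above is needed.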