GPU Performance Portability needs Autotuning
This work addresses vendor lock-in and the barriers facing new AI hardware by enabling portable LLM inference without code changes. The contribution is incremental, however, since it builds on existing autotuning and compilation techniques.
The paper tackles limited GPU performance portability for LLM inference by combining JIT compilation with kernel parameter autotuning. It reports outperforming vendor-optimized implementations by up to 230% while reducing kernel code size by 70x.
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
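To make the idea of kernel parameter autotuning concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the toy "kernel" (a tiled matrix multiply on flat lists), the parameter names `block_m` and `block_n`, and the helper `autotune` are all hypothetical, chosen only to illustrate the general pattern of timing every configuration in a search space and keeping the fastest one.

```python
import itertools
import time


def blocked_matmul(a, b, n, block_m, block_n):
    """Toy stand-in for a GPU kernel: n x n matmul on flat lists, tiled by (block_m, block_n)."""
    c = [0.0] * (n * n)
    for i0 in range(0, n, block_m):
        for j0 in range(0, n, block_n):
            # Process one output tile; the tile shape is the tunable parameter.
            for i in range(i0, min(i0 + block_m, n)):
                for j in range(j0, min(j0 + block_n, n)):
                    acc = 0.0
                    for k in range(n):
                        acc += a[i * n + k] * b[k * n + j]
                    c[i * n + j] = acc
    return c


def autotune(kernel, args, search_space, repeats=3):
    """Exhaustively time every configuration; return (best_config, timings)."""
    names = sorted(search_space)
    timings = {}
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args, **config)
            # Keep the fastest of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[values] = best
    best_values = min(timings, key=timings.get)
    return dict(zip(names, best_values)), timings


# Hypothetical problem size and search space, for illustration only.
n = 16
a = [float(i % 7) for i in range(n * n)]
b = [float(i % 5) for i in range(n * n)]
space = {"block_m": [1, 4, 16], "block_n": [1, 4, 16]}

best_config, timings = autotune(blocked_matmul, (a, b, n), space)
print("configurations tried:", len(timings))
print("fastest configuration on this machine:", best_config)
```

The selected configuration depends on the machine running the search, which is the point: the kernel source stays the same and only the parameters change per platform. A realistic tuner would also cache the result per hardware target and input shape, and might prune the search space rather than enumerate it exhaustively.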