SwizzlePerf: Hardware-Aware LLMs for GPU Kernel Performance Optimization
This work addresses inefficient GPU kernel optimization for ML and scientific applications, offering a systematic method that reduces expert effort from weeks to minutes, though it is incremental in that it applies LLMs to a specific domain.
The paper tackles GPU kernel performance optimization by introducing SwizzlePerf, a hardware-aware LLM approach that automatically generates spatial optimizations, achieving up to a 2.06x speedup and a 70% improvement in L2 hit rate across diverse kernels.
Large language models (LLMs) have shown progress in GPU kernel performance engineering, but they rely on inefficient search-based methods that optimize around runtime. Existing approaches lack a key characteristic that human performance engineers rely on for near-optimal utilization: hardware-awareness. By leveraging the workload's specific memory access patterns, architecture specifications, filtered profiling logs, and reflections on historical performance, we can make software-level optimizations that are tailored to the underlying hardware. SwizzlePerf automatically generates spatial optimizations for GPU kernels on disaggregated architectures by giving LLMs explicit hardware-awareness. For a GEMM kernel, SwizzlePerf takes less than 5 minutes to generate the same hardware-specific optimal swizzling pattern that took expert performance engineers 2 weeks to find. On a suite of 10 diverse ML and science kernels, SwizzlePerf generates swizzling patterns for 9 of the kernels, achieving up to a 2.06x speedup and a 70% improvement in L2 hit rate. This work is the first of many steps toward systematically creating hardware-aware LLM performance engineering agents.
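To make the idea of a swizzling pattern concrete, here is a minimal sketch of one possible workgroup-ID remapping. It is not the pattern SwizzlePerf generates. It rests on assumptions not stated in the text: that the hardware dispatches workgroup `pid` to compute die `pid % num_dies` (round-robin), that each die has its own L2 cache, and that neighboring tiles reuse the same data. The names `swizzle_pid` and `tiles_by_die` are hypothetical.

```python
def swizzle_pid(pid: int, num_tiles: int, num_dies: int) -> int:
    """Map a hardware workgroup ID to a logical tile index.

    Assumption: the scheduler places workgroup `pid` on die `pid % num_dies`.
    The remap gives each die one contiguous block of tiles, so tiles that
    share data are more likely to hit in the same L2 cache.
    """
    if num_tiles % num_dies != 0:
        # Keeping the sketch simple: uneven splits need extra handling.
        raise ValueError("num_tiles must be a multiple of num_dies")
    tiles_per_die = num_tiles // num_dies
    die = pid % num_dies    # die this workgroup lands on (assumed policy)
    slot = pid // num_dies  # order of arrival on that die
    return die * tiles_per_die + slot


def tiles_by_die(num_tiles: int, num_dies: int, remap: bool) -> dict[int, list[int]]:
    """List the logical tiles each die processes, with or without the remap."""
    groups: dict[int, list[int]] = {die: [] for die in range(num_dies)}
    for pid in range(num_tiles):
        tile = swizzle_pid(pid, num_tiles, num_dies) if remap else pid
        groups[pid % num_dies].append(tile)
    return groups


if __name__ == "__main__":
    print(tiles_by_die(8, 4, remap=False))
    print(tiles_by_die(8, 4, remap=True))
```

Under the assumed round-robin policy, the unswizzled mapping scatters adjacent tiles across dies, while the remapped version gives each die a contiguous run of tiles. In a real kernel the remap would be applied to the program ID at the top of the kernel body, and the best pattern depends on the kernel's access pattern and the hardware, which is the information the abstract says SwizzlePerf supplies to the LLM.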