ThunderKittens: Simple, Fast, and Adorable AI Kernels
This paper addresses a critical performance problem for AI developers and researchers by providing a simpler and faster way to write GPU kernels, though the contribution appears incremental because it builds on existing kernel optimization efforts.
The paper tackles the bottleneck of mapping AI architectures to GPU hardware by introducing ThunderKittens, a framework that simplifies writing performant AI kernels with key abstractions, resulting in kernels that match or outperform prior ones, including 10-40% gains on attention backwards, 8x on state space models, and 14x on linear attention.
The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on well-established operations like linear attention. The diverse hardware capabilities of GPUs might suggest that we need a wide variety of techniques to achieve high performance. However, our work explores whether a small number of key abstractions can drastically simplify the process. We present ThunderKittens (TK), a framework for writing performant AI kernels that remains easy to use and maintain. Our abstractions map to the three levels of the GPU hierarchy: (1) at the warp-level, we provide 16x16 matrix tiles as basic data structures and PyTorch-like parallel compute operations over tiles, (2) at the thread-block level, we provide a template for overlapping asynchronous operations across parallel warps, and (3) at the grid-level, we provide support to help hide the block launch, tear-down, and memory costs. We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations. We match CuBLAS and FlashAttention-3 on GEMM and attention inference performance and outperform the strongest baselines by $10$-$40\%$ on attention backwards, $8\times$ on state space models, and $14\times$ on linear attention.
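To make the warp-level idea concrete, here is a minimal CPU-only sketch of computing over fixed 16x16 tiles with a few bulk operations. It is not the ThunderKittens API: the helper names (`zeros_tile`, `load_tile`, `store_tile`, `mma_tile`, `tiled_matmul`) are hypothetical, the code is plain Python, not GPU code, and it says nothing about warps, asynchronous overlap, or launch costs. It only illustrates how a matrix multiply can be expressed as whole-tile loads, a tile multiply-accumulate, and whole-tile stores.

```python
TILE = 16  # tile edge length; the abstract names 16x16 tiles as the basic unit


def zeros_tile():
    """Return a TILE x TILE tile filled with zeros (hypothetical helper)."""
    return [[0.0] * TILE for _ in range(TILE)]


def load_tile(matrix, tile_row, tile_col):
    """Copy one TILE x TILE block out of a list-of-lists matrix."""
    r0, c0 = tile_row * TILE, tile_col * TILE
    return [row[c0:c0 + TILE] for row in matrix[r0:r0 + TILE]]


def store_tile(matrix, tile_row, tile_col, tile):
    """Write one TILE x TILE block back into a list-of-lists matrix."""
    r0, c0 = tile_row * TILE, tile_col * TILE
    for i in range(TILE):
        matrix[r0 + i][c0:c0 + TILE] = tile[i]


def mma_tile(acc, a, b):
    """Accumulate the tile product a @ b into acc, in place."""
    for i in range(TILE):
        for j in range(TILE):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(TILE))


def tiled_matmul(a, b):
    """Multiply a (m x k) by b (k x n), touching data only in whole tiles."""
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must agree"
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0, \
        "this sketch assumes every dimension is a multiple of TILE"
    out = [[0.0] * n for _ in range(m)]
    for ti in range(m // TILE):          # output tile row
        for tj in range(n // TILE):      # output tile column
            acc = zeros_tile()
            for tk in range(k // TILE):  # walk the shared dimension tile by tile
                mma_tile(acc, load_tile(a, ti, tk), load_tile(b, tk, tj))
            store_tile(out, ti, tj, acc)
    return out
```

The caller in this sketch never indexes individual elements: all work goes through tile-granular operations, which is the style of interface the abstract describes at the warp level.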