LG · Nov 11, 2025

HipKittens: Fast and Furious AMD Kernels

William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora

arXiv:2511.08083v17.11 citations · h-index: 11 · Has Code

Originality: Incremental advance

AI Analysis

This work aims to simplify and accelerate AI kernel development for AMD GPUs. Its tile-based approach is meant to carry across GPU vendors; the advance is incremental, but it matters to AI practitioners and to hardware diversity.

The paper tackles the challenge of developing high-performance AI kernels for AMD GPUs by introducing HipKittens, a programming framework that adapts tile-based abstractions from NVIDIA-focused DSLs. The resulting kernels compete with AMD's hand-optimized assembly for GEMMs and attention, consistently outperform compiler baselines, and in some settings (e.g., d=64 attention, GQA backwards, memory-bound kernels) outperform all available kernel baselines by 1.2x to 2.4x.

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2$-$2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.
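
For readers unfamiliar with the term, "explicit tile-based programming" can be illustrated with a small sketch in plain Python. This is not HipKittens or ThunderKittens code and does not reflect their APIs; the tile size `TILE` and the helpers `load_tile`, `mma_tile`, and `tiled_matmul` are hypothetical names chosen for illustration. The sketch only shows the general idea of computing a matrix product one fixed-size output tile at a time, loading input tiles and accumulating into a local tile before writing it back.

```python
# Illustrative only: tile-based matrix multiply in plain Python.
# NOT HipKittens/ThunderKittens code; all names here are hypothetical.

TILE = 2  # assumed tile edge; real kernels choose sizes that suit the hardware


def load_tile(m, row0, col0, size=TILE):
    """Copy the size x size block of m whose top-left corner is (row0, col0)."""
    return [row[col0:col0 + size] for row in m[row0:row0 + size]]


def mma_tile(acc, a, b):
    """Accumulate the product a @ b into acc; all three are square tiles."""
    n = len(acc)
    for i in range(n):
        for j in range(n):
            acc[i][j] += sum(a[i][k] * b[k][j] for k in range(n))


def tiled_matmul(a, b, size=TILE):
    """Multiply a by b, producing the result one output tile at a time."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert len(a[0]) == inner, "inner dimensions must match"
    assert rows % size == 0 and inner % size == 0 and cols % size == 0
    c = [[0] * cols for _ in range(rows)]
    for r in range(0, rows, size):            # output tile row
        for col in range(0, cols, size):      # output tile column
            acc = [[0] * size for _ in range(size)]  # local accumulator tile
            for k in range(0, inner, size):   # walk tiles along the inner dim
                mma_tile(acc, load_tile(a, r, k, size), load_tile(b, k, col, size))
            for i in range(size):             # write the finished tile back
                c[r + i][col:col + size] = acc[i]
    return c
```

On a GPU, the analogous loads, accumulations, and stores would map onto memory-hierarchy moves and matrix instructions; the sketch leaves all of that out and keeps only the loop structure over tiles.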
