PL AR LG

Dec 19, 2025

Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs

Rupanshu Soi, Rohan Yadav, Fredrik Kjolstad, Alex Aiken, Maryam Mehri Dehnavi, Michael Garland, Michael Bauer

arXiv:2512.18134v1

5.14 citations

h-index: 15

Originality: Highly original

AI Analysis

The work targets inefficient GPU resource utilization, a problem faced by both programmers and compilers, and offers a heuristic-free, extensible solution. It is incremental in one respect: it builds on two existing transformations, software pipelining and warp specialization.

The paper tackles the challenge of optimally combining software pipelining and warp specialization for Tensor Core GPUs. It introduces a novel joint optimization formulation that is solved with constraint solvers. The resulting system, Twill, automatically derives optimal schedules and, by rediscovering them, proves the optimality of expert-designed schedules for Flash Attention on NVIDIA GPUs.

GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.
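To make the idea of scheduling as a constraint problem concrete, here is a minimal sketch of software pipelining (modulo scheduling) on a toy loop body. It is not Twill's formulation: it ignores warp specialization entirely and uses exhaustive search where the paper uses an off-the-shelf constraint solver. The operation names, resource names, and cycle counts are all hypothetical, chosen only to loosely resemble an attention inner loop.

```python
from itertools import product

# Hypothetical loop body (assumption, not taken from the paper).
# name: (resource, cycles for which the resource is held)
OPS = {
    "load": ("tma", 2),
    "qk": ("tensor", 2),
    "softmax": ("cuda", 3),
    "pv": ("tensor", 2),
}
# (producer, consumer, iteration distance); 0 means same iteration.
DEPS = [("load", "qk", 0), ("qk", "softmax", 0), ("softmax", "pv", 0)]


def feasible(starts, ii):
    """Check a candidate schedule against an initiation interval ii."""
    # Dependence constraints: a consumer starts only after its producer ends.
    for u, v, dist in DEPS:
        if starts[v] + ii * dist < starts[u] + OPS[u][1]:
            return False
    # Resource constraints: since a new iteration starts every ii cycles,
    # each (resource, cycle mod ii) slot may be claimed at most once.
    used = set()
    for name, (res, busy) in OPS.items():
        for k in range(busy):
            slot = (res, (starts[name] + k) % ii)
            if slot in used:
                return False
            used.add(slot)
    return True


def min_ii_schedule(horizon=12):
    """Return the smallest feasible ii and one schedule that achieves it."""
    names = list(OPS)
    for ii in range(1, horizon + 1):
        for combo in product(range(horizon), repeat=len(names)):
            starts = dict(zip(names, combo))
            if feasible(starts, ii):
                return ii, starts
    return None


ii, starts = min_ii_schedule()
print(ii, starts)
```

In this toy instance the two matrix multiplications share the `tensor` resource for four cycles in total, so no initiation interval below 4 can work, and the search confirms that 4 is achievable. Because every smaller interval is ruled out before a larger one is tried, the result is optimal by construction, which is the same kind of guarantee a solver-based approach provides at realistic problem sizes.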
