PF AI LG OS Jan 22

Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

arXiv:2601.16032v2
h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks in attention kernels for Large Language Models, though it is incremental as it builds on existing Flash Attention methods.

The paper tackles L2 cache misses in CuTile-based Flash Attention on the NVIDIA GB10, reporting a 50% or greater reduction in L2 misses and up to a 60% increase in throughput.

High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
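The abstract does not describe the reordering itself, so the following is only a toy illustration of why traversal order can affect cache misses, not the paper's kernel. It assumes, hypothetically, that each query tile streams over all key/value tiles and that the L2 behaves like a small, fully associative LRU cache; the names `count_misses`, `cyclic_order`, and `alternating_order` are made up for this sketch. It compares a traversal that always restarts from the first key/value tile with one that reverses direction on every other query tile.

```python
from collections import OrderedDict


def count_misses(order, capacity):
    """Count misses of a fully associative LRU cache over a sequence of tile ids."""
    cache = OrderedDict()
    misses = 0
    for tile in order:
        if tile in cache:
            # Hit: mark the tile as most recently used.
            cache.move_to_end(tile)
        else:
            misses += 1
            cache[tile] = True
            if len(cache) > capacity:
                # Evict the least recently used tile.
                cache.popitem(last=False)
    return misses


def cyclic_order(num_q_tiles, num_kv_tiles):
    """Every query tile scans key/value tiles 0..N-1 in the same direction."""
    return [kv for _ in range(num_q_tiles) for kv in range(num_kv_tiles)]


def alternating_order(num_q_tiles, num_kv_tiles):
    """Odd-numbered query tiles scan key/value tiles in reverse (assumed pattern)."""
    order = []
    for q in range(num_q_tiles):
        kv_tiles = list(range(num_kv_tiles))
        if q % 2 == 1:
            kv_tiles.reverse()
        order.extend(kv_tiles)
    return order


# Hypothetical sizes: 8 query tiles, 16 key/value tiles, cache holds 8 tiles.
cyclic_misses = count_misses(cyclic_order(8, 16), capacity=8)
alternating_misses = count_misses(alternating_order(8, 16), capacity=8)
print(cyclic_misses, alternating_misses)
```

In this toy model the same-direction scan evicts each tile before it is needed again, while the direction-reversing scan starts each pass on the tiles that are still cached. The sizes and the LRU policy are assumptions chosen for illustration and say nothing about the measured GB10 results.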
