TiledAttention: a CUDA Tile SDPA Kernel for PyTorch
TiledAttention is a practical tool for SDPA research on NVIDIA GPUs, balancing performance and customizability for researchers working on attention mechanisms.
The paper introduces TiledAttention, a CUDA tile kernel for scaled dot-product attention in PyTorch. The kernel is easier to modify than low-level CUDA templates, retains realistic behavior through online softmax and tiled $K,V$ streaming, and achieves large speedups over standard eager attention paths in benchmarks on an NVIDIA DGX GB10 node.
TiledAttention is a scaled dot-product attention (SDPA) forward operator for SDPA research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic behavior via online softmax and tiled $K,V$ streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and is available for direct use within PyTorch workflows, providing a practical balance between performance and customizability.
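The online softmax and tiled $K,V$ streaming mentioned above can be illustrated with a minimal sketch in plain Python. This is not the TiledAttention kernel and does not use cuTile or PyTorch; the function names, the single-query-row setup, and the `tile_size` parameter are assumptions made for illustration. It only shows the general technique: keys and values are visited one tile at a time while a running maximum, a running denominator, and an unnormalized accumulator are rescaled, so the full score row is never materialized.

```python
import math


def reference_attention_row(q, keys, values):
    """Unfused baseline: materialize every score, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / total
        for j in range(len(values[0]))
    ]


def streamed_attention_row(q, keys, values, tile_size=2):
    """Online softmax over K/V tiles (hypothetical helper, one query row)."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf        # largest score seen so far
    running_sum = 0.0              # softmax denominator relative to running_max
    acc = [0.0] * len(values[0])   # unnormalized output, same reference point

    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new reference maximum.
        # On the first tile this factor is exp(-inf) == 0.0.
        correction = math.exp(running_max - new_max)
        probs = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(probs)
        acc = [
            a * correction + sum(p * v[j] for p, v in zip(probs, v_tile))
            for j, a in enumerate(acc)
        ]
        running_max = new_max

    # Normalize once, after the last tile.
    return [a / running_sum for a in acc]
```

For any tile size, `streamed_attention_row` should agree with `reference_attention_row` up to floating-point rounding. A GPU kernel would apply the same update to a block of query rows at once, with tile shapes chosen to fit on-chip memory.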