Yekai Pan

88.6PLApr 6Code

AutoLALA: Automatic Loop Algebraic Locality Analysis for AI and HPC Kernels

Yifan Zhu, Yekai Pan, Yanghui Wu et al.

Data movement is the primary bottleneck in modern computing systems. For loop-based programs common in high-performance computing (HPC) and AI workloads, including matrix multiplication, tensor contraction, stencil computation, and einsum operations, the cost of moving data through the memory hierarchy often exceeds the cost of arithmetic. This paper presents AutoLALA, an open-source tool that analyzes data locality in affine loop programs. The tool accepts programs written in a small domain-specific language (DSL), lowers them to polyhedral sets and maps, and produces closed-form symbolic formulas for reuse distance and data movement complexity. AutoLALA implements the fully symbolic locality analysis of Zhu et al. together with the data movement distance (DMD) framework of Smith et al. In particular, it computes reuse distance as the image of the access space under the access map, avoiding both stack simulation and Denning's recursive working-set formulation. We describe the DSL syntax and its formal semantics, the polyhedral lowering pipeline that constructs timestamp spaces and access maps via affine transformations, and the sequence of Barvinok counting operations used to derive symbolic reuse-interval and reuse-distance distributions. The system is implemented in Rust as a modular library spanning three crates, with safe bindings to the Barvinok library. We provide both a command-line interface and an interactive web playground with LaTeX rendering of the output formulas. The tool handles arbitrary affine loop nests, covering workloads such as tensor contractions, einsum expressions, stencil computations, and general polyhedral programs.

51.8PLMar 10

Fully Symbolic Analysis of Loop Locality: Using Imaginary Reuse to Infer Real Performance

Yifan Zhu, Yekai Pan, Chen Ding et al.

This paper presents a new theory of locality and its compiler support. The theory is fully symbolic and derives locality as polynomials, and the compiler analysis supports affine loop nests. They derive cache-performance scaling in quadratic and reciprocal expressions and are more general and precise than empirical scaling rules. Evaluated on a benchmark suite of 41 scientific kernels and tensor operations, the compiler requires an average of 41 seconds to derive the locality polynomials. After derivation, predicting the cache miss count for any given input size and cache configuration takes less than a millisecond. Across all tests--with and without loop fusion--the accuracy in the data movement prediction is 99.6\%, compared to simulated set-associative L1 data cache.

PFJan 22

Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

Yifan Zhu, Yekai Pan, Chen Ding

High-performance attention kernels are essential for Large Language Models. This paper presents analysis of CuTile-based Flash Attention memory behavior and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache miss. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing 50\% or greater reduction in L2 misses and up to 60\% increase in throughput on GB10.

Yekai Pan

3 Papers