PFApr 5

Shortest-Path FFT: Optimal SIMD Instruction Scheduling via Graph Search

arXiv:2604.043119.8
Predicted impact top 53% in PF · last 90 daysOriginality Highly original
AI Analysis

This addresses the optimization of SIMD instruction scheduling for FFT computations, particularly for hardware like Apple M1, though it is incremental as it builds on prior work like FFTW.

The paper tackles the problem of finding the fastest implementation of an FFT by modeling it as a shortest-path problem on a directed acyclic graph, introducing context-aware and context-free variants, and achieves a 34% speed improvement over context-free methods on Apple M1 NEON, reaching 29.8 GFLOPS.

An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the \emph{context-free} model, nodes represent computation stages and edge weights are independently measured instruction costs. In the \emph{context-aware} model, nodes are expanded to encode the \emph{predecessor edge type}, so that edge weights capture inter-operation correlations such as cache warming -- the cost of operation~B depends on which operation~A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW \citep{FrigoJohnson1998}: that optimal-substructure assumptions break down ``because of the different states of the cache.'' Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1~GFLOPS (74\% of optimal). The context-aware Dijkstra discovers $\text{R4} \to \text{R2} \to \text{R4} \to \text{R4} \to \text{Fused-8}$ at 29.8~GFLOPS -- a $5.2\times$ improvement over pure radix-2 and 34\% faster than the context-free result. This arrangement includes a radix-2 pass \emph{sandwiched between} radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes