AR · May 1

Sim-FA: A Simulator Frontend for Asynchronous Pipelines

arXiv:2605.00555
AI Analysis

For researchers in AI infrastructure and computer architecture, this work provides a more accurate simulation tool for asynchronous pipelines in LLM workloads, though it is an incremental improvement over existing methods.

The paper presents Sim-FA, a simulation pipeline for FlashAttention-3 that achieves 5.7% mean absolute percentage error and 12.7% maximum error against H800 hardware, addressing the lack of support for new GPU features like TMA in existing simulators.

To efficiently support Large Language Models (LLMs), modern GPGPU architectures have introduced new features and programming paradigms, such as warp specialization. These features enable temporal overlap between the producer and consumer, as well as between matrix multiplication and activation function operations, substantially improving performance. To conduct effective AI infrastructure and computer architecture research, cycle-accurate simulators that support these new features, together with analytical models that faithfully capture workload characteristics, are essential. However, existing academic tools provide limited support for these emerging requirements. Existing cycle-accurate simulators do not incorporate new NVIDIA GPU features, such as the Tensor Memory Accelerator (TMA), in a timely manner. Moreover, existing analytical models can misestimate DRAM traffic under certain configurations. In this paper, we build a simulation pipeline from FlashAttention-3 kernel instrumentation to cycle-accurate simulation. The simulator achieves a mean absolute percentage error (MAPE) of 5.7% and a maximum absolute percentage error of 12.7% against H800. We also provide a theoretical analysis of FlashAttention-3 and explain why existing analytical models can produce inaccurate traffic estimates.
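
The two accuracy figures quoted above (MAPE and maximum absolute percentage error) can be illustrated with a minimal sketch. The function names and the sample cycle counts below are hypothetical and chosen only for illustration; they are not taken from the paper, and the paper's own evaluation procedure may differ in detail.

```python
def absolute_percentage_errors(simulated, measured):
    """Per-sample error |simulated - measured| / |measured|, in percent.

    Assumes every measured value is non-zero (a zero would divide by zero).
    """
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    if not measured:
        raise ValueError("at least one sample is required")
    return [abs(s - m) / abs(m) * 100.0 for s, m in zip(simulated, measured)]


def mape(simulated, measured):
    """Mean absolute percentage error over all samples."""
    errors = absolute_percentage_errors(simulated, measured)
    return sum(errors) / len(errors)


def max_ape(simulated, measured):
    """Largest single-sample absolute percentage error."""
    return max(absolute_percentage_errors(simulated, measured))


# Hypothetical cycle counts for three kernel configurations (not from the paper).
measured_cycles = [1000.0, 2000.0, 4000.0]   # stand-in for hardware measurements
simulated_cycles = [1050.0, 1900.0, 4400.0]  # stand-in for simulator predictions

print(f"MAPE: {mape(simulated_cycles, measured_cycles):.2f}%")
print(f"Max APE: {max_ape(simulated_cycles, measured_cycles):.2f}%")
```

Reporting both numbers is useful because the mean can hide a single badly predicted configuration, while the maximum bounds the worst case across the tested set.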

