SCALE-Sim TPU: Validating and Extending SCALE-Sim for TPUs
This work addresses the need for more accurate and practical performance analysis tools among researchers and engineers working with TPU accelerators. The contribution is incremental, since it builds on an existing simulator (SCALE-Sim v3).
The paper addresses the weak hardware validation and limited coverage of cycle-accurate simulators for systolic accelerators by developing SCALE-Sim TPU, a validated and extended version of SCALE-Sim v3 for TPU-style accelerators. Simulated systolic GEMM cycle counts correlate strongly and linearly with measured Google TPU v4 latency, learned latency models for non-systolic elementwise operations reach median relative errors below 3%, and a StableHLO-based frontend lets workloads from JAX and PyTorch be simulated directly.
Cycle-accurate simulators are widely used to study systolic accelerators, yet their accuracy and usability are often limited by weak validation against real hardware and poor integration with modern ML compiler stacks. This paper presents SCALE-Sim TPU, a validated and extended version of SCALE-Sim v3 for TPU-style accelerators. Specifically, we make three contributions: (1) We validate SCALE-Sim's systolic GEMM model against measurements on Google TPU v4 and show that simulated cycle counts exhibit a strong linear correlation with hardware latency, enabling a simple cycle-to-latency mapping. (2) We introduce lightweight learned latency models for non-systolic elementwise operations, achieving median relative errors below 3 percent using only tensor size and shape, substantially improving end-to-end latency estimation. (3) We integrate a StableHLO-based frontend that allows workloads from modern ML frameworks such as JAX and PyTorch to be simulated directly via a unified compiler IR. Together, these contributions improve the fidelity, coverage, and practicality of cycle-accurate simulation for whole-model performance analysis on TPUs.
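The "simple cycle-to-latency mapping" in contribution (1) can be pictured as a straight-line fit from simulated cycle counts to measured latency. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function names `fit_cycle_to_latency` and `median_relative_error` are hypothetical, the fit is plain ordinary least squares, and the data points are synthetic placeholders constructed to lie on a line, not TPU v4 measurements.

```python
from statistics import median


def fit_cycle_to_latency(cycles, latencies_us):
    """Ordinary least-squares fit of latency_us = slope * cycles + intercept.

    Hypothetical helper for illustration; the paper does not specify
    how its cycle-to-latency mapping is fitted.
    """
    n = len(cycles)
    if n < 2 or n != len(latencies_us):
        raise ValueError("need at least two paired samples")
    mean_c = sum(cycles) / n
    mean_l = sum(latencies_us) / n
    var_c = sum((c - mean_c) ** 2 for c in cycles)
    if var_c == 0:
        raise ValueError("cycle counts must not all be identical")
    cov = sum((c - mean_c) * (l - mean_l) for c, l in zip(cycles, latencies_us))
    slope = cov / var_c
    intercept = mean_l - slope * mean_c
    return slope, intercept


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured over paired samples."""
    return median(abs(p - m) / m for p, m in zip(predicted, measured))


# Synthetic placeholder data (assumption): simulated GEMM cycle counts and
# latencies in microseconds that lie on a line by construction.
sim_cycles = [1000, 2000, 4000, 8000]
measured_us = [3.0, 5.0, 9.0, 17.0]

slope, intercept = fit_cycle_to_latency(sim_cycles, measured_us)
predicted_us = [slope * c + intercept for c in sim_cycles]
error = median_relative_error(predicted_us, measured_us)
```

With real data, `sim_cycles` would come from the simulator and `measured_us` from hardware profiling. The same median-relative-error metric is the one the abstract reports for the learned elementwise models, although those models take tensor size and shape as inputs rather than cycle counts.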