AR AIMay 20

SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling

Jeongmin Jin, Kyeongwon Lee, Mundo Jeong, Jongin Choi, Woojoo Lee

arXiv:2605.2401639.3

Predicted impact top 44% in AR · last 90 daysOriginality Incremental advance

AI Analysis

For edge deployment of diffusion models, this work provides a hardware accelerator that makes a previously inefficient kernel practical, though the problem is domain-specific.

The paper introduces SA-Kura, the first digital systolic-array accelerator for locally-coupled Kuramoto drift in diffusion sampling, achieving 193x latency reduction and 69.4x energy reduction over software, and 6.57x speedup with 46.0x lower energy per pixel compared to a Jetson Orin Nano GPU.

Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standard drift is merely a trivial linear scaling. Kuramoto orientation diffusion replaces this trivial drift with locally coupled phase interactions, improving sampling efficiency but introducing a new hardware bottleneck: a center-dependent nonlinear 5 x 5 stencil evaluated at every reverse step. This kernel maps poorly to conventional CNN accelerators and matrix-oriented engines. We present SA-Kura, to our knowledge the first digital systolic-array accelerator dedicated to locally coupled Kuramoto drift. By reformulating pair-wise sinusoidal coupling into neighbor accumulation independent of the center phase followed by a single center-dependent multiply-subtract combination, SA-Kura eliminates in-PE transcendental units and enables regular systolic execution with register-level reuse. SA-Kura was implemented in synthesizable RTL, integrated into a lightweight RISC-V-based SoC, prototyped on FPGA, and evaluated through 45 nm CMOS synthesis and power analysis. For the drift kernel only, compared with software execution of the same kernel on the processor core in the same SoC platform, SA-Kura reduces latency and energy by 193x and 69.4x, respectively. Compared with a standalone Jetson Orin Nano CUDA implementation of the same kernel, it is 6.57x faster and achieves approximately 46.0x lower energy per pixel.

View on arXiv PDF

Similar