LG PF Jul 18, 2024

Attention in SRAM on Tenstorrent Grayskull

arXiv:2407.13885v1 | 8 citations | h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work provides incremental improvements in hardware-specific optimization for AI accelerators, targeting developers and researchers using Tenstorrent devices for efficient attention computations.

The paper accelerates Transformer self-attention on the Tenstorrent Grayskull architecture by implementing a fused kernel and a dedicated Softmax kernel that exploit its large SRAM. The dedicated Softmax kernel is up to 10x faster than the CPU baseline, and the Softmax inside the fused kernel is approximately 1.8x faster than the dedicated Softmax kernel.

When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10 \times$, and the Softmax implementation inside the fused kernel is approximately $1.8 \times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations are quadratic in sequence length. Currently, the Grayskull e150 is approximately $30 \times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5 \times$ more SRAM.
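The three operations the abstract names (matrix multiplication of queries and keys, scaling of the attention scores, and Softmax) can be illustrated with a small CPU reference sketch. This is not the paper's Grayskull kernel or its CPU baseline: the function names are hypothetical, the scale factor $1/\sqrt{d}$ is the standard Transformer choice and is assumed here, and the max-subtraction inside the Softmax is a common numerical-stability step that the abstract does not confirm. The sketch processes one query row at a time to mimic fusing the three steps, yet the result is still an $n \times n$ matrix, which is where the quadratic time and memory cost comes from.

```python
import math


def softmax_row(row):
    """Numerically stable Softmax over one row of scores."""
    m = max(row)  # subtract the row maximum before exponentiating (assumed, common practice)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Return softmax(Q K^T / sqrt(d)) for n x d lists of lists.

    Each query row goes through all three steps (dot products, scaling,
    Softmax) before the next row is touched, as a loose analogue of a
    fused kernel. The output has n * n entries.
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)  # standard attention scaling, assumed here
    weights = []
    for q_row in q:
        # Step 1: one row of Q K^T (dot product with every key).
        scores = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Step 2: scale the scores. Step 3: normalize with Softmax.
        weights.append(softmax_row([s * scale for s in scores]))
    return weights


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    w = attention_weights(q, k)
    print(len(w), "x", len(w[0]))             # shape of the weight matrix
    print([round(sum(r), 6) for r in w])      # each row should sum to 1
```

Doubling the sequence length in this sketch quadruples both the number of dot products and the number of stored weights, which matches the quadratic complexity stated in the abstract.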

Code Implementations: 1 repo
