AR | LG | Jul 7, 2023

ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers

arXiv:2307.03493v2 | 25 citations | h-index: 107
AI Analysis

This work addresses energy and area efficiency for transformer inference in embedded systems, representing an incremental improvement over existing accelerators.

The paper tackles the challenge of efficiently accelerating transformer models for embedded systems by proposing ITA, an accelerator that uses 8-bit quantization and an integer-only softmax implementation, achieving 16.9 TOPS/W energy efficiency and 5.93 TOPS/mm² area efficiency.

Transformer networks have emerged as the state-of-the-art approach for natural language processing tasks and are gaining popularity in other domains such as computer vision and audio processing. However, the efficient hardware acceleration of transformer models poses new challenges due to their high arithmetic intensities, large memory requirements, and complex dataflow dependencies. In this work, we propose ITA, a novel accelerator architecture for transformers and related models that targets efficient inference on embedded systems by exploiting 8-bit quantization and an innovative softmax implementation that operates exclusively on integer values. By computing on-the-fly in streaming mode, our softmax implementation minimizes data movement and energy consumption. ITA achieves competitive energy efficiency with respect to state-of-the-art transformer accelerators with 16.9 TOPS/W, while outperforming them in area efficiency with 5.93 TOPS/mm$^2$ in 22 nm fully-depleted silicon-on-insulator technology at 0.8 V.
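To make the idea of an integer-only softmax computed in streaming mode more concrete, here is a minimal Python sketch. It is not ITA's actual datapath: the base-2 approximation (replacing exponentials with right shifts), the function name `streaming_int_softmax`, and the parameters `chunk`, `shift_bits`, `acc_bits`, and `out_bits` are all assumptions chosen for illustration. The sketch only shows how a running maximum and a running denominator can be updated chunk by chunk without floating-point arithmetic.

```python
def streaming_int_softmax(logits, chunk=2, shift_bits=3, acc_bits=16, out_bits=8):
    """Illustrative integer-only softmax approximation (hypothetical, not ITA's design).

    Each weight is approximated as 2^(-((max - x) >> shift_bits)), so the
    exponential becomes a right shift. Logits arrive in chunks; the running
    maximum and the running denominator are updated as each chunk streams in.
    """
    one = 1 << acc_bits          # fixed-point representation of 1.0
    run_max = -128               # smallest int8 value
    denom = 0                    # running sum of shifted weights

    # Pass 1: stream over the logits, tracking the maximum and the denominator.
    for start in range(0, len(logits), chunk):
        part = logits[start:start + chunk]
        new_max = max(run_max, max(part))
        # If the maximum grew, rescale the partial sum accumulated so far.
        denom >>= (new_max - run_max) >> shift_bits
        run_max = new_max
        for x in part:
            denom += one >> ((run_max - x) >> shift_bits)

    # Pass 2: normalize each weight to an unsigned out_bits-wide integer.
    scale = (1 << out_bits) - 1
    return [((one >> ((run_max - x) >> shift_bits)) * scale) // denom
            for x in logits]


probs = streaming_int_softmax([0, 8, 16, 24])
print(probs)  # [17, 34, 68, 136]
```

With `shift_bits=3`, every step of 8 in the quantized logits halves the weight, so the four outputs above double from one entry to the next and sum to 255. Rescaling the partial sum with a shift is itself an approximation: when logit differences are not multiples of `2**shift_bits`, the result can differ slightly from a non-streaming computation.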

