AR LG Sep 8, 2024

An Analog and Digital Hybrid Attention Accelerator for Transformers with Charge-based In-memory Computing

Ashkan Moradifirouzabadi, Divya Sri Dodla, Mingu Kang

arXiv:2409.04940v2, 4 citations, h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses the compute and memory-access cost of attention in hardware accelerators for Transformers, offering an incremental improvement through a hybrid analog-digital design.

The paper tackles the high computational and memory access burden of the attention mechanism in Transformers by presenting an analog and digital hybrid processor in 65nm CMOS technology, achieving peak energy efficiencies of 14.8 and 1.65 TOPS/W and peak area efficiencies of 976.6 and 79.4 GOPS/mm² in the analog core and SoC, respectively.

The attention mechanism is a key computing kernel of Transformers, calculating pairwise correlations across the entire input sequence. The computational complexity and frequent memory access involved in computing self-attention place a heavy burden on the system, especially as the sequence length increases. This paper presents an analog and digital hybrid processor in 65nm CMOS technology to accelerate the attention mechanism for Transformers. We propose an analog computing-in-memory (CIM) core, which prunes ~75% of low-score tokens on average during runtime at ultra-low power and delay. Additionally, a digital processor performs precise computations only for the ~25% of tokens left unpruned by the analog CIM core, preventing accuracy degradation. Measured results show peak energy efficiencies of 14.8 and 1.65 TOPS/W, and peak area efficiencies of 976.6 and 79.4 GOPS/mm², in the analog core and the system-on-chip (SoC), respectively.
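The two-stage flow described in the abstract (a cheap, imprecise scoring pass that prunes low-score tokens, followed by exact attention over the survivors) can be illustrated in software. The sketch below is only a rough functional analogy, not the paper's circuit or algorithm: the names `quantize` and `hybrid_attention`, the 4-bit quantization used as a stand-in for analog imprecision, and the fixed `keep_ratio` of 0.25 are all assumptions made for illustration. The abstract reports ~75% pruning only as a runtime average, so the real selection rule may well differ from a fixed top-k.

```python
import math


def quantize(vec, bits=4):
    """Coarsely quantize a vector to signed `bits`-bit levels.

    Hypothetical stand-in for the limited precision of an analog core.
    """
    peak = max(abs(x) for x in vec) or 1.0
    levels = 2 ** (bits - 1) - 1
    return [round(x / peak * levels) / levels * peak for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hybrid_attention(query, keys, values, keep_ratio=0.25, bits=4):
    """Attention for one query with approximate pruning, then exact compute.

    Returns (output_vector, kept_token_indices).
    """
    # Stage 1 (analog stand-in): approximate scores from quantized operands.
    q_lo = quantize(query, bits)
    approx = [dot(q_lo, quantize(k, bits)) for k in keys]

    # Keep the highest-scoring tokens. A fixed ratio is an assumption here.
    n_keep = max(1, math.ceil(keep_ratio * len(keys)))
    ranked = sorted(range(len(keys)), key=lambda i: approx[i], reverse=True)
    kept = sorted(ranked[:n_keep])

    # Stage 2 (digital stand-in): exact scaled softmax over kept tokens only.
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, keys[i]) * scale for i in kept]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    dim = len(values[0])
    out = [
        sum(w * values[i][d] for w, i in zip(weights, kept)) / total
        for d in range(dim)
    ]
    return out, kept
```

With 16 tokens and the default `keep_ratio`, only 4 tokens reach the exact stage, so the full-precision dot products and softmax run on a quarter of the sequence. Setting `keep_ratio=1.0` disables pruning and reduces the function to ordinary softmax attention, which gives a baseline for checking how much the pruning changes the output.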
