LG · AI · Nov 21, 2021

Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism

arXiv:2111.10770v1 · 20 citations
Originality: Incremental advance
AI Analysis

This work addresses a bottleneck in custom hardware acceleration for modern DNNs, offering a cost-efficient solution for attention-based models, though it is incremental as it builds on existing approximation techniques.

The paper tackles the problem of efficiently implementing softmax layers in deep neural networks with attention mechanisms. It proposes two LookUp Table-based approximation methods that, at 8-bit precision, keep accuracy loss below 1.0% across various AI tasks and models.

Custom hardware (HW) for accelerating the inference of deep neural networks (DNNs) has advanced rapidly. Previously, the softmax layer was not a main concern for DNN accelerator HW, because its share of the computation is relatively small in multi-layer perceptrons and convolutional neural networks. However, as attention mechanisms are now widely used in modern DNNs, a cost-efficient implementation of the softmax layer is becoming very important. In this paper, we propose two methods to approximate softmax computation, both based on LookUp Tables (LUTs). The required LUT size is quite small (about 700 Bytes) because the ranges of the softmax numerators and denominators are stable if normalization is applied to the input. We have validated the proposed technique on different AI tasks (object detection, machine translation, sentiment analysis, and semantic equivalence) and DNN models (DETR, Transformer, BERT) using a variety of benchmarks (COCO17, WMT14, WMT17, GLUE). We show that 8-bit approximation keeps the accuracy loss below $1.0\%$.
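The general idea of a table-driven softmax can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not specify either of its two approximations, so the following assumes a single 256-entry table of 8-bit exp values, a hypothetical clipping bound `X_MIN`, and ordinary division for the final step (the paper also handles the denominator with tables). It only shows why normalizing the input (here, subtracting the maximum) bounds the range the table has to cover.

```python
import math

# Hypothetical parameters chosen for illustration; none come from the paper.
X_MIN = -8.0      # normalized inputs below this are clipped
N_ENTRIES = 256   # one table entry per 8-bit index
SCALE = 255       # exp values are stored as unsigned 8-bit integers

STEP = -X_MIN / (N_ENTRIES - 1)

# Precomputed table: EXP_LUT[i] ~ exp(X_MIN + i * STEP) * SCALE, 256 bytes total.
EXP_LUT = bytes(round(math.exp(X_MIN + i * STEP) * SCALE) for i in range(N_ENTRIES))


def lut_softmax(logits):
    """Approximate softmax of a list of floats using an 8-bit exp table."""
    # Normalization: after subtracting the max, every input lies in (-inf, 0],
    # so the table only needs to cover a fixed, bounded range.
    m = max(logits)
    numerators = []
    for v in logits:
        x = max(v - m, X_MIN)               # clip to the table's range
        idx = round((x - X_MIN) / STEP)     # nearest table index, 0..255
        numerators.append(EXP_LUT[idx])
    # The max element maps to exp(0), so the sum is at least SCALE (never zero).
    total = sum(numerators)
    # Assumption: plain division here; a hardware design would avoid it.
    return [n / total for n in numerators]


if __name__ == "__main__":
    print(lut_softmax([1.0, 2.0, 3.0]))
```

In this sketch the approximation error comes from three sources: clipping at `X_MIN`, rounding the input to the nearest table index, and rounding the exp value to 8 bits. Widening the range or the table reduces the first two at the cost of more storage.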

