CL LG | Mar 20, 2022

Delta Keyword Transformer: Bringing Transformers to the Edge through Dynamically Pruned Multi-Head Self-Attention

arXiv:2204.03479v1 | 0.87 citations | h-index: 42

Originality: Incremental advance

AI Analysis

This work addresses the problem of deploying Transformers on resource-constrained edge devices. The contribution is incremental because it builds on existing pruning techniques.

The authors tackle the high computational cost of Transformer self-attention on edge devices by proposing a dynamic pruning method, which removes about 80% of operations while maintaining the baseline 98.4% accuracy on keyword spotting.

Multi-head self-attention forms the core of Transformer networks. However, its quadratically growing complexity with respect to the input sequence length impedes deployment on resource-constrained edge devices. We address this challenge by proposing a dynamic pruning method, which exploits the temporal stability of data across tokens to reduce inference cost. The threshold-based method only retains significant differences between subsequent tokens, effectively reducing the number of multiply-accumulates, as well as the internal tensor data sizes. The approach is evaluated on the Google Speech Commands Dataset for keyword spotting, and the performance is compared against the baseline Keyword Transformer. Our experiments show that we can reduce ~80% of operations while maintaining the original 98.4% accuracy. Moreover, a reduction of ~87-94% of operations can be achieved when only degrading the accuracy by 1-4%, speeding up the multi-head self-attention inference by a factor of ~7.5-16.
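The threshold-based delta idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it covers only a single linear projection rather than full multi-head self-attention, and the function names `delta_encode` and `delta_linear`, the running-reference scheme, and the MAC counting are all assumptions made for illustration. It shows how dropping small token-to-token differences lets a projection be updated incrementally, so that only the retained differences cost multiply-accumulates.

```python
import numpy as np

def delta_encode(x, threshold):
    """Thresholded token-to-token deltas for x of shape (T, D).

    Illustrative helper (hypothetical name), not from the paper's code.
    """
    deltas = np.zeros_like(x)
    deltas[0] = x[0]                 # the first token is kept in full
    ref = x[0].copy()                # running reconstruction of the input
    for t in range(1, x.shape[0]):
        d = x[t] - ref               # change relative to what was last propagated
        d[np.abs(d) < threshold] = 0.0   # drop insignificant changes
        deltas[t] = d
        ref += d                     # only retained changes update the reference
    return deltas

def delta_linear(deltas, W):
    """Compute y[t] = x_hat[t] @ W incrementally and count MACs.

    x_hat is the cumulative sum of the deltas; zero deltas cost nothing.
    """
    T = deltas.shape[0]
    y = np.zeros((T, W.shape[1]))
    acc = np.zeros(W.shape[1])
    macs = 0
    for t in range(T):
        nz = np.nonzero(deltas[t])[0]        # indices of retained differences
        acc = acc + deltas[t, nz] @ W[nz]    # update only with those rows of W
        macs += nz.size * W.shape[1]
        y[t] = acc
    return y, macs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Slowly varying synthetic tokens (an assumption standing in for speech features).
    x = 1.0 + np.cumsum(0.01 * rng.normal(size=(20, 8)), axis=0)
    W = rng.normal(size=(8, 4))
    _, dense_macs = delta_linear(delta_encode(x, 0.0), W)
    _, pruned_macs = delta_linear(delta_encode(x, 0.05), W)
    print(dense_macs, pruned_macs)
```

Because the reference is updated only with retained deltas, the reconstruction error per element stays below the threshold instead of accumulating over the sequence. The threshold therefore trades accuracy against operation count, which mirrors the accuracy-versus-reduction trade-off reported in the abstract.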

