AR · LG · Jul 21, 2024

Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation

arXiv:2407.15131v1 · 7 citations · h-index: 8
Originality: Incremental advance
AI Analysis

The paper addresses slow, energy-intensive text generation in AI applications. The advance is incremental because it builds on prior token-pruning methods.

The paper tackles the memory bottleneck of attention in text generation by estimating each token's probability before the softmax and pruning low-probability tokens. It reports a 12.1x pruning ratio, 2.6x fewer memory accesses, a 2.3x average speedup, and a 2.4x improvement in energy efficiency.

The attention mechanism in text generation is memory-bound due to its sequential characteristics. Therefore, off-chip memory accesses should be minimized for faster execution. Although previous methods addressed this by pruning unimportant tokens, they fall short in selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates the probability before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach shows 2.6x reduced memory accesses, leading to an average 2.3x speedup and a 2.4x improvement in energy efficiency.
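The idea of deciding on a token before the softmax is complete can be illustrated with a small software sketch. This is not the paper's hardware algorithm. It is a simplified illustration under stated assumptions: scores arrive one at a time, a running partial sum stands in for the softmax denominator, and the function names (`select_tokens`, `softmax`) and the threshold value are hypothetical. Because a partial denominator never exceeds the full one, the estimate is an upper bound on the true probability, so any token pruned here is guaranteed to fall below the threshold.

```python
import math


def softmax(scores):
    """Reference softmax, used only to check the pruning decisions."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(scores, threshold=1e-3):
    """Return indices of tokens whose probability upper bound reaches the threshold.

    Assumption: the bound for token i uses only scores 0..i, so the value
    vector of a pruned token would never need to be fetched from memory.
    """
    kept = []
    running_max = -math.inf
    running_sum = 0.0  # sum of exp(score - running_max) over tokens seen so far
    for i, s in enumerate(scores):
        if s > running_max:
            # Rescale the partial sum to the new maximum for numerical stability.
            running_sum = running_sum * math.exp(running_max - s) + 1.0
            running_max = s
        else:
            running_sum += math.exp(s - running_max)
        # Partial denominator <= full denominator, so this bounds the true probability from above.
        upper_bound = math.exp(s - running_max) / running_sum
        if upper_bound >= threshold:
            kept.append(i)
    return kept


scores = [2.0, 1.5, -9.0, 1.8, -12.0, 0.3]
print(select_tokens(scores))  # [0, 1, 3, 5]
```

In this toy input, the two tokens with strongly negative scores are dropped and their true softmax probabilities are indeed below the threshold. Early tokens get loose bounds (the first token's bound is always 1), which keeps the sketch conservative rather than aggressive.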

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
