LG | CL | Nov 5, 2024

LASER: Attention with Exponential Transformation

arXiv:2411.03493v2 | 3 citations | h-index: 4 | ICML
AI Analysis

This work addresses a training-efficiency bottleneck for Transformers across domains such as vision, text, and speech, but it is incremental in that it modifies existing attention mechanisms.

The paper tackles the problem of small gradients in softmax-based attention in Transformers, which can lead to inefficient learning. It introduces LASER, a new attention mechanism that is shown analytically to admit a larger gradient signal, and reports average improvements of up to 1.44% on downstream evaluations and 1.65% in fine-tuning for large language models.
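To make the small-gradient problem concrete, the following is a minimal sketch in plain Python, not taken from the paper. It uses the standard softmax Jacobian, J[i][j] = p_i (δ_ij − p_j), and compares its largest entry for a flat score vector and a sharply peaked one. The score values and helper names are illustrative assumptions.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(probs):
    # Standard result: d p_i / d s_j = p_i * (delta_ij - p_j).
    n = len(probs)
    return [
        [probs[i] * ((1.0 if i == j else 0.0) - probs[j]) for j in range(n)]
        for i in range(n)
    ]

def max_abs_entry(matrix):
    return max(abs(x) for row in matrix for x in row)

# Illustrative scores (assumptions): uniform attention vs. near one-hot attention.
flat_scores = [0.0, 0.0, 0.0, 0.0]
peaked_scores = [12.0, 0.0, 0.0, 0.0]

flat_grad = max_abs_entry(softmax_jacobian(softmax(flat_scores)))
peaked_grad = max_abs_entry(softmax_jacobian(softmax(peaked_scores)))

print("largest Jacobian entry, flat scores:  ", flat_grad)
print("largest Jacobian entry, peaked scores:", peaked_grad)
```

When one score dominates, every Jacobian entry carries a factor that is close to zero, so little gradient flows back to the parameters that produced the scores. This is one way the effect can arise; the paper's own analysis may be framed differently.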

Transformers have had a tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 7.7 billion parameters and observe an average improvement of up to 1.44% over standard attention on downstream evaluations and fine-tuning improvements of 1.65%. Additionally, LASER demonstrates generalization performance improvement across a variety of tasks (vision, text and speech): Vision Transformer (ViT) on ImageNet, Conformer on the LibriSpeech speech-to-text task, and BERT with 2.2 billion parameters.
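The abstract does not spell out the LASER formula. As a hedged sketch, the code below assumes, from the title's "exponential transformation", that the output takes the form log(softmax(QKᵀ/√d) · exp(V)): attention is applied to exponentiated values and the logarithm is taken afterwards. The per-column max subtraction is an added assumption for numerical stability, and all function names are hypothetical. Plain Python lists stand in for tensors.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    # Scaled dot-product scores, one softmax per query row.
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

def standard_attention(queries, keys, values):
    weights = attention_weights(queries, keys)
    dim = len(values[0])
    return [
        [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(dim)]
        for w in weights
    ]

def laser_attention(queries, keys, values):
    # ASSUMED form: log(softmax(Q K^T / sqrt(d)) @ exp(V)).
    # Subtracting each column's max before exp avoids overflow; it is added back after the log.
    dim = len(values[0])
    col_max = [max(v[c] for v in values) for c in range(dim)]
    exp_values = [[math.exp(v[c] - col_max[c]) for c in range(dim)] for v in values]
    mixed = standard_attention(queries, keys, exp_values)  # reuse the existing kernel
    return [[math.log(row[c]) + col_max[c] for c in range(dim)] for row in mixed]

# Illustrative inputs (assumptions): 2 queries, 3 keys/values, dimension 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -1.0], [2.0, 0.0], [-0.3, 1.5]]

print("standard:", standard_attention(Q, K, V))
print("laser:   ", laser_attention(Q, K, V))
```

Under this assumed form, the change amounts to wrapping an existing attention routine with an elementwise exp on the values and a log on the output, which is consistent with the abstract's claim that only small modifications are needed.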

