Preconditioned Attention: Enhancing Efficiency in Transformers
For transformer practitioners, preconditioned attention is a drop-in replacement for standard attention that addresses a fundamental optimization bottleneck, offering broad improvements without architectural changes.
Standard attention mechanisms in transformers produce ill-conditioned matrices that hinder optimization. Preconditioned attention reduces the condition number, improving training efficiency and achieving consistent gains across tasks like image classification, object detection, and language modeling.
Central to the success of transformers is the attention block, which effectively models global dependencies among the input tokens of a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers often produce ill-conditioned matrices, that is, matrices with large condition numbers. Ill-conditioning is a well-known obstacle for gradient-based optimizers and leads to inefficient training. To address this issue, we introduce preconditioned attention, a novel approach that incorporates a conditioning matrix into each attention head. Our theoretical analysis shows that this method significantly reduces the condition number of the attention matrices, which in turn improves optimization. Preconditioned attention serves as a simple drop-in replacement for a wide variety of attention mechanisms in the literature. We validate its effectiveness across a diverse set of transformer applications, including image classification, object detection, instance segmentation, long-sequence modeling, and language modeling.
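To make the terms concrete, the following is a minimal NumPy sketch of what the condition number of an attention matrix is (the ratio of its largest to smallest singular value) and of how multiplying by a conditioning matrix can change it. The abstract does not specify the form of the conditioning matrix, so the row-equilibration preconditioner used here, diag(1 / ||row_i||), is a generic textbook choice assumed for illustration only, not the paper's construction. The function names `softmax_attention` and `row_equilibrate`, and the sizes `n_tokens` and `d_head`, are likewise hypothetical.

```python
import numpy as np


def softmax_attention(Q, K):
    """Row-wise softmax of scaled dot-product scores: softmax(Q K^T / sqrt(d))."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def row_equilibrate(A, eps=1e-12):
    """Scale each row of A to unit 2-norm, i.e. left-multiply by diag(1 / ||row_i||).

    ASSUMPTION: a generic diagonal preconditioner chosen for illustration;
    it is not claimed to be the conditioning matrix defined in the paper.
    """
    row_norms = np.linalg.norm(A, axis=-1, keepdims=True)
    return A / (row_norms + eps)


# Illustrative sizes and random inputs (hypothetical, not from the paper).
rng = np.random.default_rng(0)
n_tokens, d_head = 16, 8
Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))

A = softmax_attention(Q, K)   # standard attention matrix
A_pre = row_equilibrate(A)    # same matrix after diagonal preconditioning

# np.linalg.cond defaults to the 2-norm condition number (sigma_max / sigma_min).
print("cond(A)     =", np.linalg.cond(A))
print("cond(A_pre) =", np.linalg.cond(A_pre))
```

Comparing the two printed values shows how a single diagonal rescaling changes the conditioning of one attention matrix. Row equilibration is not guaranteed to lower the condition number for every matrix, so this should be read as an illustration of the idea rather than a reproduction of the paper's theoretical result.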