LG · Feb 11, 2025

ESPFormer: Doubly-Stochastic Attention with Expected Sliced Transport Plans

arXiv:2502.07962v2 · 11 citations · h-index: 21 · Has Code · ICML
Originality: Incremental advance
AI Analysis

This work addresses the efficiency and structure of attention mechanisms in Transformers, with improvements demonstrated on tasks in NLP and computer vision.

The paper tackles over-concentration in self-attention by introducing a doubly-stochastic attention method based on sliced optimal transport. The method avoids iterative Sinkhorn normalization and improves performance on benchmarks including image classification and machine translation.

While self-attention has been instrumental in the success of Transformers, it can lead to over-concentration on a few tokens during training, resulting in suboptimal information flow. Enforcing doubly-stochastic constraints in attention matrices has been shown to improve structure and balance in attention distributions. However, existing methods rely on iterative Sinkhorn normalization, which is computationally costly. In this paper, we introduce a novel, fully parallelizable doubly-stochastic attention mechanism based on sliced optimal transport, leveraging Expected Sliced Transport Plans (ESP). Unlike prior approaches, our method enforces doubly stochasticity without iterative Sinkhorn normalization, significantly enhancing efficiency. To ensure differentiability, we incorporate a temperature-based soft sorting technique, enabling seamless integration into deep learning models. Experiments across multiple benchmark datasets, including image classification, point cloud classification, sentiment analysis, and neural machine translation, demonstrate that our enhanced attention regularization consistently improves performance across diverse applications. Our implementation code can be found at https://github.com/dariansal/ESPFormer.
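To illustrate the core idea in the abstract, the following is a minimal NumPy sketch of a doubly-stochastic attention matrix built from one-dimensional transport plans. It is not the authors' implementation. It assumes that each feature axis serves as one slice, that slices are averaged uniformly, and that hard sorting (`np.argsort`) stands in for the temperature-based soft sorting the paper uses for differentiability. The function name `sliced_plan` and the array shapes are hypothetical.

```python
import numpy as np


def sliced_plan(q, k):
    """Average of 1D optimal transport plans, one per feature axis (slice).

    q, k: arrays of shape (n, d) holding n query and n key tokens.
    Returns an (n, n) matrix whose rows and columns each sum to 1.

    Assumptions (not taken from the paper): slices are the d feature axes,
    slices are weighted uniformly, and sorting is hard (non-differentiable).
    """
    n, d = q.shape
    plan = np.zeros((n, n))
    for j in range(d):
        # In 1D, the optimal plan between two uniform measures with the
        # same number of points matches points by rank.
        q_order = np.argsort(q[:, j])
        k_order = np.argsort(k[:, j])
        perm = np.zeros((n, n))
        perm[q_order, k_order] = 1.0  # rank-i query paired with rank-i key
        plan += perm
    # An average of permutation matrices is doubly stochastic.
    return plan / d


rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))

attn = sliced_plan(q, k)   # doubly-stochastic attention weights
out = attn @ v             # attention output, shape (6, 4)
```

Because every slice contributes a permutation matrix, the row and column sums are exactly 1 by construction, so no iterative normalization is needed. The per-slice computations are independent, which is what makes the approach parallelizable. The loop here is only for readability.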

