CL, AI, LG · Jun 3, 2021

The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models

arXiv:2106.01950v1715 citations
AI Analysis

This paper addresses the need for efficient and interpretable positional encoding in language models, offering a novel method that reduces the positional parameter count while maintaining or improving performance.

The paper tackles the problem of encoding positional information in transformer-based language models by analyzing existing position embeddings and finding evidence of translation invariance, which correlates with performance. It proposes translation-invariant self-attention (TISA), which improves ALBERT on GLUE tasks while adding significantly fewer positional parameters.

Mechanisms for encoding positional information are central to transformer-based language models. In this paper, we analyze the position embeddings of existing language models, finding strong evidence of translation invariance, both for the embeddings themselves and for their effect on self-attention. The degree of translation invariance increases during training and correlates positively with model performance. Our findings lead us to propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion without needing conventional position embeddings. Our proposal has several theoretical advantages over existing position-representation approaches. Experiments show that it improves on regular ALBERT on GLUE tasks while adding orders of magnitude fewer positional parameters.
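
The core idea, a positional term in self-attention that depends only on the relative offset between two tokens, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the positional score is a sum of Gaussian-shaped kernels over the offset j - i, and the function names, kernel values, and zero content logits below are all hypothetical choices made for illustration.

```python
import math

def positional_bias(offset, kernels):
    """Score for a relative offset (j - i): a sum of Gaussian-shaped bumps.

    Each kernel is a hypothetical (amplitude, sharpness, centre) triple,
    so the positional parameter count is 3 * len(kernels).
    """
    return sum(a * math.exp(-abs(b) * (offset - c) ** 2) for a, b, c in kernels)

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def translation_invariant_attention(content_logits, kernels):
    """Add an offset-only positional bias to content logits, then normalise."""
    n = len(content_logits)
    weights = []
    for i in range(n):
        row = [content_logits[i][j] + positional_bias(j - i, kernels)
               for j in range(n)]
        weights.append(softmax(row))
    return weights

# Illustrative values only (assumptions, not taken from the paper).
kernels = [(2.0, 0.5, 0.0), (1.0, 0.1, -2.0)]
n = 5
content_logits = [[0.0] * n for _ in range(n)]  # no content signal
attn = translation_invariant_attention(content_logits, kernels)
```

Because the bias is a function of j - i alone, shifting both positions by the same amount leaves it unchanged, which is the translation-invariance property described in the abstract. The few numbers per kernel are the only positional parameters in this sketch, in contrast to one learned embedding vector per absolute position.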
