CL · AI · LG · Mar 30, 2022

Transformer Language Models without Positional Encodings Still Learn Positional Information

arXiv:2203.16634v2 · 370 citations
Originality: Incremental advance
AI Analysis

This is an incremental finding for the NLP community, suggesting that causal masks alone might provide positional awareness in language models.

The paper asks whether transformer language models need explicit positional encodings. It shows that models without them remain competitive across datasets, model sizes, and sequence lengths, and that they acquire implicit positional information through causal attention.

Causal transformer language models (LMs), such as GPT-3, typically require some form of positional encoding, such as positional embeddings. However, we show that LMs without any explicit positional encoding are still competitive with standard models, and that this phenomenon is robust across different datasets, model sizes, and sequence lengths. Probing experiments reveal that such models acquire an implicit notion of absolute positions throughout the network, effectively compensating for the missing information. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position. Our findings indicate that causal LMs might derive positional awareness not only from the explicit positioning mechanism, but also from the effects of the causal mask.
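The conjecture in the abstract, that causal attention lets a token infer how many predecessors it can attend to, can be illustrated with a toy sketch. This is not the paper's experimental setup. It assumes a single attention head with uniform scores and a hypothetical value scheme in which only the first token carries a nonzero value; the function names are invented for illustration.

```python
def causal_uniform_attention(values):
    """Average each position's value with those of all earlier positions.

    This mimics a causal attention head whose scores are all equal, so the
    softmax weights at position i are 1 / (i + 1) over positions 0..i.
    """
    outputs = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        outputs.append(running_sum / (i + 1))
    return outputs


def recover_positions(seq_len):
    # Assumption for illustration: the first token (for example a
    # start-of-sequence marker) carries value 1.0 and every other token
    # carries 0.0. No positional encoding is used anywhere.
    values = [1.0] + [0.0] * (seq_len - 1)
    attended = causal_uniform_attention(values)
    # Position i sees i + 1 tokens, so its output is 1 / (i + 1).
    # Inverting that quantity gives back the 0-based position.
    return [round(1.0 / a) - 1 for a in attended]


positions = recover_positions(8)
print(positions)
```

Under these assumptions the attended value alone encodes how many tokens are visible, so absolute position can be read off without any positional embedding. A trained model would have to learn a comparable signal rather than have it built in.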

Code Implementations: 1 repo
