LG · Jun 12, 2024

Learning interpretable positional encodings in transformers depends on initialization

Takuya Ito, Luca Cocchi, Tim Klinger, Parikshit Ram, Murray Campbell, Luke Hearne

arXiv:2406.08272v4 · 6.43 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of applying transformers to real-world datasets with non-trivial positional structure, such as multi-dimensional positions or positions with no known ground truth. Its contribution is incremental, focusing on the effect of initialization.

The study tackles the problem of learning interpretable positional encodings (PEs) in transformers for tasks with complex positional arrangements. It finds that initializing a learnable PE from a small-norm distribution enables it to learn interpretable encodings that improve generalization.

In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order among tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real-world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known. Here we find that the choice of initialization of a learnable PE greatly influences whether it learns an interpretable encoding that leads to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) a 2D relational reasoning task; 2) a nonlinear stochastic network simulation; 3) a real-world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions in multiple dimensions, and 2) lead to improved generalization. These results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.
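
As a rough illustration of what "initialized from a small-norm distribution" could mean in practice, the sketch below draws a learnable PE table from a zero-mean Gaussian and compares a small and a large standard deviation. It is a minimal sketch, not the authors' implementation: the function names, the Gaussian choice, and the values `sigma=0.01` and `sigma=1.0` are assumptions made for illustration, and no training is performed.

```python
import math
import random


def init_learnable_pe(num_positions, dim, sigma, seed=0):
    """Draw a (num_positions x dim) PE table from N(0, sigma^2).

    Hypothetical helper: the Gaussian and the sigma values used below
    are illustrative assumptions, not settings taken from the paper.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)]
            for _ in range(num_positions)]


def mean_row_norm(table):
    """Average Euclidean norm of the per-position encoding vectors."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in table) / len(table)


def add_pe(token_embeddings, pe_table):
    """Add each position's encoding to the token embedding at that position."""
    return [[t + p for t, p in zip(tok, pe)]
            for tok, pe in zip(token_embeddings, pe_table)]


# Two initial PE tables that differ only in the scale of the distribution.
small_pe = init_learnable_pe(num_positions=16, dim=32, sigma=0.01)
large_pe = init_learnable_pe(num_positions=16, dim=32, sigma=1.0)

# Placeholder token embeddings (all zeros) so the PE contribution is visible.
tokens = [[0.0] * 32 for _ in range(16)]
inputs_small = add_pe(tokens, small_pe)

print("mean PE norm, small init:", mean_row_norm(small_pe))
print("mean PE norm, large init:", mean_row_norm(large_pe))
```

In a real model the PE table would be a trainable parameter updated by gradient descent. The sketch only shows the starting point: with a small-norm initialization the encodings begin close to zero, so the token content dominates the input at the start of training.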
