LG Jun 12, 2024

Learning interpretable positional encodings in transformers depends on initialization

arXiv:2406.08272v4, 3 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of applying transformers to real-world datasets with non-trivial positional structure, such as multi-dimensional positions or positions whose ground truth is unknown. Its contribution is incremental, focusing on the effect of initialization.

The study tackles the problem of learning interpretable positional encodings (PEs) in transformers for tasks with complex positional arrangements. It finds that initializing a learnable PE from a small-norm distribution enables the model to learn interpretable PEs that improve generalization.

In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order among tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real-world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known. Here we find that the choice of initialization of a learnable PE greatly influences its ability to learn interpretable PEs that lead to enhanced generalization. We empirically demonstrate our findings in three experiments: 1) a 2D relational reasoning task; 2) a nonlinear stochastic network simulation; 3) a real-world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions in multiple dimensions, and 2) lead to improved generalization. These results illustrate the feasibility of learning identifiable and interpretable PEs for enhanced generalization.
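As a rough illustration of what "initialized from a small-norm distribution" can mean in practice, the sketch below draws a learnable PE table from a zero-mean Gaussian with a small standard deviation and compares its norm with a unit-variance initialization. It is a minimal sketch and not the authors' implementation: the function names, the table shape, and the values `sigma=0.01` and `sigma=1.0` are assumptions chosen for illustration, and the training loop that would update the table is omitted.

```python
import math
import random


def init_positional_encoding(num_positions, dim, sigma, seed=0):
    """Return a num_positions x dim table with entries drawn i.i.d. from N(0, sigma^2).

    In a real model this table would be a trainable parameter; here it is a
    plain list of lists so the sketch has no dependencies. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(num_positions)]


def mean_row_norm(table):
    """Average Euclidean norm of the per-position encoding vectors."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in table) / len(table)


def add_positional_encoding(token_embeddings, pe):
    """Add the PE to the token embeddings elementwise, one PE row per position."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(token_embeddings, pe)]


# Illustrative values only: sigma=0.01 stands in for a "small-norm" initialization.
small_pe = init_positional_encoding(num_positions=16, dim=32, sigma=0.01)
large_pe = init_positional_encoding(num_positions=16, dim=32, sigma=1.0)

print("small-norm init, mean row norm:", mean_row_norm(small_pe))
print("unit-variance init, mean row norm:", mean_row_norm(large_pe))
```

With a small `sigma`, the PE starts close to zero, so at initialization the input to the first layer is dominated by the token embeddings and any positional structure has to emerge through training. Which scale counts as "small" for a given model is an empirical question that this sketch does not settle.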
