LG, AI, CV | Jun 5, 2021

Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding

arXiv:2106.02795v3 | 149 citations
Originality: Incremental advance
AI Analysis

This paper addresses the need for effective positional encoding in attention-based models that process sequences or images. Its learnable encoding is particularly suited to capturing spatial relationships, though the contribution is an incremental advance.

The paper tackles positional encoding for attention-based models such as Transformers. It proposes a learnable Fourier feature method that represents multi-dimensional positions as trainable encodings; on several benchmark tasks, the authors report improved accuracy and faster convergence.

Attentional mechanisms are order-invariant. Positional encoding is a crucial component to allow attention-based deep model architectures such as Transformer to address sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for a spatial multi-dimensional position, e.g., pixel positions on an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments based on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods by both improving the accuracy and allowing faster convergence.
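The pipeline the abstract describes (a position vector passed through a learnable Fourier feature mapping, then modulated by a multi-layer perceptron) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the layer sizes, the `gamma` initialization scale, the feature normalization, and the ReLU activation are all choices made here for illustration. NumPy has no automatic differentiation, so the sketch shows only the forward pass; in a real model every array in `params` would be updated by gradient descent.

```python
import numpy as np


def init_params(rng, pos_dim=2, fourier_dim=64, hidden_dim=32, out_dim=16, gamma=1.0):
    """Create the trainable parameters (all names and sizes are illustrative)."""
    return {
        # Projection for the Fourier mapping; Gaussian init with std 1/gamma (assumption).
        "W_r": rng.normal(0.0, 1.0 / gamma, size=(fourier_dim // 2, pos_dim)),
        # Two-layer MLP that modulates the Fourier features.
        "W1": rng.normal(0.0, 0.1, size=(fourier_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def fourier_features(params, positions):
    """Map positions of shape (N, pos_dim) to features of shape (N, fourier_dim)."""
    proj = positions @ params["W_r"].T  # (N, fourier_dim / 2)
    feats = np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)
    # Dividing by sqrt(fourier_dim / 2) gives every feature vector unit norm.
    return feats / np.sqrt(proj.shape[-1])


def positional_encoding(params, positions):
    """Fourier features followed by an MLP (ReLU is an assumption of this sketch)."""
    h = fourier_features(params, positions)
    h = np.maximum(h @ params["W1"] + params["b1"], 0.0)
    return h @ params["W2"] + params["b2"]


rng = np.random.default_rng(0)
params = init_params(rng)
pixels = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, -2.0]])  # (x, y) pixel positions
encodings = positional_encoding(params, pixels)  # shape (3, 16)
```

Because cos(a)cos(b) + sin(a)sin(b) = cos(a - b), the dot product of two Fourier feature vectors in this sketch depends only on the difference between the two positions. That property is one way to see why such a mapping is a natural starting point for capturing distance-like relationships between multi-dimensional positions.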

Code Implementations: 1 repo