CV | Dec 10, 2022

Position Embedding Needs an Independent Layer Normalization

Peking U
arXiv:2212.05262v2 | 1 citation | h-index: 24 | Has Code
AI Analysis

This work addresses a performance bottleneck in Vision Transformers for computer vision tasks, offering a simple and effective enhancement.

The paper tackles the limited expressiveness of position embeddings in Vision Transformers by proposing Layer-adaptive Position Embedding (LaPE), which applies independent layer normalizations to the token and position embeddings in each layer, yielding accuracy improvements such as 1.72% for DeiT on ImageNet-1K at minimal extra cost.

The Position Embedding (PE) is critical for Vision Transformers (VTs) due to the permutation-invariance of the self-attention operation. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization, we find that the default PE joining method (simply adding the PE and patch embedding together) applies the same affine transformation to the token embedding and the PE, which limits the expressiveness of PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method. Specifically, we provide two independent layer normalizations for token embeddings and PE for each layer, and add them together as the input of each layer's Multi-Head Self-Attention module. Since the method allows the model to adaptively adjust the information of PE for different layers, we name it Layer-adaptive Position Embedding, abbreviated as LaPE. Extensive experiments demonstrate that LaPE can improve various VTs with different types of PE and make VTs robust to PE types. For example, LaPE improves accuracy by 0.94% for ViT-Lite on CIFAR-10, 0.98% for CCT on CIFAR-100, and 1.72% for DeiT on ImageNet-1K, which is remarkable considering the negligible extra parameters, memory, and computational cost brought by LaPE. The code is publicly available at https://github.com/Ingrid725/LaPE.
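The contrast the abstract draws between the default joining method and LaPE can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation (see the linked repository for that). The function names, tensor sizes, and per-layer scale values are all hypothetical, and the sketch assumes the same PE tensor is fed to every layer and reuses one token tensor across layers for brevity.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize over the feature axis, then apply a learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def default_mhsa_input(tokens, pos, gamma, beta):
    """Default joining: PE is added first, so one affine map covers both terms."""
    return layer_norm(tokens + pos, gamma, beta)


def lape_mhsa_input(tokens, pos, gamma_x, beta_x, gamma_p, beta_p):
    """LaPE-style joining: separate layer norms for tokens and PE, then add."""
    return layer_norm(tokens, gamma_x, beta_x) + layer_norm(pos, gamma_p, beta_p)


rng = np.random.default_rng(0)
num_tokens, dim = 4, 8                        # illustrative sizes (assumption)
tokens = rng.normal(size=(num_tokens, dim))   # stand-in for token embeddings
pos = rng.normal(size=(num_tokens, dim))      # stand-in for a learnable PE

ones, zeros = np.ones(dim), np.zeros(dim)

# Each layer owns its PE scale (gamma_p), so it can weight position
# information differently. These values are arbitrary, not learned.
per_layer_pe_scale = [1.0, 0.5, 0.0]
lape_inputs = [
    lape_mhsa_input(tokens, pos, ones, zeros, scale * ones, zeros)
    for scale in per_layer_pe_scale
]

# Baseline for comparison: a single scale and shift is shared by tokens and PE.
baseline = default_mhsa_input(tokens, pos, ones, zeros)
print(baseline.shape, [x.shape for x in lape_inputs])
```

With a PE scale of 0.0 the layer's attention input reduces to the normalized tokens alone, while the baseline has no parameter that can change the PE contribution without also rescaling the token term. In a real model the scale and shift vectors would be trained, and the tokens would be updated by each encoder layer.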
