CV | LG | Mar 20, 2024

Rotary Position Embedding for Vision Transformer

arXiv:2403.13298v2 | 218 citations | h-index: 38 | Has Code | ECCV
Originality: Incremental advance
AI Analysis

This work addresses performance bottlenecks in ViTs for computer vision applications, offering a practical method to improve backbone models with incremental gains.

The study applies Rotary Position Embedding (RoPE) to Vision Transformers (ViTs) and reports improvements on ImageNet-1k, COCO detection, and ADE-20k segmentation with minimal computational overhead.

Rotary Position Embedding (RoPE) performs remarkably well on language models, especially for length extrapolation of Transformers. However, the impact of RoPE on computer vision has been underexplored, even though RoPE appears capable of enhancing Vision Transformer (ViT) performance in a way similar to the language domain. This study provides a comprehensive analysis of RoPE when applied to ViTs, utilizing practical implementations of RoPE for 2D vision data. The analysis reveals that RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference. This ultimately leads to performance improvements on ImageNet-1k, COCO detection, and ADE-20k segmentation. We believe this study provides thorough guidelines for applying RoPE to ViT, promising improved backbone performance with minimal extra computational overhead. Our code and pre-trained models are available at https://github.com/naver-ai/rope-vit
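
To make the idea of RoPE for 2D vision data concrete, here is a minimal sketch of one common way to extend rotary embeddings to image patches: an axial layout in which half of the channel pairs are rotated by the patch's x coordinate and the other half by its y coordinate. This is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The function name `axial_rope_2d`, the channel layout, and the frequency base of 100 are all hypothetical choices made for this example.

```python
import math


def axial_rope_2d(vec, px, py, base=100.0):
    """Rotate a feature vector by angles derived from a 2D patch position.

    Assumptions (illustrative only): len(vec) is divisible by 4; the first
    half of the channel pairs encodes the x coordinate and the second half
    encodes the y coordinate; `base` is an arbitrary frequency base.
    """
    d = len(vec)
    if d % 4 != 0:
        raise ValueError("feature dimension must be divisible by 4")
    n_pairs = d // 2
    half = n_pairs // 2  # number of channel pairs assigned to each axis
    out = [0.0] * d
    for i in range(n_pairs):
        pos = px if i < half else py      # which coordinate drives this pair
        k = i % half                      # frequency index within the axis
        angle = pos * base ** (-k / half)
        c, s = math.cos(angle), math.sin(angle)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * c - b * s        # standard 2D rotation of the pair
        out[2 * i + 1] = a * s + b * c
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical query/key vectors for two patches.
q = [0.5, -1.0, 2.0, 0.25, 1.5, 0.0, -0.75, 1.0]
k = [1.0, 0.5, -0.5, 2.0, 0.0, 1.0, 1.25, -1.0]

# Both pairs of positions share the same relative offset (2, 3).
score_a = dot(axial_rope_2d(q, 3, 5), axial_rope_2d(k, 1, 2))
score_b = dot(axial_rope_2d(q, 13, 25), axial_rope_2d(k, 11, 22))
```

In this sketch, `score_a` and `score_b` agree up to floating-point error, because the rotated query-key dot product depends only on the relative offset between the two patches. That relative-position property is one plausible reason rotary embeddings remain well defined on patch grids larger than those seen in training, which is the resolution-extrapolation setting the abstract describes.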

