CV, Nov 9, 2023

Window Attention is Bugged: How not to Interpolate Position Embeddings

arXiv:2311.05613v1, 21 citations, h-index: 36
Originality: Incremental advance
AI Analysis

This paper addresses a performance bug in widely used transformer components for computer vision, offering a simple fix that improves state-of-the-art models.

The paper identifies a bug where interpolating position embeddings in window attention harms performance in vision transformers, and fixes it with a simple absolute window position embedding strategy, achieving 61.7 box mAP on COCO for models with ImageNet-1k pretraining.

Window attention, position embeddings, and high resolution finetuning are core concepts in the modern transformer era of computer vision. However, we find that naively combining these near ubiquitous components can have a detrimental effect on performance. The issue is simple: interpolating position embeddings while using window attention is wrong. We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining. This all stems from what is essentially a 3 line bug fix, which we name "absolute win".
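To make the failure mode concrete, here is a minimal 1-D toy sketch in plain Python. It is not the authors' code. It assumes, purely for illustration, that the pretrained position embedding repeats once per attention window, and it compares naive interpolation to a larger resolution against tiling a window-sized embedding. The helper names (`interpolate_linear`, `tile`, `split_windows`) and the toy values are hypothetical, and the paper's actual "absolute win" strategy may differ in its details.

```python
# Toy 1-D illustration (assumption: the pretrained embedding repeats per window).

def interpolate_linear(values, new_len):
    """Linearly resample a 1-D list to new_len points, keeping endpoints aligned."""
    if new_len == 1:
        return [values[0]]
    scale = (len(values) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        x = i * scale
        lo = int(x)
        hi = min(lo + 1, len(values) - 1)
        frac = x - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


def tile(values, new_len):
    """Repeat a window-sized embedding until it covers new_len tokens."""
    return [values[i % len(values)] for i in range(new_len)]


def split_windows(values, window):
    """Cut a token sequence into consecutive, non-overlapping windows."""
    return [values[i:i + window] for i in range(0, len(values), window)]


WINDOW = 4                               # window size in tokens (fixed across resolutions)
window_embed = [0.0, 1.0, 2.0, 3.0]      # hypothetical learned per-window pattern

pretrain = tile(window_embed, 8)         # low-resolution sequence: 2 windows
naive = interpolate_linear(pretrain, 16) # naive upscaling stretches the pattern
tiled = tile(window_embed, 16)           # tiling keeps the pattern window-aligned

naive_windows = split_windows(naive, WINDOW)
tiled_windows = split_windows(tiled, WINDOW)

# Each tiled window still sees the pattern it saw at pretraining resolution;
# the interpolated windows do not.
print("tiled windows match pretraining:", all(w == window_embed for w in tiled_windows))
print("naive windows match pretraining:", all(w == window_embed for w in naive_windows))
```

In this sketch the window size in tokens stays fixed while the sequence grows, so stretching the embedding changes what each window sees, whereas tiling does not. That is the intuition behind "interpolating position embeddings while using window attention is wrong".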

