CV | LG | Jul 18, 2024

Keypoint Aware Masked Image Modelling

arXiv:2407.13873v3 | 1 citation | h-index: 4 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in masked image modeling for vision transformers, offering incremental improvements for researchers and practitioners in computer vision.

The paper tackles SimMIM's suboptimal linear probing performance in vision transformer pretraining. It proposes KAMIM, which applies a keypoint-aware patch-wise weighting to provide better context during reconstruction. On ImageNet-1K with ViT-B, linear probing accuracy rises from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.3%.

SimMIM is a widely used method for pretraining vision transformers using masked image modeling. However, despite its success in fine-tuning performance, it has been shown to perform sub-optimally when used for linear probing. We propose an efficient patch-wise weighting derived from keypoint features, which captures local information and provides better context during SimMIM's reconstruction phase. Our method, KAMIM, improves the top-1 linear probing accuracy from 16.12% to 33.97%, and the fine-tuning accuracy from 76.78% to 77.3%, on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs. We conduct extensive testing on different datasets, keypoint extractors, and model architectures and observe that patch-wise weighting augments linear probing performance for larger pretraining datasets. We also analyze the learned representations of a ViT-B trained using KAMIM and observe that they behave similarly to those learned with contrastive learning, with longer attention distances and homogeneous self-attention across layers. Our code is publicly available at https://github.com/madhava20217/KAMIM.
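
As a rough illustration of the patch-wise weighting idea described in the abstract, the sketch below counts keypoints per patch, turns the counts into weights, and uses them to scale an L1 reconstruction error over masked patches. The abstract does not give the weighting formula or say where the weights are applied, so everything here is an assumption: the function names, the exponential weighting with a `temperature` parameter, and the choice to weight the loss are hypothetical and are not taken from the paper or from the KAMIM repository. Keypoints are assumed to arrive as `(x, y)` pixel coordinates from any detector.

```python
import numpy as np


def keypoint_patch_weights(keypoints, image_size, patch_size, temperature=1.0):
    """Return a (grid, grid) array of per-patch weights from (x, y) keypoints.

    Assumed weighting (not from the paper): exp(normalised count / temperature),
    so patches without keypoints get weight 1 and denser patches get more.
    """
    grid = image_size // patch_size
    counts = np.zeros((grid, grid), dtype=np.float64)
    for x, y in keypoints:
        # Clamp so keypoints on the far image border fall in the last patch.
        col = min(int(x) // patch_size, grid - 1)
        row = min(int(y) // patch_size, grid - 1)
        counts[row, col] += 1
    density = counts / max(counts.max(), 1.0)
    return np.exp(density / temperature)


def weighted_masked_l1(pred, target, mask, weights):
    """Weighted L1 error averaged over masked patches.

    pred, target: (grid, grid, D) arrays of flattened patch pixels.
    mask: (grid, grid) array, 1 where the patch was masked, 0 otherwise.
    weights: (grid, grid) array from keypoint_patch_weights.
    """
    per_patch = np.abs(pred - target).mean(axis=-1)
    return float((per_patch * weights * mask).sum() / max(mask.sum(), 1))


# Toy usage: a 32x32 image split into 16x16 patches with three keypoints.
weights = keypoint_patch_weights([(3, 3), (5, 2), (20, 20)], 32, 16)
pred = np.zeros((2, 2, 4))
target = np.ones((2, 2, 4))
loss = weighted_masked_l1(pred, target, np.ones((2, 2)), weights)
```

In this toy setup the patch with two keypoints receives the largest weight, so its reconstruction error contributes most to the loss, while patches without keypoints keep the unweighted contribution.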

Code Implementations: 1 repo
