CV · LG · Nov 24, 2021

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

arXiv:2111.12710v3 · 282 citations
Originality: Highly original
AI Analysis

This work addresses the contradiction between current prediction targets and human perception in vision transformer pre-training, leading to better transfer performance in tasks like image classification, object detection, and segmentation.

The paper improves BERT pre-training for vision transformers by proposing a perceptual codebook that aligns prediction targets with human perception. It reaches 84.5% Top-1 accuracy on ImageNet-1K with ViT-B, +1.3% over BEiT under the same pre-training epochs, and 88.3% with ViT-H.

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.
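The idea of enforcing perceptual similarity during tokenizer training can be illustrated with a toy loss. The following is a minimal sketch under stated assumptions, not the paper's implementation: it adds a feature-space distance to a pixel reconstruction loss, which is one common form of perceptual loss. The names `tokenizer_loss`, `perceptual_loss`, `toy_extractor`, and the `weight` parameter are hypothetical, and `toy_extractor` merely stands in for the self-supervised transformer the abstract mentions; the actual layers, distance, and weighting used in PeCo are not given in this summary.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
# Hypothetical interface: maps an image to one feature vector per layer.
FeatureExtractor = Callable[[Vector], List[List[float]]]


def l2_normalize(v: Vector, eps: float = 1e-8) -> List[float]:
    """Scale a vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def pixel_loss(x: Vector, x_rec: Vector) -> float:
    """Mean squared error between an image and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec)) / len(x)


def perceptual_loss(x: Vector, x_rec: Vector, extract: FeatureExtractor) -> float:
    """Sum over layers of squared distance between normalized features."""
    total = 0.0
    for f, f_rec in zip(extract(x), extract(x_rec)):
        f, f_rec = l2_normalize(f), l2_normalize(f_rec)
        total += sum((a - b) ** 2 for a, b in zip(f, f_rec))
    return total


def tokenizer_loss(x: Vector, x_rec: Vector, extract: FeatureExtractor,
                   weight: float = 1.0) -> float:
    """Pixel reconstruction loss plus a weighted perceptual term (assumed form)."""
    return pixel_loss(x, x_rec) + weight * perceptual_loss(x, x_rec, extract)


def toy_extractor(x: Vector) -> List[List[float]]:
    """Toy stand-in for a deep network: raw values and adjacent differences."""
    return [list(x), [b - a for a, b in zip(x, x[1:])]]


image = [1.0, 2.0, 3.0]
reconstruction = [1.0, 2.0, 4.0]
loss = tokenizer_loss(image, reconstruction, toy_extractor, weight=0.5)
```

In this sketch a reconstruction is penalized both for pixel error and for drifting away from the original in feature space, so two images with similar features are pushed toward similar codes. Setting `weight` to zero recovers a plain pixel-level objective.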

Code Implementations: 1 repo