CV | Mar 29, 2022

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan

Tencent

arXiv:2203.15371v4 | 18.844 citations | h-index: 53 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the discretization bottleneck in self-supervised image representation learning, where each masked patch is forced onto a single token id, and offers incremental improvements over existing methods like BEiT.

The paper tackles the problem of improper discretization in masked image modeling for image BERT pre-training by proposing mc-BEiT, which replaces single-token targets with multi-choice training objectives: soft probability vectors over token ids, refined by inter-patch perceptions. The pre-trained ViT-B model achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K, 49.2% AP^b and 44.0% AP^m on COCO object detection and instance segmentation, and 50.8% mIOU on ADE20K semantic segmentation.

Image BERT pre-training with masked image modeling (MIM) has become a popular practice for self-supervised representation learning. A seminal work, BEiT, casts MIM as a classification task with a visual vocabulary, tokenizing the continuous visual signals into discrete vision tokens using a pre-learned dVAE. Although this is a feasible solution, improper discretization hinders further improvements of image pre-training. Since image discretization has no ground-truth answers, we believe that the masked patch should not be assigned a unique token id even if a better tokenizer can be obtained. In this work, we introduce an improved BERT-style image pre-training method, namely mc-BEiT, which performs MIM proxy tasks towards eased and refined multi-choice training objectives. Specifically, the multi-choice supervision for the masked image patches is formed by the soft probability vectors of the discrete token ids, which are predicted by the off-the-shelf image tokenizer and further refined by high-level inter-patch perceptions, based on the observation that similar patches should share their choices. Extensive experiments on classification, segmentation, and detection tasks demonstrate the superiority of our method, e.g., the pre-trained ViT-B achieves 84.1% top-1 fine-tuning accuracy on ImageNet-1K classification, 49.2% AP^b and 44.0% AP^m of object detection and instance segmentation on COCO, and 50.8% mIOU on ADE20K semantic segmentation, outperforming the competitive counterparts. The code will be available at https://github.com/lixiaotong97/mc-BEiT.
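
The multi-choice supervision described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the temperature `tau`, the mixing weight `omega`, the use of cosine similarity between patch features, and the function names are all assumptions made for illustration. The sketch turns tokenizer logits into soft probability vectors, lets similar patches share their choices through a row-normalized affinity matrix, and scores predictions with a soft-target cross-entropy.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_choice_targets(tokenizer_logits, patch_features, tau=1.0, omega=0.8):
    """Build soft targets over the visual vocabulary (illustrative only).

    tokenizer_logits: (N, V) tokenizer scores for N patches, V token ids.
    patch_features:   (N, D) high-level features used to compare patches.
    tau, omega:       hypothetical temperature and mixing weight.
    """
    # Soft probability vector per patch instead of a single hard token id.
    p = softmax(tokenizer_logits / tau, axis=-1)

    # Cosine similarity between patches (assumed similarity measure).
    norms = np.linalg.norm(patch_features, axis=-1, keepdims=True)
    f = patch_features / np.maximum(norms, 1e-12)
    affinity = softmax(f @ f.T, axis=-1)  # each row sums to 1

    # Similar patches share their choices: average neighbours' distributions.
    refined = affinity @ p

    # Blend the tokenizer's own prediction with the refined one.
    return omega * p + (1.0 - omega) * refined


def soft_cross_entropy(pred_logits, targets):
    """Mean cross-entropy between predicted logits and soft targets."""
    z = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_q = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-(targets * log_q).sum(axis=-1).mean())


# Usage with random stand-ins for tokenizer and encoder outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))   # 6 masked patches, vocabulary of 10
features = rng.normal(size=(6, 4))  # 4-dimensional patch features
targets = multi_choice_targets(logits, features)
loss = soft_cross_entropy(rng.normal(size=(6, 10)), targets)
```

In this sketch, setting `omega=1.0` recovers the unrefined soft tokenizer distribution, and a very small `tau` pushes it towards a single-choice, one-hot-like target.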
