CV AI LG | Mar 14, 2023

AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+

Xiao Wang, Ying Wang, Ziwei Xuan, Guo-Jun Qi

arXiv:2303.07598v13.94 citations | h-index: 63 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of preventing trivial feature learning in unsupervised vision transformer pretraining, though it appears to be an incremental enhancement of existing MIM methods such as MAE.

The paper tackles the problem of unsupervised pretraining for vision transformers by proposing Adversarial Positional Embedding (AdPE), which perturbs position encodings to make the pretext task harder and force learning of more discriminative features. Results show improvements of 0.8% and 0.4% in fine-tuning accuracy on ImageNet1K for ViT-B and ViT-L, and gains of 2.6% mIoU on ADE20K and up to 3.2% AP on COCO for transfer learning.

Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among these tasks is Masked Image Modeling (MIM), which aligns with the pretraining of language transformers by predicting masked patches. A key criterion in unsupervised pretraining is that the pretext task must be hard enough to prevent the transformer encoder from learning trivial low-level features that do not generalize well to downstream tasks. For this purpose, we propose an Adversarial Positional Embedding (AdPE) approach. It distorts the local visual structures by perturbing the position encodings so that the learned transformer cannot simply use locally correlated patches to predict the missing ones. We hypothesize that this forces the transformer encoder to learn more discriminative features in a global context, with stronger generalizability to downstream tasks. We consider both absolute and relative positional encodings, where adversarial positions can be imposed in both the embedding mode and the coordinate mode. We also present a new MAE+ baseline that, together with AdPE, brings the performance of MIM pretraining to a new level. The experiments demonstrate that our approach can improve the fine-tuning accuracy of MAE by $0.8\%$ and $0.4\%$ over 1600 epochs of pretraining ViT-B and ViT-L on ImageNet1K. For the transfer learning task, it outperforms MAE with the ViT-B backbone by $2.6\%$ in mIoU on ADE20K, and by $3.2\%$ in AP$^{bbox}$ and $1.6\%$ in AP$^{mask}$ on COCO, respectively. These results are obtained with AdPE as a pure MIM approach that does not use any extra models or external datasets for pretraining. The code is available at https://github.com/maple-research-lab/AdPE.
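
To make the two perturbation modes concrete, here is a minimal NumPy sketch. It is not the authors' implementation. The abstract does not say how the adversarial perturbation is computed, so the sketch assumes a single sign-gradient ascent step (FGSM-style) under a max-norm budget `eps`. A toy linear predictor stands in for the transformer and a 1-D sinusoidal embedding stands in for the positional encoding. All function names are hypothetical and do not come from the AdPE repository.

```python
import numpy as np

def _freqs(dim):
    # Frequencies of a standard 1-D sinusoidal embedding (assumed form).
    return 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))

def sincos_embed(coords, dim):
    """Map (possibly fractional) positions of shape (N,) to embeddings (N, dim)."""
    angles = coords[:, None] * _freqs(dim)[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def loss_and_grad(tokens, pos, W, target):
    """MSE of a toy linear predictor and its gradient w.r.t. the position embeddings."""
    err = (tokens + pos) @ W - target
    loss = np.mean(err ** 2)
    grad_pos = (2.0 / err.size) * err @ W.T
    return loss, grad_pos

def adversarial_embedding_step(tokens, pos, W, target, eps):
    """Embedding mode (assumed): perturb the embeddings directly, |delta| <= eps."""
    _, g = loss_and_grad(tokens, pos, W, target)
    return pos + eps * np.sign(g)  # ascend the loss, not descend

def adversarial_coordinate_step(tokens, coords, dim, W, target, eps):
    """Coordinate mode (assumed): perturb the positions, then re-embed them."""
    f = _freqs(dim)
    angles = coords[:, None] * f[None, :]
    _, g = loss_and_grad(tokens, sincos_embed(coords, dim), W, target)
    # Chain rule: d sin(c*f)/dc = f*cos(c*f), d cos(c*f)/dc = -f*sin(c*f).
    d_emb = np.concatenate([np.cos(angles) * f, -np.sin(angles) * f], axis=1)
    grad_coords = np.sum(g * d_emb, axis=1)
    return coords + eps * np.sign(grad_coords)

# Toy data: 16 patch tokens of width 8 with random weights and targets.
rng = np.random.default_rng(0)
n, dim = 16, 8
tokens = rng.normal(size=(n, dim))
W = rng.normal(size=(dim, dim))
target = rng.normal(size=(n, dim))
coords = np.arange(n, dtype=float)
pos = sincos_embed(coords, dim)

clean_loss, _ = loss_and_grad(tokens, pos, W, target)
adv_pos = adversarial_embedding_step(tokens, pos, W, target, eps=0.05)
adv_loss, _ = loss_and_grad(tokens, adv_pos, W, target)
adv_coords = adversarial_coordinate_step(tokens, coords, dim, W, target, eps=0.5)
```

In this toy setting the loss is a convex quadratic in the embeddings, so the embedding-mode step cannot lower it. The coordinate-mode step passes through the nonlinear sinusoid, so it guarantees only that each position moves by at most `eps`. In actual pretraining the encoder would then be updated to minimize the loss under the perturbed positions.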
