CV, Dec 4, 2023

Rejuvenating image-GPT as Strong Visual Representation Learners

arXiv:2312.02147v2, 18 citations, h-index: 35, Has Code, ICML
Originality: Highly original
AI Analysis

This work addresses the need for more effective visual representation learners in computer vision by revisiting autoregressive pretraining in the style of image-GPT and improving its results on standard benchmarks such as ImageNet-1K.

The paper tackles the problem of improving autoregressive pretraining for visual representation learning by shifting from raw pixel prediction to semantic tokens and adding visible token prediction, achieving 90.0% top-1 accuracy on ImageNet-1K with a ViT-H model.

This paper enhances image-GPT (iGPT), one of the pioneering works that introduced autoregressive pretraining, predicting the next pixels, for visual representation learning. Two simple yet essential changes are made. First, we shift the prediction target from raw pixels to semantic tokens, enabling a higher-level understanding of visual content. Second, we supplement the autoregressive modeling by instructing the model to predict not only the next tokens but also the visible tokens. This pipeline is particularly effective when semantic tokens are encoded by discriminatively trained models, such as CLIP. We introduce this novel approach as D-iGPT. Extensive experiments showcase that D-iGPT excels as a strong learner of visual representations. A notable achievement is its compelling performance on the ImageNet-1K dataset: by training on publicly available datasets, D-iGPT achieves an unprecedented 90.0% top-1 accuracy with a vanilla ViT-H. Additionally, D-iGPT shows strong generalization on downstream tasks. Code is available at https://github.com/OliverRensu/D-iGPT.
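The two changes described in the abstract can be illustrated with a toy objective. The following is a minimal sketch, not the authors' implementation: the function names, the use of cosine distance as the discrepancy measure, and the equal weighting of the two terms are all assumptions made for illustration. The target vectors stand in for semantic token features that would come from a frozen, discriminatively trained encoder such as CLIP.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def digpt_style_loss(next_preds, visible_preds, targets):
    """Toy two-term objective over a sequence of n >= 2 tokens (hypothetical helper).

    targets[i]       : semantic feature of token i (assumed to come from a
                       frozen CLIP-like encoder; plain lists here).
    next_preds[i]    : prediction of targets[i + 1] made from tokens 0..i,
                       so this list has n - 1 entries.
    visible_preds[i] : prediction of targets[i] made from tokens 0..i,
                       so this list has n entries.
    """
    n = len(targets)
    # Autoregressive term: position i is scored against the NEXT token's feature.
    next_terms = [cosine_distance(next_preds[i], targets[i + 1]) for i in range(n - 1)]
    # Visible term: position i is scored against its OWN token's feature.
    visible_terms = [cosine_distance(visible_preds[i], targets[i]) for i in range(n)]
    next_loss = sum(next_terms) / len(next_terms)
    visible_loss = sum(visible_terms) / len(visible_terms)
    # Equal weighting of the two terms is an assumption of this sketch.
    return next_loss + visible_loss


# Usage with made-up 2-D features for a 3-token sequence.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
next_preds = [[0.0, 2.0], [3.0, 3.0]]  # same directions as targets[1], targets[2]
visible_preds = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = digpt_style_loss(next_preds, visible_preds, targets)
```

Because cosine distance ignores vector magnitude, predictions that point in the same direction as their targets drive both terms toward zero regardless of scale.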

Code Implementations: 4 repos
