CV · Apr 2, 2024

ViTamin: Designing Scalable Vision Models in the Vision-Language Era

arXiv:2404.02132v2 · 62 citations · h-index: 22 · CVPR
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable and efficient vision encoders in vision-language models, offering a comprehensive benchmarking protocol and improved performance for tasks like classification and retrieval, though it is incremental in advancing existing CLIP frameworks.

The paper addresses how to evaluate and design vision encoders for vision-language models (VLMs). It introduces ViTamin, a vision model tailored for VLMs. ViTamin-L outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy under the same DataComp-1B data and OpenCLIP training scheme, and ViTamin-XL reaches 82.9% with 436M parameters, compared with 82.0% for the 4.4B-parameter EVA-E.

Recent breakthroughs in vision-language models (VLMs) have opened a new chapter for the vision community. VLMs provide stronger and more generalizable feature embeddings than ImageNet-pretrained models, thanks to training on large-scale Internet image-text pairs. However, despite these achievements, vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although the pure transformer has proven effective for text encoding, it remains questionable whether the same holds for image encoding, especially since the many network types proposed on the ImageNet benchmark are rarely studied in VLMs. Because of the small data and model scale, the original conclusions about model design drawn on ImageNet can be limited and biased. In this paper, we aim to build an evaluation protocol for vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. We provide a comprehensive way to benchmark different vision models, covering their zero-shot performance and their scalability in both model and training data sizes. To this end, we introduce ViTamin, a new vision model tailored for VLMs. ViTamin-L outperforms ViT-L by 2.0% in ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks, including classification, retrieval, open-vocabulary detection and segmentation, and large multi-modal models. When further scaling up the model size, our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy, surpassing the 82.0% achieved by EVA-E, which has ten times more parameters (4.4B).
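For readers unfamiliar with the metric, the following is a minimal sketch of how zero-shot accuracy is typically computed under a CLIP-style setup: each image embedding is assigned to the class whose text embedding has the highest cosine similarity. It is an illustration, not the paper's evaluation code. The function names and the toy two-dimensional embeddings are hypothetical; in practice the embeddings would come from a trained image encoder and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return the index of the class whose text embedding is most similar to the image."""
    img = l2_normalize(image_embedding)
    scores = []
    for text_embedding in class_text_embeddings:
        txt = l2_normalize(text_embedding)
        scores.append(sum(a * b for a, b in zip(img, txt)))
    return max(range(len(scores)), key=scores.__getitem__)


def zero_shot_accuracy(image_embeddings, labels, class_text_embeddings):
    """Fraction of images whose predicted class index matches the ground-truth label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embeddings) == label
        for emb, label in zip(image_embeddings, labels)
    )
    return correct / len(labels)


# Toy data (hypothetical): two classes, three images, 2-D embeddings.
class_text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
image_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]

accuracy = zero_shot_accuracy(image_embeddings, labels, class_text_embeddings)
print(f"zero-shot accuracy: {accuracy:.3f}")
```

No classifier is trained on the target classes here; the class "weights" are just text embeddings, which is why the quality of the image encoder matters directly for the reported numbers.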
