CV | Jun 27, 2023

CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a \$10,000 Budget; An Extra \$4,000 Unlocks 81.8% Accuracy

arXiv:2306.15658v114.928 citations | h-index: 35 | Has Code

Originality: Incremental advance

AI Analysis

This work reduces computational costs for training high-performance CLIP models, benefiting researchers and practitioners in computer vision and multimodal AI, though it is incremental as it builds on prior CLIPA findings.

The paper tackles the high computational cost of training CLIP models by applying an inverse scaling law to both the training and finetuning stages. It achieves 81.1% zero-shot ImageNet accuracy on a \$10,000 budget and 81.8% with an additional \$4,000; the 81.1% model surpasses the prior best CLIP model (OpenCLIP, 80.1%) by 1.0% while reducing computational cost by ~39X.

The recent work CLIPA presents an inverse scaling law for CLIP training -- whereby the larger the image/text encoders used, the shorter the sequence length of image/text tokens that can be applied in training. This finding enables us to train high-performance CLIP models with significantly reduced computations. Building upon this work, we hereby present CLIPA-v2 with two key contributions. Technically, we find this inverse scaling law is also applicable in the finetuning stage, enabling further reduction in computational needs. Empirically, we explore CLIPA at scale, extending the experiments up to the H/14 model with ~13B image-text pairs seen during training. Our results are exciting -- by only allocating a budget of \$10,000, our CLIP model achieves an impressive zero-shot ImageNet accuracy of 81.1%, surpassing the prior best CLIP model (from OpenCLIP, 80.1%) by 1.0% and meanwhile reducing the computational cost by ~39X. Moreover, with an additional investment of \$4,000, we can further elevate the zero-shot ImageNet accuracy to 81.8%. Our code and models are available at https://github.com/UCSC-VLAA/CLIPA.
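To make the "shorter sequence length of image tokens" idea concrete, the sketch below counts how many patch tokens a ViT with patch size 14 (as in the H/14 model) sees at different square input resolutions. The specific resolutions and the helper name `num_image_tokens` are illustrative assumptions, not values or code taken from the abstract, and the count ignores any extra class token.

```python
def num_image_tokens(image_size: int, patch_size: int = 14) -> int:
    """Count the non-overlapping square patches of a square input image.

    Assumption: the image side is an exact multiple of the patch side,
    and no class token is included in the count.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Hypothetical resolutions chosen only to illustrate the arithmetic.
ILLUSTRATIVE_SIZES = [84, 224, 336]

baseline_tokens = num_image_tokens(224)
for size in ILLUSTRATIVE_SIZES:
    tokens = num_image_tokens(size)
    # Fraction of the token count seen at 224x224 input.
    fraction = tokens / baseline_tokens
    print(f"{size}x{size}: {tokens} tokens ({fraction:.2f}x of 224x224)")
```

Because the token count grows with the square of the image side, shrinking the input during most of training cuts the sequence length sharply, which is the lever the inverse scaling law relies on.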
