CV · Mar 27, 2023

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Beijing Academy of Artificial Intelligence (BAAI)
arXiv:2303.15389v1 · 878 citations · h-index: 23 · Has Code
Originality: Incremental advance
AI Analysis

This work targets the high cost of training CLIP models. EVA-CLIP outperforms earlier CLIP models of the same parameter count at a significantly lower training cost, which makes it relevant to researchers and practitioners in vision-language AI.

The paper improves the efficiency and effectiveness of CLIP training with EVA-CLIP, a series of models. The largest, EVA-02-CLIP-E/14+ (5.0B parameters, 9 billion seen samples), reaches 82.0% zero-shot top-1 accuracy on ImageNet-1K val. The smaller EVA-02-CLIP-L/14+ (430 million parameters, 6 billion seen samples) reaches 80.4%.

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
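
The headline numbers above are zero-shot top-1 accuracies: each image is assigned the class whose text embedding is most similar to the image embedding, with no fine-tuning on the target dataset. The following is a minimal sketch of that metric, not the authors' evaluation code. The function names and the toy 3-dimensional embeddings are hypothetical; a real evaluation would take the embeddings from a CLIP image encoder and text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def cosine_scores(image_vec, class_vecs):
    """Cosine similarity between one image embedding and every class text embedding."""
    img = l2_normalize(image_vec)
    return [sum(a * b for a, b in zip(img, l2_normalize(c))) for c in class_vecs]


def zero_shot_top1(image_embs, class_text_embs, labels):
    """Return (top-1 accuracy, predictions) for nearest-class-text classification."""
    preds = []
    for img in image_embs:
        scores = cosine_scores(img, class_text_embs)
        # Predicted class = index of the most similar class text embedding.
        preds.append(max(range(len(scores)), key=scores.__getitem__))
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels), preds


# Toy data (assumption): 3 classes, 4 images, 3-dimensional embeddings.
class_text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_embs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.9], [0.6, 0.5, 0.0]]
labels = [0, 1, 2, 1]

accuracy, preds = zero_shot_top1(image_embs, class_text_embs, labels)
print(f"top-1 accuracy: {accuracy:.2f}, predictions: {preds}")
```

In this toy run the last image is closer to class 0 than to its true class 1, so three of four predictions are correct. On ImageNet-1K val the same computation would run over 50,000 images and 1,000 class prompts.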

Code Implementations: 4 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
