CV · MM · Dec 17, 2021

Contrastive Vision-Language Pre-training with Limited Resources

arXiv:2112.09331v3 · 44 citations · Has Code
AI Analysis

This work addresses the resource-intensive pre-training that keeps multi-modal alignment out of reach for researchers with limited compute and data. Its contribution is an incremental improvement in efficiency rather than a new paradigm.

The paper tackles the high resource requirements of contrastive vision-language pre-training by proposing methods that reduce data and computational needs. It achieves competitive results with only 14M image-text pairs from publicly accessible academic datasets and 8 V100 GPUs, and matches or surpasses state-of-the-art methods when pre-trained on 100M web-collected samples.

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and extending them. To this end, we propose a stack of novel methods that significantly reduce the heavy resource dependency and allow us to conduct dual-encoder multi-modal representation alignment with limited resources. In addition, we provide a reproducible baseline with competitive results, namely ZeroVL, using only 14M image-text pairs from publicly accessible academic datasets and 8 V100 GPUs. We also collect 100M web data for pre-training and achieve results comparable or superior to those of state-of-the-art methods, further demonstrating the effectiveness of our methods on large-scale data. We hope that this work will provide useful data points and experience for future research in contrastive vision-language pre-training. Code is available at https://github.com/zerovl/ZeroVL.
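
For readers unfamiliar with the dual-encoder setup mentioned in the abstract, the following is a minimal sketch of the generic symmetric contrastive objective popularized by CLIP-style models. It is not the ZeroVL method or its code: the function names, the NumPy implementation, and the temperature of 0.07 are illustrative assumptions, and the embeddings stand in for the outputs of any image encoder and text encoder.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Generic CLIP-style loss for a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same image-text pair.
    temperature: assumed value for illustration only.
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    # The matching pair for each row/column sits on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings paired with the wrong partner.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned_loss = symmetric_contrastive_loss(emb, emb)
misaligned_loss = symmetric_contrastive_loss(emb, np.roll(emb, 1, axis=0))
```

Because every other item in the batch serves as a negative, the quality of this objective depends on batch size, which is one reason the pioneering works relied on large GPU counts.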

Code Implementations: 1 repo