Scaling Language-Image Pre-training via Masking
This work addresses the cost of scaling vision-language learning by offering researchers and practitioners a more efficient pre-training method. The contribution is incremental in that it builds on the existing CLIP framework.
The paper tackles inefficient training in language-image pre-training by introducing FLIP, a method that randomly masks out and removes a large portion of image patches during training. This improves both speed and accuracy over the no-masking baseline, and FLIP outperforms CLIP counterparts trained on the same data across a wide range of downstream tasks.
We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
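The masking step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `random_mask_patches`, the array layout `(batch, num_patches, dim)`, and the default `mask_ratio=0.5` are assumptions chosen for illustration. The sketch only shows how dropping a random subset of patch tokens shrinks the sequence the image encoder has to process.

```python
import numpy as np


def random_mask_patches(patches, mask_ratio=0.5, rng=None):
    """Randomly drop a fraction of patch tokens from each image.

    patches: array of shape (batch, num_patches, dim), assumed to be
             patch embeddings before they enter the image encoder.
    mask_ratio: fraction of patches to remove (illustrative default).
    Returns (kept, keep_idx), where kept has shape
    (batch, num_keep, dim) and keep_idx holds the surviving positions.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch, num_patches, _ = patches.shape
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))

    # Sorting uniform noise gives an independent random permutation per image.
    noise = rng.random((batch, num_patches))
    keep_idx = np.argsort(noise, axis=1)[:, :num_keep]
    # Restore the original spatial order of the surviving patches.
    keep_idx = np.sort(keep_idx, axis=1)

    # Gather the kept patches; masked patches are removed, not zeroed.
    kept = np.take_along_axis(patches, keep_idx[:, :, None], axis=1)
    return kept, keep_idx


# Example: 196 patch tokens per image (e.g. a 14 x 14 grid, assumed here).
patches = np.random.default_rng(0).normal(size=(4, 196, 32))
kept, keep_idx = random_mask_patches(
    patches, mask_ratio=0.5, rng=np.random.default_rng(1)
)
print(kept.shape)
```

With 196 patches per image and a mask ratio of 0.5, the encoder sees 98 tokens per image instead of 196. The shorter sequence is what frees the time and memory that the abstract says can be spent on more image-text pairs and larger contrastive batches.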