CL CV LG, Sep 10, 2021

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

Jue Wang, Haofan Wang, Jincan Deng, Weijia Wu, Debing Zhang

arXiv:2109.04699v22.420 citations, h-index: 76

Originality: Incremental advance

AI Analysis

This work addresses efficiency and data quality issues in cross-modal pre-training for vision-language models, offering a more resource-effective solution with broad applications in retrieval and classification tasks.

The paper tackles the high cost and data noise of cross-modal pre-training by proposing EfficientCLIP, which uses ensemble confident learning to filter noisy data and incorporates single-modal text data to improve generalization. The method achieves state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources used by CLIP and WenLan, while also showing strong generalization to single-modal tasks.

While large-scale pre-training has made great progress in bridging the gap between vision and language, it still faces several challenges. First, pre-training is expensive. Second, there is no efficient way to handle data noise, which degrades model performance. Third, previous methods leverage only limited image-text paired data and ignore richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose EfficientCLIP, a method that uses Ensemble Confident Learning to obtain a less noisy data subset. Extra non-paired single-modal text data is used to boost the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources used by CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
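
The abstract does not spell out how Ensemble Confident Learning selects the less noisy subset, so the following is only an illustrative sketch and not the authors' procedure. It assumes that several models each assign an image-text similarity score to every pair, and that pairs are kept when their ensemble-averaged score falls in the top fraction. The function name `filter_noisy_pairs`, the `keep_ratio` parameter, and the score values are all hypothetical.

```python
from statistics import mean


def filter_noisy_pairs(scores_per_model, keep_ratio=0.5):
    """Return the sorted indices of image-text pairs to keep.

    scores_per_model[m][i] is the (hypothetical) similarity score that
    model m assigns to pair i. Higher is assumed to mean "better matched".
    """
    if not scores_per_model:
        return []
    n_pairs = len(scores_per_model[0])
    if any(len(s) != n_pairs for s in scores_per_model):
        raise ValueError("every model must score every pair")

    # Assumed ensemble rule: average each pair's score across all models.
    ensemble = [
        mean(model[i] for model in scores_per_model) for i in range(n_pairs)
    ]

    # Keep the top keep_ratio fraction of pairs by ensemble score.
    n_keep = max(1, int(n_pairs * keep_ratio)) if n_pairs else 0
    ranked = sorted(range(n_pairs), key=lambda i: ensemble[i], reverse=True)
    return sorted(ranked[:n_keep])


# Made-up scores from three models for four pairs.
scores = [
    [0.90, 0.20, 0.70, 0.10],
    [0.80, 0.30, 0.60, 0.20],
    [0.85, 0.10, 0.75, 0.30],
]
kept = filter_noisy_pairs(scores, keep_ratio=0.5)
print(kept)
```

Averaging is only one possible way to combine the models; a voting or thresholding rule would fit the same interface.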
