CL, CV, LG | Sep 10, 2021

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

arXiv:2109.04699v2 | 20 citations
Originality: Incremental advance
AI Analysis

This work addresses efficiency and data quality issues in cross-modal pre-training for vision-language models, offering a more resource-effective solution with broad applications in retrieval and classification tasks.

The paper tackles the high cost and data noise of cross-modal pre-training by proposing EfficientCLIP, which uses ensemble confident learning to filter noisy data and adds single-modal text data to improve generalization. The method achieves state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources used by CLIP and WenLan, and it also generalizes well to single-modal tasks.

While large-scale pre-training has achieved great success in bridging the gap between vision and language, it still faces several challenges. First, pre-training is expensive. Second, there is no efficient way to handle data noise, which degrades model performance. Third, previous methods leverage only limited image-text paired data and ignore richer single-modal data, which may result in poor generalization to single-modal downstream tasks. In this work, we propose EfficientCLIP, a method that uses Ensemble Confident Learning to obtain a less noisy data subset. Additional non-paired single-modal text data is used to boost the generalization of the text branch. We achieve state-of-the-art performance on Chinese cross-modal retrieval tasks with only 1/10 of the training resources used by CLIP and WenLan, while showing excellent generalization to single-modal tasks, including text retrieval and text classification.
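
The abstract does not spell out how Ensemble Confident Learning selects the less noisy subset, so the following is only a minimal sketch of one plausible reading: several models each score how well an image and its caption match, the scores are averaged, and the best-matching fraction of pairs is kept. The function names (`cosine`, `ensemble_filter`), the `keep_ratio` parameter, and the use of cosine similarity as the confidence score are all assumptions made for illustration, not the paper's actual procedure.

```python
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ensemble_filter(pairs_per_model, keep_ratio=0.5):
    """Keep the image-text pairs that an ensemble of models agrees are well matched.

    pairs_per_model[m][i] is (image_embedding, text_embedding) for pair i
    as embedded by model m. Hypothetical interface, assumed for this sketch.
    Returns (sorted indices of kept pairs, averaged score per pair).
    """
    n_pairs = len(pairs_per_model[0])
    # Average each pair's image-text similarity over all models in the ensemble.
    scores = [
        mean([cosine(*model[i]) for model in pairs_per_model])
        for i in range(n_pairs)
    ]
    # Keep the top keep_ratio fraction of pairs, and always at least one.
    n_keep = max(1, int(n_pairs * keep_ratio))
    ranked = sorted(range(n_pairs), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep]), scores


# Toy data: pairs 0 and 2 are aligned under both models, pairs 1 and 3 are not.
model_a = [([1, 0], [1, 0]), ([1, 0], [0, 1]), ([0, 1], [0, 1]), ([1, 0], [-1, 0])]
model_b = [([1, 1], [1, 1]), ([1, 0], [0, 1]), ([2, 0], [1, 0]), ([0, 1], [0, -1])]
kept, scores = ensemble_filter([model_a, model_b], keep_ratio=0.5)
print(kept)
```

In this toy run the two pairs whose image and text embeddings point in the same direction under both models are retained, and the mismatched pairs are dropped. A real pipeline would replace the hand-written vectors with embeddings from trained encoders and would choose the threshold empirically.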
