RWKV-CLIP: A Robust Vision-Language Representation Learner
This work addresses the quality of noisy, web-crawled image-text data used to pre-train vision-language models. Aimed at researchers and practitioners, it offers an incremental improvement through an LLM-based data processing framework and an RWKV-based model architecture.
The paper tackles the problem of noisy data in CLIP by introducing a diverse description generation framework that uses LLMs. It also proposes RWKV-CLIP, a model that combines the parallel training efficiency of transformers with the efficient inference of RNNs, achieving state-of-the-art performance on tasks such as zero-shot classification and image-text retrieval.
Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that leverages Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model, which combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner: it achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP
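For readers unfamiliar with the contrastive objective that CLIP-style models build on, the following is a minimal sketch of the standard symmetric image-text contrastive loss and of zero-shot classification by nearest text embedding. It is an illustration only: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions made here, and the abstract does not confirm that RWKV-CLIP uses exactly this formulation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] are assumed to describe the same sample.
    The temperature default is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [[sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts]
              for im in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar."""
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t)))
            for t in class_text_embs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy 2-D embeddings (hypothetical values, not model outputs).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_contrastive_loss(images, aligned_texts)
swapped_loss = clip_contrastive_loss(images, swapped_texts)
predicted_class = zero_shot_predict([0.9, 0.1], aligned_texts)
```

In this toy setup the loss is lower when each image is paired with its matching text than when the pairs are swapped. That gap is the signal a noisy caption weakens, which is why improving description quality matters for this kind of training.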