Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
This work addresses the resource-intensive nature of pre-training large multimodal models by enabling efficient transfer to smaller models. The contribution is incremental: it builds on existing CLIP and representation transfer techniques.
The paper tackles the problem of transferring representations from a large pre-trained multimodal model (CLIP-ViT) to a smaller target model (e.g., ResNet-18) using cross-modal similarity matching. With a ResNet-18 student, the method reaches a top-1 linear probe accuracy of 66.2% on ImageNet-1K, outperforming vision-only self-supervised methods such as SimCLR (51.8%) and SwAV (63.7%).
Despite their surprising zero-shot transfer performance, large-scale multimodal models are often prohibitively expensive to pre-train, since pre-training requires a huge amount of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For unsupervised transfer, we introduce cross-modal similarity matching (CSM), which enables a student model to learn the representations of a teacher model by matching the relative similarity distribution across text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA), which can alleviate the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).
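The idea behind cross-modal similarity matching can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the student's image embeddings have already been projected into the teacher's joint embedding space, that similarities are cosine similarities turned into a distribution by a temperature-scaled softmax, and that the distributions are matched with a KL divergence. The function names, the temperature value, and the random toy data are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def similarity_log_probs(image_emb, text_emb, temperature):
    """Log-distribution of each image's similarity over all text prompt embeddings."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (batch, prompts)
    return log_softmax(sims / temperature, axis=-1)


def csm_loss(student_img, teacher_img, text_emb, temperature=0.1):
    """Mean KL(teacher || student) between similarity distributions.

    Assumption: KL divergence and temperature=0.1 are illustrative choices,
    not values taken from the paper.
    """
    log_p_teacher = similarity_log_probs(teacher_img, text_emb, temperature)
    log_p_student = similarity_log_probs(student_img, text_emb, temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return float(kl.mean())


# Toy data standing in for real encoder outputs (hypothetical shapes).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(5, 16))     # 5 text prompt embeddings (teacher text encoder)
teacher_img = rng.normal(size=(4, 16))  # 4 images encoded by the teacher
student_img = rng.normal(size=(4, 16))  # same 4 images encoded by the student

loss_random = csm_loss(student_img, teacher_img, text_emb)
loss_matched = csm_loss(teacher_img, teacher_img, text_emb)
loss_scaled = csm_loss(3.0 * teacher_img, teacher_img, text_emb)
```

In this sketch the loss vanishes when the student reproduces the teacher's similarity pattern over the prompts, even if the student embeddings differ in scale, because only the relative similarities to the text prompt embeddings are compared. No image labels enter the loss, which is consistent with the unsupervised transfer setting described above.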