LG AI CV

Oct 21, 2023

CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta

U of Toronto

arXiv:2310.14108v1

6.63 citations

h-index: 47

Has Code

Originality: Incremental advance

AI Analysis

This work targets researchers and practitioners who need stronger visual representations from CLIP for tasks such as segmentation and detection. It is incremental in that it builds on existing CLIP and model zoo methods.

The paper tackles CLIP's lack of object localization capabilities by augmenting its training with pseudo-labels from task-specific vision models. This yields improvements of up to 16.3% across various vision tasks without compromising CLIP's existing zero-shot classification abilities.

Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.
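
The training setup described in the abstract (contrastive training on image-text pairs plus supervision from pseudo-labels) can be sketched as a combined objective. The following is a minimal illustration only, not the authors' implementation: the function names, the temperature value, the use of a mean-absolute-error loss for pseudo-labels, and the per-task weights are all assumptions made for this example. The abstract does not specify the task heads or loss functions used.

```python
import math

def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of paired embeddings.

    The temperature of 0.07 is an assumed illustrative value.
    """
    img = [_normalize(v) for v in img_emb]
    txt = [_normalize(v) for v in txt_emb]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(transposed))

def pseudo_label_loss(pred, pseudo):
    """Mean absolute error between predictions and pseudo-labels.

    Assumption: a real setup would use a task-appropriate loss per task
    (segmentation, detection, depth, surface normals).
    """
    return sum(abs(p - q) for p, q in zip(pred, pseudo)) / len(pred)

def total_loss(img_emb, txt_emb, task_preds, task_pseudo, task_weights):
    """Contrastive loss plus a weighted sum of per-task pseudo-label losses.

    task_weights is a hypothetical dict of per-task scalars.
    """
    loss = contrastive_loss(img_emb, txt_emb)
    for task, pred in task_preds.items():
        loss += task_weights[task] * pseudo_label_loss(pred, task_pseudo[task])
    return loss
```

In this sketch, the pseudo-labels in task_pseudo would come from frozen, open-source task-specific models run over the image-text dataset. Setting every task weight to zero recovers plain contrastive training.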

