CV · LG · MM · Sep 19, 2023

Improving CLIP Robustness with Knowledge Distillation and Self-Training

arXiv:2309.10361v1 · 6 citations · h-index: 14
Originality: Incremental advance
AI Analysis

The paper addresses robustness in multi-modal computer vision for settings where labeled data is scarce, but the advance is incremental: it builds on the existing CLIP model with a simple modification.

This paper tackles the robustness of the CLIP model in unsupervised learning. It introduces LP-CLIP, a method that trains a linear probing layer on CLIP-generated pseudo-labels with self-training, and reports state-of-the-art results compared to supervised techniques on various datasets.

This paper examines the robustness of CLIP (Contrastive Language-Image Pretraining), a multi-modal computer vision model, in the context of unsupervised learning. The objective is twofold: first, to evaluate the robustness of CLIP, and second, to explore strategies for improving it. To this end, we introduce LP-CLIP, which distills CLIP features through a linear probing layer placed on top of its encoder. This layer is trained with pseudo-labels produced by CLIP, coupled with a self-training strategy. By relying on a simple linear probing layer, LP-CLIP aims to improve the model's ability to withstand the uncertainties and challenges commonly encountered in real-world scenarios. Because it requires no annotated data, the approach is particularly valuable where labels are scarce or costly to obtain. LP-CLIP increases the robustness of CLIP and achieves SOTA results compared to supervised techniques on various datasets.
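
The idea in the abstract (a linear layer on top of frozen features, trained on the model's own zero-shot predictions) can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: random NumPy vectors stand in for CLIP image and text embeddings, Gaussian feature noise stands in for whatever perturbation the self-training step uses, and every function name and hyperparameter here is hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def zero_shot_pseudo_labels(image_feats, text_feats):
    """Teacher step: assign each image the class whose text embedding is most similar."""
    sims = l2_normalize(image_feats) @ l2_normalize(text_feats).T
    return sims.argmax(axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(feats, pseudo_labels, num_classes,
                       epochs=300, lr=1.0, noise=0.05, seed=0):
    """Student step: fit a linear layer to the pseudo-labels with cross-entropy.

    The Gaussian noise is an ASSUMED stand-in for input perturbation;
    the source does not specify the augmentation used.
    """
    rng = np.random.default_rng(seed)
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[pseudo_labels]
    for _ in range(epochs):
        x = feats + noise * rng.standard_normal(feats.shape)
        probs = softmax(x @ W + b)
        grad = (probs - onehot) / n      # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (x.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def make_toy_embeddings(n=300, d=16, k=3, seed=0):
    """Synthetic stand-ins: k 'text' prototypes and n noisy 'image' features."""
    rng = np.random.default_rng(seed)
    text_feats = rng.standard_normal((k, d))
    classes = rng.integers(0, k, size=n)
    image_feats = text_feats[classes] + 0.5 * rng.standard_normal((n, d))
    return l2_normalize(image_feats), text_feats

image_feats, text_feats = make_toy_embeddings()
pseudo = zero_shot_pseudo_labels(image_feats, text_feats)   # no ground-truth labels used
W, b = train_linear_probe(image_feats, pseudo, num_classes=3)
agreement = float(((image_feats @ W + b).argmax(axis=1) == pseudo).mean())
print(f"probe/teacher agreement: {agreement:.2f}")
```

In a real setting the synthetic arrays would be replaced by embeddings from a frozen CLIP encoder and by text embeddings of class prompts; only the linear layer's parameters (`W`, `b`) are updated, which is what keeps the approach annotation-free and cheap.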

