CV · Aug 25, 2022

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

arXiv:2208.12262v2 · 265 citations · h-index: 79 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for better local semantic learning in vision-language models, offering incremental improvements for tasks like image classification and retrieval.

The paper tackles the problem of improving contrastive language-image pretraining by introducing masked self-distillation to learn local patch representations, resulting in superior performance on downstream tasks such as linear probing, fine-tuning, and zero-shot evaluation.

This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill representation from a full image to the representation predicted from a masked image. Such incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive focusing on text-related representation. Second, masked self-distillation is also consistent with vision-language contrastive from the perspective of training objective, as both utilize the visual encoder for feature aligning, and thus is able to learn local semantics getting indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce the local semantic supervision into the text branch, which further improves the pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. Code will be released at https://github.com/LightDXY/MaskCLIP.
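To make the core idea in the abstract concrete (a teacher sees the full image, a student sees a masked image and predicts the teacher's representation at the hidden patches), here is a minimal pure-Python sketch. It is not the authors' implementation: the cosine-distance loss, the EMA-style teacher update, and the names `random_mask`, `distillation_loss`, and `ema_update` are all assumptions chosen for illustration, and per-patch features are plain lists of floats rather than real encoder outputs.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden from the student (hypothetical helper)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def distillation_loss(teacher_feats, student_preds, masked_idx):
    """Average distance at masked patches only.

    teacher_feats: per-patch features computed from the FULL image.
    student_preds: per-patch features predicted from the MASKED image.
    Assumption: cosine distance stands in for whatever loss the paper uses.
    """
    if not masked_idx:
        return 0.0
    total = sum(
        cosine_distance(teacher_feats[i], student_preds[i]) for i in masked_idx
    )
    return total / len(masked_idx)


def ema_update(teacher_weights, student_weights, momentum):
    """Assumed teacher update: exponential moving average of student weights."""
    return [
        momentum * t + (1.0 - momentum) * s
        for t, s in zip(teacher_weights, student_weights)
    ]
```

In a training loop one would call `random_mask` once per image, run the student on the visible patches, compute `distillation_loss` against the teacher's full-image features, and then refresh the teacher with `ema_update`. The loss is restricted to masked positions because those are the patches the student cannot simply copy from its input.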

