CV Jun 26, 2023

Self-Supervised Image Captioning with CLIP

arXiv:2306.15111v23.93 citations h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses the difficulty of obtaining high-quality image-caption pairs in many domains, offering a more data-efficient approach to vision-language tasks.

The paper tackles image captioning's reliance on large labeled datasets by introducing a self-supervised method that uses CLIP to improve image-caption relevance. With less than 2% of the labeled COCO data, the method performs comparably to state-of-the-art models trained on the full dataset, and human evaluators rate its captions as more distinctive and informative.

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which can be hard to obtain for many domains. To address this, we introduce a self-supervised image captioning method. After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance between images and generated captions. Remarkably, despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset. Human evaluations further reveal that our method produces captions with greater distinctiveness and informativeness, two attributes inherently challenging to achieve through supervised learning.
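The abstract does not spell out how CLIP relevance is computed or how it enters training. As a rough illustration only, CLIP-style relevance is commonly measured as the cosine similarity between an image embedding and a text embedding. The sketch below assumes that measure and uses small hand-written vectors in place of real CLIP encoder outputs; the function names `cosine_similarity` and `pick_most_relevant` are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_most_relevant(image_embedding, caption_embeddings):
    """Return (index, score) of the candidate caption most similar to the image."""
    scores = [cosine_similarity(image_embedding, c) for c in caption_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy stand-ins for embeddings (assumed values, not real CLIP output).
image = [1.0, 0.0, 1.0]
captions = [
    [1.0, 0.0, 0.9],  # points in nearly the same direction as the image
    [0.0, 1.0, 0.0],  # orthogonal to the image
]

best_index, best_score = pick_most_relevant(image, captions)
```

In a self-supervised setup of the kind the abstract describes, a score like `best_score` could serve as the signal that favors generated captions more relevant to the image; how the authors actually use it during training is not stated here.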
