CV · CL · Apr 18, 2021

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

arXiv:2104.08718v3 · 2843 citations
Originality: Incremental advance
AI Analysis

CLIPScore evaluates image captions in a way that is closer to how humans judge them, without needing reference captions. The advance is incremental because it builds on the existing CLIP model.

The paper addresses how to evaluate image captions without human-written references. It introduces CLIPScore, a reference-free metric based on CLIP that achieves the highest correlation with human judgments and outperforms reference-based metrics such as CIDEr and SPICE. A reference-augmented version, RefCLIPScore, improves this correlation further.

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.
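
To make the two metrics concrete, here is a minimal sketch of how they could be computed from precomputed CLIP embeddings. The abstract does not give the formulas; the sketch assumes the definitions commonly quoted from the paper: CLIPScore is a rescaled, clipped cosine similarity between the caption and image embeddings (with weight w = 2.5), and RefCLIPScore is the harmonic mean of CLIPScore and the clipped maximum cosine similarity between the caption and any reference. The function names and the toy vectors are hypothetical, and the step that actually embeds images and text with CLIP is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(caption_emb, image_emb, w=2.5):
    """Reference-free score: w * max(cos(caption, image), 0).

    w = 2.5 is the assumed rescaling constant; negative similarities
    are clipped to zero.
    """
    return w * max(cosine(caption_emb, image_emb), 0.0)


def ref_clip_score(caption_emb, image_emb, reference_embs, w=2.5):
    """Reference-augmented score: harmonic mean of the reference-free
    score and the best (clipped) caption-to-reference similarity."""
    clip_s = clip_score(caption_emb, image_emb, w)
    ref_sim = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_sim == 0.0:
        return 0.0  # harmonic mean is zero if either term is zero
    return 2 * clip_s * ref_sim / (clip_s + ref_sim)


# Toy 2-D vectors standing in for real CLIP embeddings (illustration only).
caption = [1.0, 0.0]
image = [0.8, 0.6]
references = [[0.6, 0.8], [1.0, 0.0]]
print(clip_score(caption, image))
print(ref_clip_score(caption, image, references))
```

The reference-free score depends only on image-text compatibility, while the harmonic mean in the second function pulls the score down when the caption matches the image but none of the references, which is one way to read the abstract's claim that the two signals are complementary.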

Code Implementations: 3 repos