UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning
This work addresses the difficulty of evaluating diverse image captions, a problem relevant to both researchers and practitioners. The contribution is incremental, since it builds on existing contrastive learning methods and on improvements to existing benchmarks.
The paper tackles the problem of evaluating image captions without reference captions. It introduces UMIC, an unreferenced metric built on Vision-and-Language BERT and trained with contrastive learning, and shows across four datasets that UMIC correlates better than previous metrics that require multiple references.
Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate image captions without enough reference captions, because the same image can be described in many different ways. In this paper, we introduce a new metric, UMIC, an Unreferenced Metric for Image Captioning, which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. We also observe critical problems in the previous benchmark dataset (i.e., human annotations) for image captioning metrics, and introduce a new collection of human annotations on generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation with human judgments than all previous metrics that require multiple references. We release the benchmark dataset and the pre-trained models for computing UMIC.
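The idea of training a scorer to discriminate negative captions can be illustrated with a small sketch. The abstract does not specify the loss function, so the margin-based ranking loss below is an assumption, chosen as one common form of contrastive objective. The function name `margin_ranking_loss`, the `margin` value, and the example scores are hypothetical; in practice the scores would come from a model such as Vision-and-Language BERT applied to an image-caption pair.

```python
from typing import Sequence


def margin_ranking_loss(
    pos_score: float,
    neg_scores: Sequence[float],
    margin: float = 0.2,  # hypothetical value, not taken from the paper
) -> float:
    """Average hinge loss over negative captions.

    The loss is zero once the positive caption's score exceeds every
    negative caption's score by at least `margin`.
    """
    if not neg_scores:
        raise ValueError("need at least one negative caption score")
    # One hinge term per negative: penalize negatives scored too close
    # to (or above) the positive caption.
    losses = [max(0.0, margin - (pos_score - s)) for s in neg_scores]
    return sum(losses) / len(losses)


# Hypothetical scores for one image: the human-written caption and
# two corrupted (negative) captions.
well_separated = margin_ranking_loss(0.9, [0.1, 0.2])
poorly_separated = margin_ranking_loss(0.3, [0.8, 0.25])
print(well_separated, poorly_separated)
```

Minimizing a loss of this kind pushes the scorer to rank the correct caption above its corrupted variants, which is what allows the trained score to be used without reference captions at evaluation time.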