CL CV Sep 21, 2023

ContextRef: Evaluating Referenceless Metrics For Image Description Generation

Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber

arXiv:2309.11710v11.76 citations h-index: 58 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for reliable referenceless metrics in image description generation. The advance is incremental, since it builds on prior work on the importance of context.

The paper introduces ContextRef, a benchmark for evaluating referenceless metrics for image description generation. ContextRef combines human ratings with robustness checks to assess how well each metric aligns with human preferences. None of the existing methods succeeds on it, although fine-tuning yields improvements.

Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence.
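
As a rough illustration of what a referenceless metric computes, the sketch below uses the standard CLIPScore form, w * max(cos(image, text), 0) with w = 2.5, and then checks how well a set of metric scores tracks human ratings. It assumes the image and text embeddings have already been produced by a pretrained vision-language model. The function names, the toy vectors, and the choice of Pearson correlation as the agreement measure are illustrative assumptions, not ContextRef's evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clipscore(image_emb, text_emb, w=2.5):
    """CLIPScore-style value: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, text_emb), 0.0)


def pearson(xs, ys):
    """Pearson correlation between metric scores and human ratings."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Made-up embeddings and ratings, for illustration only.
    image = [0.2, 0.9, 0.1]
    descriptions = [[0.1, 0.8, 0.2], [0.9, 0.1, 0.0], [0.3, 0.7, 0.4]]
    human_ratings = [5.0, 1.0, 4.0]
    scores = [clipscore(image, d) for d in descriptions]
    print(scores, pearson(scores, human_ratings))
```

Note that this score depends only on the image and the description. A context-aware variant would also need the surrounding text as input, which is the dimension the abstract identifies as the main remaining challenge.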
