CV, AI, CL · Jun 10, 2024

FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model

arXiv:2406.06004v1 · 52 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for more interpretable and cost-effective evaluation in image captioning, though it is an incremental advance, as it builds on existing reference-free approaches.

The paper tackles the problem of evaluating image captions without expensive reference captions. It proposes FLEUR, an explainable reference-free metric that uses a large multimodal model to assign scores and explain them, and reports state-of-the-art results among reference-free metrics on Flickr8k-CF, COMPOSITE, and Pascal-50S, with high correlation to human judgment.

Most existing image captioning evaluation metrics focus on assigning a single numerical score to a caption by comparing it with reference captions. However, these methods do not provide an explanation for the assigned score. Moreover, reference captions are expensive to acquire. In this paper, we propose FLEUR, an explainable reference-free metric that introduces explainability into image captioning evaluation. By leveraging a large multimodal model, FLEUR can evaluate the caption against the image without the need for reference captions and provide an explanation for the assigned score. We introduce score smoothing to align as closely as possible with human judgment and to be robust to user-defined grading criteria. FLEUR achieves high correlations with human judgment across various image captioning evaluation benchmarks and reaches state-of-the-art results on Flickr8k-CF, COMPOSITE, and Pascal-50S within the domain of reference-free evaluation metrics. Our source code and results are publicly available at: https://github.com/Yebin46/FLEUR.
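The abstract names score smoothing without detailing how it works. One common way to smooth a score generated by a language model, offered here as an assumption rather than FLEUR's confirmed procedure, is to replace the single decoded score with its expected value under the model's output distribution over candidate scores. The sketch assumes you already have log-probabilities for candidate score values such as 0.0 to 1.0; the helper `smooth_score` and its input format are hypothetical. The authors' actual implementation is in the linked repository.

```python
import math
from typing import Mapping


def smooth_score(logprobs: Mapping[float, float]) -> float:
    """Expected score under the model's distribution over candidate scores.

    Hypothetical helper, not FLEUR's code: `logprobs` maps each candidate
    score value (e.g. 0.0, 0.1, ..., 1.0) to the log-probability the model
    assigns to emitting that score.
    """
    if not logprobs:
        raise ValueError("empty distribution")
    # Softmax over the candidates so the weights sum to 1
    # (subtract the max log-prob for numerical stability).
    m = max(logprobs.values())
    weights = {s: math.exp(lp - m) for s, lp in logprobs.items()}
    total = sum(weights.values())
    # Probability-weighted average instead of the single most likely score.
    return sum(s * w / total for s, w in weights.items())


if __name__ == "__main__":
    # The model prefers 0.7 but puts 40% of its mass on 0.8.
    dist = {0.7: math.log(0.6), 0.8: math.log(0.4)}
    print(round(smooth_score(dist), 3))  # 0.74
```

Greedy decoding would report 0.7 here; the expectation (0.74) keeps the information that the model leaned toward a higher grade, which is the kind of finer-grained signal smoothing is meant to recover.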

Code Implementations: 1 repo
