CL | AI | CV | Jun 30, 2025

EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations

arXiv:2506.24016v1 | 6 citations | h-index: 3 | Has Code | ACL
Originality: Highly original
AI Analysis

This work addresses the need of researchers and practitioners for explainable, reliable evaluation metrics in image captioning, though its contribution is an incremental improvement over existing metric frameworks.

The authors tackle the problem of unstandardized and unverified explanations in image captioning evaluation metrics by proposing EXPERT, a reference-free metric that provides structured explanations based on fluency, relevance, and descriptiveness. The method achieves state-of-the-art results on benchmark datasets and produces significantly higher-quality explanations than existing metrics, as validated through human evaluation.

Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
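
As a rough illustration of what "structured explanations" and a "two-stage evaluation template" could look like in practice, here is a minimal Python sketch. Only the three criterion names come from the abstract. Everything else is an assumption made for illustration: the `StructuredExplanation` container, the `scoring_prompt` and `explanation_prompt` helpers, the prompt wording, the order of the two stages, and the score format. None of it is taken from the EXPERT repository.

```python
from dataclasses import dataclass

# The three criteria named in the abstract.
CRITERIA = ("fluency", "relevance", "descriptiveness")


@dataclass
class StructuredExplanation:
    """Hypothetical container: one free-text comment per criterion."""
    fluency: str
    relevance: str
    descriptiveness: str


def scoring_prompt(caption: str) -> str:
    """Hypothetical first-stage prompt: ask the model only for a score."""
    return (
        "Rate the caption for the given image.\n"
        f"Caption: {caption}\n"
        "Score:"
    )


def explanation_prompt(caption: str, score: float) -> str:
    """Hypothetical second-stage prompt: ask for one comment per criterion."""
    lines = [f"Caption: {caption}", f"Score: {score:.2f}", "Explain the score:"]
    # One labelled slot per criterion keeps every explanation in the same shape.
    lines += [f"- {name.capitalize()}:" for name in CRITERIA]
    return "\n".join(lines)
```

Calling `explanation_prompt("a dog on a beach", 0.8)` yields a prompt with one labelled slot per criterion, which is the sense in which the explanation is "structured" in this sketch; the actual templates are in the authors' repository linked above.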

Code Implementations: 1 repo
