EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
This work addresses the need of researchers and practitioners for explainable and reliable evaluation metrics in image captioning, though it is an incremental improvement over existing metric frameworks.
The authors tackle the problem of unstandardized and unverified explanations in image captioning evaluation metrics by proposing EXPERT, a reference-free metric that provides structured explanations based on fluency, relevance, and descriptiveness. The method achieves state-of-the-art results on benchmark datasets and produces significantly higher-quality explanations than existing metrics, as validated through human evaluation.
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
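To make the idea of a structured explanation and a two-stage template concrete, here is a minimal sketch in Python. It is not taken from the EXPERT repository: the class `StructuredExplanation`, the functions `scoring_prompt` and `explanation_prompt`, the prompt wording, and the ordering of the stages (score first, explanation second) are all assumptions made for illustration. Only the three criterion names come from the abstract.

```python
from dataclasses import dataclass

# The three criteria named in the abstract.
CRITERIA = ("fluency", "relevance", "descriptiveness")


@dataclass
class StructuredExplanation:
    """Hypothetical container: one free-text judgment per criterion."""
    fluency: str
    relevance: str
    descriptiveness: str

    def render(self) -> str:
        # One labeled line per criterion, in a fixed order.
        return "\n".join(
            f"{name.capitalize()}: {getattr(self, name)}" for name in CRITERIA
        )


def scoring_prompt(caption: str) -> str:
    """Stage 1 (assumed): ask the vision-language model for a score only."""
    return (
        "Rate how well the caption describes the image.\n"
        f"Caption: {caption}\n"
        "Score:"
    )


def explanation_prompt(caption: str, score: float) -> str:
    """Stage 2 (assumed): ask for an explanation conditioned on the score."""
    return (
        f"Caption: {caption}\n"
        f"Score: {score:.2f}\n"
        f"Explain the score in terms of {', '.join(CRITERIA)}."
    )


if __name__ == "__main__":
    caption = "A dog runs across a grassy field."
    print(scoring_prompt(caption))
    print(explanation_prompt(caption, 0.8))
    print(StructuredExplanation(
        fluency="Grammatical and natural.",
        relevance="Matches the main subject of the image.",
        descriptiveness="Omits background details.",
    ).render())
```

Fixing the criteria in a single tuple keeps every explanation in the same shape, which is the property the abstract contrasts with metrics that explain without standardized criteria. The actual templates and training data are in the authors' repository linked above.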