CV · AI · CL · Mar 18, 2025

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives

arXiv:2503.14604v2 · 17 citations · h-index: 30 · IJCAI
Originality: Synthesis-oriented
AI Analysis

This paper addresses the problem of robust evaluation for image captioning in the age of MLLMs. Its contribution is incremental: it surveys and critiques existing approaches without introducing new methods.

This survey tackles the challenge of evaluating image captions generated by Multimodal LLMs. It analyzes the strengths and limitations of existing metrics along dimensions such as correlation with human judgment and sensitivity to hallucinations, and it suggests directions for future research.

The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
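Two of the dimensions named in the abstract, correlation with human judgment and ranking accuracy, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the survey's protocol: the function names, the choice of the Kendall tau-a variant, and the toy scores are all hypothetical, and the abstract does not state which correlation coefficient or datasets the authors use.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall tau-a between metric scores and human ratings.

    Computed as (concordant - discordant) / total pairs.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two captions")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: metric and humans order the pair the same way.
        sign = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def pairwise_ranking_accuracy(pairs):
    """Fraction of caption pairs where the metric agrees with the human choice.

    Each element of `pairs` is (score of the human-preferred caption,
    score of the other caption), both produced by the metric under test.
    """
    if not pairs:
        raise ValueError("need at least one caption pair")
    agreed = sum(1 for preferred, other in pairs if preferred > other)
    return agreed / len(pairs)


# Hypothetical toy data: four captions scored by a metric and by annotators.
metric = [0.9, 0.4, 0.7, 0.1]
human = [5, 4, 2, 1]
tau = kendall_tau_a(metric, human)

# Hypothetical toy data: four pairs, the metric agrees with humans on three.
accuracy = pairwise_ranking_accuracy(
    [(0.8, 0.3), (0.2, 0.6), (0.7, 0.5), (0.9, 0.1)]
)
```

A higher tau means the metric orders captions more like human raters do, while ranking accuracy asks the narrower question of whether the metric picks the human-preferred caption in a head-to-head comparison. A production evaluation would likely use a tie-aware variant from a statistics library rather than this hand-rolled version.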

Code Implementations: 1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
