CV AI CL MM | Jul 29, 2024

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

arXiv:2407.20341v1 | 14.715 citations | h-index: 66 | Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of evaluating image captions more accurately, a problem relevant to researchers and practitioners in computer vision and natural language processing. It represents an incremental improvement over existing metrics.

The paper tackles the problem of aligning the automatic evaluation of machine-generated image captions with human judgment. It proposes BRIDGE, a learnable, reference-free metric that maps visual features into dense vectors and integrates them into multimodal pseudo-captions, achieving state-of-the-art results among reference-free evaluation scores on several datasets.

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
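For context, the reference-free baseline named in the abstract, CLIP-Score, is commonly defined as a rescaled, clipped cosine similarity between an image embedding and a caption embedding: w * max(cos(v, c), 0), with w = 2.5. The sketch below illustrates only that baseline formula, not BRIDGE itself. The function names and the toy two-dimensional embeddings are assumptions made for illustration; in practice the embeddings would come from a CLIP image encoder and text encoder.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score_style(image_emb: Sequence[float],
                     caption_emb: Sequence[float],
                     w: float = 2.5) -> float:
    """CLIP-Score-style reference-free score: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


if __name__ == "__main__":
    # Toy embeddings, invented for illustration only.
    image = [1.0, 0.0]
    matching_caption = [1.0, 0.0]
    unrelated_caption = [0.0, 1.0]
    print(clip_score_style(image, matching_caption))
    print(clip_score_style(image, unrelated_caption))
```

A score of this kind depends only on a single global similarity between the two embeddings, which is one way to read the abstract's remark that such metrics lack the capability of encoding fine-grained details and penalizing hallucinations.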

