CV | AI | CL | MM | Jul 29, 2024

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

arXiv:2407.20341v1 | 15 citations | h-index: 66 | Has Code
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of evaluating image captions more accurately, a problem relevant to researchers and practitioners in computer vision and natural language processing. It represents an incremental improvement over existing metrics.

The paper tackles the problem of aligning the evaluation of machine-generated image captions with human judgment. It proposes BRIDGE, a learnable and reference-free metric that maps visual features into dense vectors and integrates them into multimodal pseudo-captions, achieving state-of-the-art results among reference-free metrics on several datasets.

Effectively aligning with human judgment when evaluating machine-generated image captions represents a complex yet intriguing challenge. Existing evaluation metrics like CIDEr or CLIP-Score fall short in this regard as they do not take into account the corresponding image or lack the capability of encoding fine-grained details and penalizing hallucinations. To overcome these issues, in this paper, we propose BRIDGE, a new learnable and reference-free image captioning metric that employs a novel module to map visual features into dense vectors and integrates them into multi-modal pseudo-captions which are built during the evaluation process. This approach results in a multimodal metric that properly incorporates information from the input image without relying on reference captions, bridging the gap between human judgment and machine-generated image captions. Experiments spanning several datasets demonstrate that our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores. Our source code and trained models are publicly available at: https://github.com/aimagelab/bridge-score.
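For context, the reference-free baseline that the abstract contrasts against, CLIP-Score, can be sketched in a few lines. This is not BRIDGE itself: it only illustrates the kind of image-caption similarity score that BRIDGE aims to improve on. The embeddings below are hypothetical toy vectors standing in for the output of a real image and text encoder, and the rescaling weight `w = 2.5` follows the original CLIP-Score formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_score(image_emb, caption_emb, w=2.5):
    """Reference-free CLIP-Score: w * max(cos(image, caption), 0)."""
    return w * max(cosine(image_emb, caption_emb), 0.0)

# Hypothetical toy embeddings (assumption: a real setup would obtain
# these from a pretrained image encoder and text encoder).
image_emb = [0.2, 0.9, 0.1]
good_caption_emb = [0.25, 0.85, 0.05]   # points in a similar direction
bad_caption_emb = [0.9, -0.3, 0.2]      # points in a different direction

good = clip_score(image_emb, good_caption_emb)
bad = clip_score(image_emb, bad_caption_emb)
print(f"good caption: {good:.3f}, bad caption: {bad:.3f}")
```

Because the score depends on a single global similarity between two embeddings, it has limited capacity to reflect fine-grained details or penalize hallucinated objects, which is the shortcoming the abstract attributes to this family of metrics.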

Code Implementations: 1 repo
