CL, CV | Sep 4, 2019

TIGEr: Text-to-Image Grounding for Image Caption Evaluation

arXiv:1909.02050v11020 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more reliable evaluation metrics in image captioning research. The advance is incremental because it builds on existing text-image grounding models.

The paper tackles the problem of biased automatic evaluation in image captioning by introducing TIGEr, a metric that uses text-image grounding to assess caption quality against both the image content and human-written captions. In the authors' tests, it agrees with human judgments more consistently than existing metrics.

This paper presents a new metric called TIGEr for the automatic evaluation of image captioning systems. Popular metrics, such as BLEU and CIDEr, are based solely on text matching between reference captions and machine-generated captions, potentially leading to biased evaluations because references may not fully cover the image content and natural language is inherently ambiguous. Building upon a machine-learned text-image grounding model, TIGEr evaluates caption quality based not only on how well a caption represents image content, but also on how well machine-generated captions match human-generated captions. Our empirical tests show that TIGEr has a higher consistency with human judgments than alternative existing metrics. We also comprehensively assess the metric's effectiveness in caption evaluation by measuring the correlation between human judgments and metric scores.
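The abstract's final step, measuring correlation between human judgments and metric scores, can be illustrated with a rank correlation. The sketch below uses Kendall's tau-a as one common choice; the abstract does not say which coefficient the paper uses, so treat that as an assumption. The function name `kendall_tau_a` and all numbers are hypothetical and made up for illustration. They are not data from the paper and this is not TIGEr itself.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both rankings order this pair the same way
        elif sign < 0:
            discordant += 1  # the rankings disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five captions (NOT from the paper):
# a human quality rating and the score some automatic metric assigned.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.35, 0.30, 0.60, 0.90]

tau = kendall_tau_a(human_ratings, metric_scores)
print(f"Kendall tau-a = {tau:.2f}")  # prints "Kendall tau-a = 0.80"
```

Here the metric swaps the order of only one caption pair out of ten, so the coefficient is (9 - 1) / 10. A metric that tracks human judgments more consistently would score closer to 1 on the same captions.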

Code Implementations: 1 repo