CL | CV | Sep 8, 2019

Quality Estimation for Image Captions Based on Large-scale Human Evaluations

arXiv:1909.03396v2738 citations
AI Analysis

This paper addresses the unreliability of captions produced by state-of-the-art models, a problem that directly affects users of image captioning systems. The contribution is incremental, however, because it builds on existing QE tasks.

The paper tackles the problem of automatically estimating the quality of image captions without ground-truth references. It builds a large-scale human evaluation dataset of more than 600k ratings and shows that models trained on these coarse ratings can effectively detect and filter out low-quality captions.

Automatic image captioning has improved significantly over the last few years, but the problem is far from being solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which attempts to model the caption quality from a human perspective and without access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions produced on previously unseen images. For this task, we develop a human evaluation process that collects coarse-grained caption annotations from crowdsourced users, which is then used to collect a large-scale dataset spanning more than 600k caption quality ratings. We then carefully validate the quality of the collected ratings and establish baseline models for this new QE task. Finally, we further collect fine-grained caption quality annotations from trained raters, and use them to demonstrate that QE models trained over the coarse ratings can effectively detect and filter out low-quality image captions, thereby improving the user experience from captioning systems.
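The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes a QE model that maps an (image, caption) pair to a score in [0, 1] and a fixed acceptance threshold; neither detail is specified in the text above. The names `Caption`, `filter_captions`, `lookup_score`, and the scores in `HYPOTHETICAL_SCORES` are invented for illustration and are not the authors' model or data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Caption:
    image_id: str
    text: str


def filter_captions(
    captions: List[Caption],
    score_fn: Callable[[Caption], float],
    threshold: float,
) -> List[Tuple[Caption, float]]:
    """Keep captions whose estimated quality is at least `threshold`.

    No ground-truth reference captions are needed: `score_fn` sees only
    the caption (and, in a real system, the image it describes).
    """
    scored = [(caption, score_fn(caption)) for caption in captions]
    return [(caption, score) for caption, score in scored if score >= threshold]


# Stand-in for a trained QE model: made-up scores keyed by image id.
# A real model would compute a score from the image and the caption text.
HYPOTHETICAL_SCORES = {"img1": 0.91, "img2": 0.12, "img3": 0.55}


def lookup_score(caption: Caption) -> float:
    return HYPOTHETICAL_SCORES[caption.image_id]


captions = [
    Caption("img1", "a dog running on a beach"),
    Caption("img2", "a close up of a sign"),
    Caption("img3", "two people sitting at a table"),
]

# The threshold of 0.5 is an assumption chosen for this example.
kept = filter_captions(captions, lookup_score, threshold=0.5)
for caption, score in kept:
    print(f"{caption.image_id}: {caption.text} (score={score:.2f})")
```

Raising the threshold trades coverage for quality: fewer captions are shown, but those that remain are more likely to be acceptable to users.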

Code Implementations: 1 repo