CL CV HC Jun 15, 2020

On the use of human reference data for evaluating automatic image descriptions

arXiv:2006.08792v1 0.2

Originality Synthesis-oriented

AI Analysis

The paper addresses a foundational problem in computer vision and NLP: how to evaluate image descriptions intended for visually impaired users. Its contribution is incremental, since it critiques existing practices without proposing a new solution.

The paper argues that current human-generated image description datasets are not of sufficient quality to evaluate automatic image description systems, whose evaluation relies on similarity metrics such as BLEU. It calls for better annotation guidelines and for alternative evaluation methods.

Automatic image description systems are commonly trained and evaluated using crowdsourced, human-generated image descriptions. The best-performing system is then determined using some measure of similarity to the reference data (BLEU, METEOR, CIDEr, etc.). Thus, both the quality of the systems and the quality of the evaluation depend on the quality of the descriptions. As Section 2 will show, the quality of current image description datasets is insufficient. I argue that there is a need for more detailed guidelines that take into account the needs of visually impaired users, but also the feasibility of generating suitable descriptions. With high-quality data, evaluation of image description systems could use reference descriptions, but we should also look for alternatives.
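To make the dependence on reference quality concrete, here is a minimal sketch of a reference-based similarity score. It is not the paper's method and not full BLEU: it computes only clipped unigram precision, the simplest ingredient of BLEU, with naive whitespace tokenization. The function name and the example sentences are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, references: list[str]) -> float:
    """Share of candidate tokens that also appear in a reference, with clipped counts.

    Illustrative only: real BLEU also uses higher-order n-grams and a brevity penalty.
    """
    cand_counts = Counter(candidate.lower().split())
    if not cand_counts:
        return 0.0

    # For each token, keep the highest count seen in any single reference.
    max_ref_counts: Counter = Counter()
    for ref in references:
        for tok, n in Counter(ref.lower().split()).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)

    # A candidate token is credited at most as often as it occurs in a reference.
    matched = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / sum(cand_counts.values())


# Hypothetical system output and two hypothetical reference sets for the same image.
candidate = "a man rides a horse"
good_refs = ["a man rides a brown horse", "a person riding a horse on a beach"]
poor_refs = ["nice picture", "horse"]

score_good = clipped_unigram_precision(candidate, good_refs)
score_poor = clipped_unigram_precision(candidate, poor_refs)
print(score_good, score_poor)
```

The same system output scores very differently depending only on which references it is compared against, which is the sense in which the quality of the evaluation is bounded by the quality of the descriptions.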
