CV CL MM

May 3, 2017

FOIL it! Find One mismatch between Image and Language caption

Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurelie Herbelot, Moin Nabi, Enver Sangineto, Raffaella Bernardi

arXiv:1705.01359v1

158 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for fine-grained evaluation of multimodal understanding in AI, highlighting a critical gap in model capabilities.

The paper asks whether language and vision models truly understand cross-modal interactions. It introduces FOIL-COCO, a dataset that pairs images with both correct captions and subtly incorrect "foil" captions, and shows that current models perform poorly at distinguishing correct from foil captions, while humans achieve near-perfect performance.

In this paper, we aim to understand whether current language and vision (LaVi) models truly grasp the interaction between the two modalities. To this end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates images with both correct and "foil" captions, that is, descriptions of the image that are highly similar to the original ones, but contain one single mistake ("foil word"). We show that current LaVi models fall into the traps of this data and perform badly on three tasks: a) caption classification (correct vs. foil); b) foil word detection; c) foil word correction. Humans, in contrast, have near-perfect performance on those tasks. We demonstrate that merely utilising language cues is not enough to model FOIL-COCO and that it challenges the state-of-the-art by requiring a fine-grained understanding of the relation between text and image.
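
To make the notion of a foil caption concrete, here is a minimal sketch of replacing exactly one word in a caption and checking that only one token changed. It is not the authors' generation pipeline: the `FOIL_PAIRS` table, the `CaptionPair` class, and the functions `make_foil` and `count_mismatches` are hypothetical names introduced for illustration, and whitespace tokenisation is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical word pairs for illustration only; not the paper's replacement list.
FOIL_PAIRS: Dict[str, str] = {"dog": "cat", "bicycle": "motorcycle", "car": "truck"}


@dataclass
class CaptionPair:
    original: str    # correct caption
    foil: str        # caption with exactly one word replaced
    foil_index: int  # token position of the foil word


def make_foil(caption: str, pairs: Dict[str, str] = FOIL_PAIRS) -> Optional[CaptionPair]:
    """Swap the first token found in `pairs`; return None if no token matches."""
    tokens = caption.split()
    for i, token in enumerate(tokens):
        if token in pairs:
            foiled = tokens.copy()
            foiled[i] = pairs[token]
            return CaptionPair(caption, " ".join(foiled), i)
    return None


def count_mismatches(original: str, foil: str) -> int:
    """Count token positions where two equal-length captions differ."""
    a, b = original.split(), foil.split()
    if len(a) != len(b):
        raise ValueError("captions must have the same number of tokens")
    return sum(x != y for x, y in zip(a, b))


pair = make_foil("a dog rides a bicycle")
if pair is not None:
    # The foil differs from the original in a single token, at `foil_index`.
    print(pair.foil, pair.foil_index, count_mismatches(pair.original, pair.foil))
```

In this toy setting the stored `foil_index` plays the role of the ground truth for foil word detection, and the original word at that position is the target for foil word correction. A model evaluated on the three tasks would see only the image and one caption, not the original/foil pair together.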
