CV · May 4, 2016

Leveraging Visual Question Answering for Image-Caption Ranking

arXiv:1605.01379v2 · 16.588 citations

Originality: Incremental advance

AI Analysis

This work addresses image-caption ranking for applications such as image search and accessibility. It is an incremental advance: it adds VQA knowledge to an existing state-of-the-art image-caption ranking model.

The paper tackles image-caption ranking by using Visual Question Answering (VQA) as a feature extraction module whose outputs capture consistency between images and captions. This improves caption retrieval by 7.1% and image retrieval by 4.4% on the MSCOCO dataset.

Visual Question Answering (VQA) is the task of taking as input an image and a free-form natural language question about the image, and producing an accurate answer. In this work we view VQA as a "feature extraction" module to extract image and caption representations. We employ these representations for the task of image-caption ranking. Each feature dimension captures (imagines) whether a fact (question-answer pair) could plausibly be true for the image and caption. This allows the model to interpret images and captions from a wide variety of perspectives. We propose score-level and representation-level fusion models to incorporate VQA knowledge in an existing state-of-the-art VQA-agnostic image-caption ranking model. We find that incorporating and reasoning about consistency between images and captions significantly improves performance. Concretely, our model improves state-of-the-art on caption retrieval by 7.1% and on image retrieval by 4.4% on the MSCOCO dataset.
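The two fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the formulas, so the linear score mix, the concatenate-then-project step, the cosine similarity, and every name here (`alpha`, `W_img`, `W_cap`, `u_img`, `u_cap`) are assumptions chosen for illustration. The vectors `u_img` and `u_cap` stand in for VQA-grounded representations, where each dimension scores how plausible one question-answer pair is for the image or the caption.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def score_level_fusion(s_agnostic, s_vqa, alpha=0.5):
    """Assumed form: linear mix of a VQA-agnostic score and a VQA-grounded score."""
    return alpha * s_agnostic + (1.0 - alpha) * s_vqa


def representation_level_fusion(x_img, u_img, x_cap, u_cap, W_img, W_cap):
    """Assumed form: concatenate VQA-agnostic (x) and VQA-grounded (u) features,
    project each modality into a shared space, then compare with cosine similarity."""
    z_img = W_img @ np.concatenate([x_img, u_img])
    z_cap = W_cap @ np.concatenate([x_cap, u_cap])
    return cosine(z_img, z_cap)


def rank_candidates(scores):
    """Return candidate indices sorted from best to worst score."""
    return np.argsort(-np.asarray(scores))


# Toy usage with random stand-in features (hypothetical dimensions).
rng = np.random.default_rng(0)
x_img, x_cap = rng.normal(size=4), rng.normal(size=4)   # VQA-agnostic features
u_img, u_cap = rng.normal(size=6), rng.normal(size=6)   # per-fact plausibility features
W_img, W_cap = rng.normal(size=(3, 10)), rng.normal(size=(3, 10))

s_rep = representation_level_fusion(x_img, u_img, x_cap, u_cap, W_img, W_cap)
s_mix = score_level_fusion(cosine(x_img, x_cap), cosine(u_img, u_cap), alpha=0.5)
order = rank_candidates([0.1, 0.9, 0.5])
```

In a real system the projection matrices and the mixing weight would be learned from ranking data. Here they are random or fixed so the sketch runs on its own.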
