CL CV May 22, 2018

Joint Image Captioning and Question Answering

arXiv:1805.08389v11.914 citations

Originality: Incremental advance

AI Analysis

This work aims to improve VQA and image captioning performance, but the advance is incremental because it builds on existing tasks and datasets.

The paper tackles two problems: visual question answering (VQA) systems struggle to learn from answer supervision alone, and image captioning systems produce captions that lack diversity. It proposes a joint system that generates question-related captions and feeds them to the VQA model as additional features, which also makes the captions more informative. Reported VQA accuracy on VQA v2 is 65.8% with generated captions and 69.1% with annotated captions on the validation set, and 68.4% on the test-standard set.

Answering visual questions requires acquiring everyday common knowledge and modeling the semantic connections among different parts of an image, which is too difficult for VQA systems to learn from images with answers as the only supervision. Meanwhile, image captioning systems that use beam search tend to generate similar captions and fail to describe images diversely. To address these issues, we present a system in which the two tasks complement each other and which jointly produces image captions and answers visual questions. In particular, we use question and image features to generate question-related captions and use the generated captions as additional features to provide new knowledge to the VQA system. For image captioning, our system attains more informative results in terms of the relative improvements on VQA tasks, as well as competitive results on automated metrics. Applied to VQA, our system achieves 65.8% on the VQA v2 validation set using generated captions and 69.1% using annotated captions, and 68.4% on the test-standard set. Further, an ensemble of 10 models achieves 69.7% on the test-standard split.
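
The idea of treating caption features as extra evidence for the answer classifier, and of averaging several models into an ensemble, can be illustrated with a toy sketch. This is not the architecture from the paper: the fusion rule (elementwise product of image and question features plus the caption features), the function names, and the weight layout are all assumptions chosen to keep the example small and dependency-free.

```python
import math


def softmax(scores):
    """Turn raw answer scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(image_feat, question_feat, caption_feat):
    """Hypothetical fusion rule, assumed for illustration only:
    combine image and question features by elementwise product,
    then add the caption features as additional evidence."""
    return [i * q + c for i, q, c in zip(image_feat, question_feat, caption_feat)]


def answer_probs(joint_feat, answer_weights):
    """Linear answer classifier: one weight row per candidate answer."""
    scores = [sum(w * x for w, x in zip(row, joint_feat)) for row in answer_weights]
    return softmax(scores)


def ensemble(prob_lists):
    """Average the answer distributions predicted by several models."""
    n = len(prob_lists)
    return [sum(col) / n for col in zip(*prob_lists)]


if __name__ == "__main__":
    # Toy 2-dimensional features and two candidate answers.
    joint = fuse([1.0, 2.0], [0.5, 0.5], [0.0, 1.0])
    weights = [[1.0, 0.0], [0.0, 1.0]]
    print(answer_probs(joint, weights))
    print(ensemble([[0.2, 0.8], [0.4, 0.6]]))
```

In this sketch the caption vector shifts the joint representation before classification, which is one simple way to read "additional features"; a real system would learn the fusion and the caption encoder jointly.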
