Generating Question Relevant Captions to Aid Visual Question Answering
This work addresses the problem of improving VQA accuracy for AI systems by leveraging connections between vision and language. Its contribution is an incremental advance built on a novel joint training approach.
The paper improves visual question answering (VQA) by generating captions targeted to help answer a specific visual question, achieving state-of-the-art performance of 68.4% on the VQA v2 Test-standard set with a single model.
Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g., 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions.
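The abstract does not spell out how the online gradient-based method decides which caption is question-relevant. As a rough illustration only, the sketch below assumes one plausible criterion: among the human captions for an image, pick the one whose loss gradient on a shared representation best agrees (largest inner product) with the gradient from the answer loss. The function name `select_relevant_caption`, the `threshold` parameter, and the inner-product criterion itself are assumptions made for this example and are not taken from the text above.

```python
from typing import List, Optional, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_relevant_caption(
    answer_grad: Sequence[float],
    caption_grads: List[Sequence[float]],
    threshold: float = 0.0,
) -> Optional[int]:
    """Return the index of the caption whose gradient best agrees with
    the answer gradient, or None if no caption exceeds `threshold`.

    Hypothetical criterion (an assumption, not the paper's stated rule):
    agreement is the inner product of the two gradient vectors.
    """
    best_index = None
    best_score = threshold
    for i, grad in enumerate(caption_grads):
        score = dot(answer_grad, grad)
        if score > best_score:
            best_score = score
            best_index = i
    return best_index


# Toy gradients on a 2-dimensional shared representation.
answer_grad = [1.0, 0.0]
caption_grads = [[-1.0, 0.0], [0.5, 0.5], [2.0, -1.0]]
chosen = select_relevant_caption(answer_grad, caption_grads)
```

In this toy setup the third caption is selected because its gradient points most strongly in the same direction as the answer gradient. A caption whose gradient opposes the answer gradient is never chosen, and if every caption falls at or below the threshold the function returns `None`, so such an example would contribute no caption supervision.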