AI · Jun 27, 2024

Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA

arXiv:2406.18839v1 · 4.22 citations

Originality: Incremental advance

AI Analysis

The paper addresses the challenge that multi-hop questions pose for VQA systems. The contribution is incremental: it builds on existing methods and reports modest gains.

The paper tackles knowledge-based visual question answering by decomposing complex questions into simpler ones, which improves both information extraction from images and image comprehension. It reports up to a 2% accuracy improvement on OKVQA, A-OKVQA, and KRVQA.

We study the knowledge-based visual question-answering (KB-VQA) problem, in which a model must ground a given question in the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize the image and Large Language Models (LLMs) to solve the VQA problem, reported results show that they perform poorly on multi-hop questions. Our study shows that replacing a complex question with several simpler questions helps extract more relevant information from the image and yields a stronger comprehension of it. Moreover, we analyze the decomposed questions to determine the modality of the information required to answer them, using a captioner for the visual questions and LLMs as a general knowledge source for the non-visual, knowledge-based questions. Our results demonstrate the positive impact of using simple questions before retrieving visual or non-visual information. We provide results and analysis on three well-known VQA datasets, OKVQA, A-OKVQA, and KRVQA, and achieve up to a 2% improvement in accuracy.
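The routing idea in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Every name here (`answer_by_decomposition`, `build_context`, and the `toy_*` stand-ins) is hypothetical, and the final step of joining sub-answers into a text context is an assumption the abstract does not spell out. In a real system the decomposer, modality classifier, captioner, and LLM would be actual models. Here they are hard-coded stubs so the control flow can be run.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class SubAnswer:
    question: str
    modality: str  # "visual" or "knowledge"
    answer: str


def answer_by_decomposition(
    question: str,
    image: Any,
    decompose: Callable[[str], List[str]],
    classify_modality: Callable[[str], str],
    captioner: Callable[[Any, str], str],
    llm: Callable[[str], str],
) -> List[SubAnswer]:
    """Split a complex question and route each simpler question by modality."""
    results = []
    for sub_q in decompose(question):
        modality = classify_modality(sub_q)
        if modality == "visual":
            # Visual sub-questions are answered from the image.
            answer = captioner(image, sub_q)
        else:
            # Non-visual sub-questions use the LLM as a knowledge source.
            answer = llm(sub_q)
        results.append(SubAnswer(sub_q, modality, answer))
    return results


def build_context(sub_answers: List[SubAnswer]) -> str:
    """Join sub-answers into text for a final answering step (assumed)."""
    return "\n".join(f"Q: {s.question} A: {s.answer}" for s in sub_answers)


# Hypothetical stand-ins with hard-coded behavior, for illustration only.
def toy_decompose(question: str) -> List[str]:
    return ["What animal is in the image?", "What do zebras eat?"]


def toy_classify(sub_question: str) -> str:
    return "visual" if "image" in sub_question else "knowledge"


def toy_captioner(image: Any, sub_question: str) -> str:
    return "a zebra"


def toy_llm(sub_question: str) -> str:
    return "grass"


if __name__ == "__main__":
    subs = answer_by_decomposition(
        "What does the animal in the image eat?",
        None,  # no real image is needed for the stubs
        toy_decompose,
        toy_classify,
        toy_captioner,
        toy_llm,
    )
    print(build_context(subs))
```

Passing the four components in as callables keeps the routing logic independent of any particular captioner or LLM, so each piece can be swapped or evaluated on its own.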

