Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering
This work addresses the challenge of answering more general questions in VQA by making better use of external knowledge, though the contribution is incremental because it builds on existing KB-VQA methods.
The paper tackles Knowledge-based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visible image contents. It proposes a framework that generates multiple clues for reasoning with memory neural networks and reports better performance than other methods on two benchmarks.
Visual Question Answering (VQA) has emerged as one of the most challenging tasks in artificial intelligence due to its multi-modal nature. However, most existing VQA methods cannot handle Knowledge-based Visual Question Answering (KB-VQA), which requires external knowledge beyond the visible contents to answer questions about a given image. To address this issue, we propose a novel framework that enables the model to answer more general questions and makes better use of external knowledge by generating Multiple Clues for Reasoning with Memory Neural Networks (MCR-MemNN). Specifically, a well-defined detector is adopted to predict image-question related relation phrases, each of which delivers two complementary clues for retrieving the supporting facts from an external knowledge base (KB). These facts are then encoded into a continuous embedding space using a content-addressable memory. Afterwards, mutual interactions between the visual-semantic representation and the supporting facts stored in memory are captured to distill the most relevant information across three modalities (i.e., image, question, and KB). Finally, the answer is predicted by choosing the supporting fact with the highest score. We conduct extensive experiments on two widely used benchmarks. The experimental results demonstrate the effectiveness of MCR-MemNN, as well as its superiority over other KB-VQA methods.
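To make the memory read and the final selection step more concrete, the following is a minimal sketch of a content-addressable memory lookup followed by picking the highest-scoring supporting fact. It is not the MCR-MemNN implementation: the function names (`read_memory`, `pick_answer`), the single-hop additive query update, the dot-product scoring, and the toy three-dimensional embeddings and facts are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, keys, values):
    """Content-addressable read: weight each value by query-key similarity."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    summary = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, summary


def pick_answer(query, facts):
    """Score every supporting fact and return the answer of the best one."""
    keys = [f["key"] for f in facts]
    values = [f["value"] for f in facts]
    weights, summary = read_memory(query, keys, values)
    # Assumed single-hop update: add the memory summary to the query.
    updated = [q + s for q, s in zip(query, summary)]
    scores = [dot(updated, k) for k in keys]
    best = max(range(len(facts)), key=lambda i: scores[i])
    return facts[best]["answer"], weights


# Hypothetical supporting facts with hand-made embeddings.
facts = [
    {"key": [1.0, 0.0, 0.0], "value": [1.0, 0.0, 0.0], "answer": "umbrella"},
    {"key": [0.0, 1.0, 0.0], "value": [0.0, 1.0, 0.0], "answer": "frisbee"},
    {"key": [0.0, 0.0, 1.0], "value": [0.0, 0.0, 1.0], "answer": "surfboard"},
]

# Stand-in for a joint image-question (visual-semantic) representation.
query = [0.2, 0.9, 0.1]

answer, weights = pick_answer(query, facts)
print(answer)
```

In this toy setup the query is closest to the second fact's key, so that fact receives the largest attention weight and its answer is selected. In a real system the keys, values, and query would come from learned encoders rather than hand-written vectors.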