CV · AI · May 10, 2023

Combo of Thinking and Observing for Outside-Knowledge VQA

arXiv:2305.06407v1227 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses how visual question answering systems can combine visual features with textual knowledge. It is an incremental advance over existing approaches.

The paper tackles outside-knowledge visual question answering by constraining the cross-modality space to the natural-language space, which preserves visual features while still leveraging textual knowledge. The authors report a 6.17% accuracy improvement over the previous state of the art.

Outside-knowledge visual question answering is a challenging task that requires both acquiring and using open-ended real-world knowledge. Some existing solutions draw external knowledge into the cross-modality space, which overlooks the far larger body of textual knowledge in natural-language space. Others transform the image into text that is then fused with textual knowledge in natural-language space, which abandons visual features entirely. In this paper, we constrain the cross-modality space to be the same as the natural-language space, so that visual features are preserved directly and the model still benefits from the vast knowledge in natural-language space. To this end, we propose a novel framework consisting of a multimodal encoder, a textual encoder, and an answer decoder. This structure allows us to introduce more types of knowledge, including explicit and implicit multimodal and textual knowledge. Extensive experiments validate the proposed method, which outperforms the state of the art by 6.17% accuracy. We also conduct comprehensive ablations of each component and systematically study the roles of the different types of knowledge. Code and knowledge data can be found at https://github.com/PhoebusSi/Thinking-while-Observing.
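As a rough illustration of the three-part structure named in the abstract, the sketch below shows one way a decoder could read from a multimodal encoder and a textual encoder at once when both emit vectors in a shared space. It is not the authors' implementation. The stand-in encoder, the width `D_MODEL`, the example tokens, and the choice to concatenate the two encoder outputs into a single attention memory are all assumptions made for illustration.

```python
import math
import random

D_MODEL = 8  # hypothetical shared hidden width for both encoders


def fake_encoder(tokens, seed):
    """Stand-in for a real encoder (assumption): returns one pseudo-random
    vector of width D_MODEL per input token, ignoring token identity."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)] for _ in tokens]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """Single-head scaled dot-product attention of one query over memory."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, row)) / scale for row in memory]
    weights = softmax(scores)
    context = [
        sum(w * row[i] for w, row in zip(weights, memory))
        for i in range(len(query))
    ]
    return context, weights


# Hypothetical inputs: visual region placeholders, a question, retrieved text.
visual_regions = ["region_0", "region_1", "region_2"]
question = ["what", "sport", "is", "this"]
knowledge = ["tennis", "is", "played", "with", "a", "racket"]

# Multimodal encoder sees image regions plus the question.
multimodal_states = fake_encoder(visual_regions + question, seed=0)
# Textual encoder sees the question plus textual knowledge.
textual_states = fake_encoder(question + knowledge, seed=1)

# Because both outputs share one width, the decoder can attend over their
# concatenation (an assumed fusion strategy).
memory = multimodal_states + textual_states
decoder_query = [0.1] * D_MODEL  # placeholder decoder state
context, weights = cross_attend(decoder_query, memory)
```

The point of the sketch is only that a shared representation space lets one decoder attend to visual and textual evidence side by side. A real system would replace `fake_encoder` with trained models and run a full autoregressive decoder.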

Code Implementations: 1 repo