Cross-modal Retrieval for Knowledge-based Visual Question Answering
This work aims to improve visual question answering about named entities. The contribution is incremental, as it builds on existing retrieval methods.
The paper tackles the challenge of recognizing named entities in knowledge-based visual question answering by using cross-modal retrieval to bridge the semantic gap between entities and their visual depictions. It shows that combining mono- and cross-modal retrieval with a CLIP-based model is competitive with billion-parameter models on three datasets, while being conceptually simpler and computationally cheaper.
Knowledge-based Visual Question Answering about Named Entities is a challenging task that requires retrieving information from a multimodal Knowledge Base. Named entities have diverse visual representations and are therefore difficult to recognize. We argue that cross-modal retrieval may help bridge the semantic gap between an entity and its depictions, and is foremost complementary with mono-modal retrieval. We provide empirical evidence through experiments with a multimodal dual encoder, namely CLIP, on the recent ViQuAE, InfoSeek, and Encyclopedic-VQA datasets. Additionally, we study three different strategies to fine-tune such a model: mono-modal, cross-modal, or joint training. Our method, which combines mono- and cross-modal retrieval, is competitive with billion-parameter models on the three datasets, while being conceptually simpler and computationally cheaper.
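
To illustrate what combining mono- and cross-modal retrieval could look like, here is a minimal sketch. The abstract does not state how the two retrieval scores are combined, so the weighted sum below is an assumption, as are the names `rank_entities` and `alpha` and the toy three-dimensional vectors. In practice the embeddings would come from a dual encoder such as CLIP: the query image and the entity's reference image through the image encoder, the entity's name through the text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_entities(query_image, kb, alpha=0.5):
    """Rank knowledge-base entities for a query image embedding.

    kb is a list of (name, image_embedding, name_text_embedding) tuples.
    alpha weights the mono-modal (image-image) score against the
    cross-modal (image-text) score. The weighted sum is an assumed
    fusion rule, not one taken from the paper.
    """
    scored = []
    for name, img_emb, txt_emb in kb:
        mono = cosine(query_image, img_emb)   # query image vs. entity image
        cross = cosine(query_image, txt_emb)  # query image vs. entity name
        scored.append((alpha * mono + (1 - alpha) * cross, name))
    scored.sort(reverse=True)
    return [name for _, name in scored]


# Toy embeddings (hypothetical): entity "A" has a reference image close to
# the query, entity "B" has a name embedding close to the query.
query = [1.0, 0.0, 0.0]
kb = [
    ("A", [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]),
    ("B", [0.6, 0.8, 0.0], [1.0, 0.0, 0.0]),
]

print(rank_entities(query, kb, alpha=1.0))  # mono-modal only: ['A', 'B']
print(rank_entities(query, kb, alpha=0.5))  # fused: ['B', 'A']
```

In this toy case the mono-modal ranking and the fused ranking disagree, which is the kind of complementarity the abstract argues for: an entity whose reference image differs from the query can still be retrieved through its name.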