CL IR LG MM

Jan 11, 2023

Multimodal Inverse Cloze Task for Knowledge-based Visual Question Answering

Paul Lerner, Olivier Ferret, Camille Guinaudeau

arXiv:2301.04366v12.913 citations

h-index: 9

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the scarcity of training data for multimodal fusion models in KVQAE. It is an incremental advance: it adapts an existing textual pre-training method to multimodal documents to improve performance on this domain-specific task.

The paper tackles the difficulty of training complex fusion models for Knowledge-based Visual Question Answering about named Entities (KVQAE). It introduces a pre-training method, Multimodal Inverse Cloze Task, which adapts a textual pre-training approach to multimodal documents. Compared with a baseline without pre-training, the method yields a 9% relative-MRR gain for retrieval and a 15% relative-F1 gain for reading comprehension.

We present a new pre-training method, Multimodal Inverse Cloze Task, for Knowledge-based Visual Question Answering about named Entities (KVQAE). KVQAE is a recently introduced task that consists in answering questions about named entities grounded in a visual context using a Knowledge Base. Therefore, the interaction between the modalities is paramount to retrieve information and must be captured with complex fusion models. As these models require a lot of training data, we design this pre-training task from existing work in textual Question Answering. It consists in considering a sentence as a pseudo-question and its context as a pseudo-relevant passage and is extended by considering images near texts in multimodal documents. Our method is applicable to different neural network architectures and leads to a 9% relative-MRR and 15% relative-F1 gain for retrieval and reading comprehension, respectively, over a no-pre-training baseline.
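The pseudo-question / pseudo-passage construction described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sentence` and `ICTExample` types, the `make_ict_example` function, the `mask_prob` parameter, and the rule for choosing the passage image (the first image found in the remaining context) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sentence:
    text: str
    image: Optional[str] = None  # hypothetical: ID of an image located near this sentence


@dataclass
class ICTExample:
    query_text: str                # sentence used as the pseudo-question
    query_image: Optional[str]     # image near the pseudo-question, if any
    passage_text: str              # surrounding context used as the pseudo-relevant passage
    passage_image: Optional[str]   # an image from the context, if any


def make_ict_example(sentences: List[Sentence], rng: random.Random,
                     mask_prob: float = 0.9) -> ICTExample:
    """Build one (pseudo-question, pseudo-passage) pair from a multimodal document.

    mask_prob is an assumed hyperparameter: the probability of removing the
    pseudo-question from its context, so the pair cannot always be matched
    by exact string overlap.
    """
    if len(sentences) < 2:
        raise ValueError("need at least two sentences to form a query and a context")

    idx = rng.randrange(len(sentences))
    query = sentences[idx]

    if rng.random() < mask_prob:
        context = sentences[:idx] + sentences[idx + 1:]
    else:
        context = list(sentences)

    # Assumption: take the first image available in the remaining context.
    passage_image = next((s.image for s in context if s.image is not None), None)

    return ICTExample(
        query_text=query.text,
        query_image=query.image,
        passage_text=" ".join(s.text for s in context),
        passage_image=passage_image,
    )


if __name__ == "__main__":
    doc = [
        Sentence("The tower was completed in 1889.", image="img_tower"),
        Sentence("It was designed for the World's Fair."),
        Sentence("It remains a major landmark.", image="img_skyline"),
    ]
    print(make_ict_example(doc, random.Random(0)))
```

Pairs built this way could serve as positives for contrastive training of a retriever, with other passages in the batch acting as negatives. That training setup is likewise an assumption here, not a detail stated in the abstract.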
