Learning content and context with language bias for Visual Question Answering
This work is significant for researchers and practitioners in multimodal AI, particularly VQA, because it offers a method that reduces language bias while retaining beneficial contextual learning.
This paper addresses the challenge in Visual Question Answering (VQA) where models either ignore visual content due to language bias or lose contextual understanding when bias is reduced. The proposed CCB strategy enables VQA models to leverage both visual content and language context, leading to improved accuracy on VQA-CP v2.
Visual Question Answering (VQA) is a challenging multimodal task in which a model must answer questions about an image. Many works concentrate on reducing language bias, which causes models to answer questions while ignoring visual content and language context. However, reducing language bias also weakens the ability of VQA models to learn context priors. To address this issue, we propose a novel learning strategy named CCB, which forces VQA models to answer questions relying on Content and Context with language Bias. Specifically, CCB establishes Content and Context branches on top of a base VQA model and forces them to focus on local key content and global effective context, respectively. Moreover, a joint loss function is proposed to reduce the importance of biased samples while retaining their beneficial influence on answering questions. Experiments show that CCB outperforms state-of-the-art methods in terms of accuracy on VQA-CP v2.
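
The abstract does not give the exact form of the joint loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes a two-branch setup in which the content branch loss is down-weighted for samples that a question-only prior already answers correctly (reducing the importance of biased samples), while the context branch keeps an unweighted loss (retaining their influence). The function names, the `bias_probs` input, and the `1 - bias` weighting are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Negative log-likelihood of the target answer index."""
    return -math.log(max(probs[target], 1e-12))


def joint_loss(content_logits, context_logits, bias_probs, target):
    """Hypothetical joint loss over two branches (an assumption, not the paper's).

    content_logits: answer scores from the (assumed) content branch
    context_logits: answer scores from the (assumed) context branch
    bias_probs:     answer distribution of an assumed question-only prior
    target:         index of the ground-truth answer
    """
    content_probs = softmax(content_logits)
    context_probs = softmax(context_logits)

    # Assumed weighting: a sample the prior already answers confidently
    # (high bias_probs[target]) contributes less to the content branch.
    weight = 1.0 - bias_probs[target]
    content_loss = weight * cross_entropy(content_probs, target)

    # The context branch keeps the full loss, so biased samples still
    # contribute their contextual signal.
    context_loss = cross_entropy(context_probs, target)

    return content_loss + context_loss


if __name__ == "__main__":
    content = [2.0, 0.5, 0.1]
    context = [1.0, 1.0, 0.2]
    biased = joint_loss(content, context, [0.9, 0.05, 0.05], target=0)
    unbiased = joint_loss(content, context, [0.1, 0.45, 0.45], target=0)
    print(f"biased sample loss:   {biased:.4f}")
    print(f"unbiased sample loss: {unbiased:.4f}")
```

Under these assumptions, a sample that the question-only prior answers confidently yields a smaller total loss than an otherwise identical sample it does not, while the context term is the same in both cases.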