MASS: Overcoming Language Bias in Image-Text Matching
The work addresses a key challenge in multimodal AI, namely visual accuracy in tasks such as image-text retrieval, though the contribution appears incremental because it builds on existing visual-language models.
The paper tackles language bias in image-text matching, where models rely too heavily on language priors and neglect visual content. It introduces the Multimodal Association Score (MASS) framework, which reduces this bias without additional training while preserving the model's understanding of linguistic compositionality.
Pretrained visual-language models have made significant advancements in multimodal tasks, including image-text retrieval. However, a major challenge in image-text matching lies in language bias, where models predominantly rely on language priors and neglect to adequately consider the visual content. We thus present Multimodal ASsociation Score (MASS), a framework that reduces the reliance on language priors for better visual accuracy in image-text matching problems. It can be seamlessly incorporated into existing visual-language models without necessitating additional training. Our experiments have shown that MASS effectively lessens language bias without losing an understanding of linguistic compositionality. Overall, MASS offers a promising solution for enhancing image-text matching performance in visual-language models.
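The abstract does not state the scoring formula, so the following is only an illustrative sketch of how a language prior might be discounted at inference time without any training. It assumes (this is not taken from the text above) that a caption is scored by the average per-token gap between its log-likelihood given the image and its log-likelihood under the text alone. The function names and all numbers are hypothetical; in practice the log-probabilities would come from a pretrained visual-language model evaluated with and without the image.

```python
from typing import Dict, Sequence, Tuple

LogProbs = Sequence[float]


def raw_score(cond: LogProbs) -> float:
    """Plain image-conditioned log-likelihood of a caption (sum over tokens)."""
    return sum(cond)


def debiased_score(cond: LogProbs, prior: LogProbs) -> float:
    """Average per-token gap between image-conditioned and text-only log-probs.

    Assumption for illustration: subtracting the text-only term removes the
    credit a caption gets merely for being fluent or frequent.
    """
    if len(cond) != len(prior):
        raise ValueError("both sequences must cover the same tokens")
    if not cond:
        raise ValueError("caption must contain at least one token")
    return sum(c - p for c, p in zip(cond, prior)) / len(cond)


def best_caption(
    candidates: Dict[str, Tuple[LogProbs, LogProbs]], debias: bool
) -> str:
    """Return the caption with the highest score under the chosen scoring rule."""

    def score(item: Tuple[str, Tuple[LogProbs, LogProbs]]) -> float:
        cond, prior = item[1]
        return debiased_score(cond, prior) if debias else raw_score(cond)

    return max(candidates.items(), key=score)[0]


# Hypothetical toy values: (image-conditioned log-probs, text-only log-probs).
candidates = {
    # Generic caption: likely under the language prior, image adds little.
    "a photo of a dog": ([-0.2, -0.3], [-0.25, -0.3]),
    # Specific caption: unlikely a priori, but strongly supported by the image.
    "a dog wearing a red scarf": ([-0.6, -0.8], [-2.0, -2.4]),
}

print(best_caption(candidates, debias=False))
print(best_caption(candidates, debias=True))
```

With these toy values the raw likelihood favors the generic caption, while the prior-discounted score favors the caption that the image actually supports. Because the adjustment only recombines quantities the model already produces, a scheme of this kind needs no additional training, which matches the property the abstract claims for MASS.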