CV · Mar 18, 2025

See-Saw Modality Balance: See Gradient, and Sew Impaired Vision-Language Balance to Mitigate Dominant Modality Bias

arXiv:2503.13834v1 · 14 citations · h-index: 5 · NAACL
Originality: Incremental advance
AI Analysis

The paper addresses performance degradation in vision-language models when one modality is impaired. The advance is incremental because it builds on existing gradient-based methods.

The paper tackles dominant modality bias in vision-language models, where a model over-relies on one modality. It proposes the BalGrad framework, which balances modalities through gradient reweighting and gradient projection, and reports effectiveness on UPMC Food-101, Hateful Memes, and MM-IMDb.

Vision-language (VL) models have demonstrated strong performance across various tasks. However, these models often rely on a specific modality for predictions, leading to "dominant modality bias." This bias significantly hurts performance, especially when one modality is impaired. In this study, we analyze model behavior under dominant modality bias and theoretically show that unaligned gradients or differences in gradient magnitudes prevent balanced convergence of the loss. Based on these findings, we propose a novel framework, BalGrad, to mitigate dominant modality bias. Our approach includes inter-modality gradient reweighting, which adjusts the gradient of the KL divergence based on each modality's contribution, and inter-task gradient projection, which aligns task directions in a non-conflicting manner. Experiments on the UPMC Food-101, Hateful Memes, and MM-IMDb datasets confirm that BalGrad effectively alleviates over-reliance on specific modalities when making predictions.
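
The abstract does not spell out how the inter-task gradient projection is computed. As a rough illustration of what "aligning task directions in a non-conflicting manner" can mean, the sketch below uses a generic conflict-removing projection: when two gradients have a negative inner product, the opposing component of one is subtracted out. This is an assumption for illustration, not the paper's BalGrad update; the names `g_cls`, `g_kl`, and `project_if_conflicting` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(g_task: Vector, g_ref: Vector) -> Vector:
    """Return g_task with its component opposing g_ref removed.

    If the two gradients do not conflict (inner product >= 0), or g_ref is
    the zero vector, g_task is returned unchanged. Generic sketch only; the
    paper's actual projection rule may differ.
    """
    inner = dot(g_task, g_ref)
    ref_sq = dot(g_ref, g_ref)
    if inner >= 0.0 or ref_sq == 0.0:
        return list(g_task)
    scale = inner / ref_sq
    return [x - scale * y for x, y in zip(g_task, g_ref)]


# Hypothetical toy gradients for two objectives that partly conflict.
g_cls = [1.0, 0.0]   # assumed gradient of a classification loss
g_kl = [-1.0, 1.0]   # assumed gradient of a KL-divergence term

g_kl_projected = project_if_conflicting(g_kl, g_cls)
```

After the projection, `g_kl_projected` is orthogonal to `g_cls`, so a step along it no longer works against the other objective to first order.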

