LGAISDASOct 20, 2025

Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning

arXiv:2510.21797v2h-index: 11
Originality Incremental advance
AI Analysis

This addresses the issue of dominant modalities skewing learning in multimodal systems, which is an incremental improvement over existing optimization- or data-based strategies.

The paper tackled the problem of multimodal imbalance in audio-visual learning by proposing a quantitative analysis framework and an adaptive loss function, achieving state-of-the-art performance with accuracies of 80.65% on CREMA-D, 70.40% on AVE, and 72.42% on KineticSound.

The heterogeneity of multimodal data leads to inconsistencies and imbalance, allowing a dominant modality to steer gradient updates. Existing solutions mainly focus on optimization- or data-based strategies but rarely exploit the information inherent in multimodal imbalance or conduct its quantitative analysis. To address this gap, we propose a novel quantitative analysis framework for Multimodal Imbalance and design a sample-level adaptive loss function. We define the Modality Gap as the Softmax score difference between modalities for the correct class and model its distribution using a bimodal Gaussian Mixture Model(GMM), representing balanced and imbalanced samples. Using Bayes' theorem, we estimate each sample's posterior probability of belonging to these two groups. Based on this, our adaptive loss (1) minimizes the overall Modality Gap, (2) aligns imbalanced samples with balanced ones, and (3) adaptively penalizes each according to its imbalance degree. A two-stage training strategy-warm-up and adaptive phases,yields state-of-the-art performance on CREMA-D (80.65%), AVE (70.40%), and KineticSound (72.42%). Fine-tuning with high-quality samples identified by the GMM further improves results, highlighting their value for effective multimodal fusion.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes