Jianheng Dai

50.2 LG May 24

QASA: Quality-Aware Semantic Augmentation for Robust Multimodal Sentiment Analysis

Jiazhang Liang, Jianheng Dai, Miaosen Luo et al.

Multimodal large language models have demonstrated strong ability in capturing semantic representations for multimodal sentiment analysis. Their capacity to learn stable and generalizable multimodal features is limited, however, by the scarcity of high-quality training data. To address this, we propose QASA (Quality-Aware Semantic Augmentation), which uses diffusion models to generate augmented visual and auditory samples, thereby enlarging the training dataset and supporting multimodal learning. The generated samples can vary in quality and may exhibit cross-modal inconsistencies. To manage this, we introduce a decoupled quality-aware scoring module that assigns training weights based on the reliability of each augmented sample. This approach reduces the influence of low-quality data and contributes to more stable and robust model training. The framework combines the generative capabilities of diffusion models with the semantic reasoning of multimodal large models, providing an automated data augmentation strategy that does not require human annotation while improving generalization and robustness under limited high-quality data. Experiments on the CH-SIMS dataset show that QASA yields a relative increase of 18.0% and 5.9% in five-class accuracy (Acc5) and binary accuracy (Acc2), respectively, and it also outperforms existing methods on the CMU-MOSI and MUStARD benchmarks.
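The idea of weighting augmented samples by reliability can be illustrated with a minimal sketch. This is not the QASA implementation: the abstract does not say how reliability scores are computed or combined, so the function name `quality_weighted_loss`, the score range [0, 1], and the choice to give original samples a fixed weight of 1 are all assumptions made for illustration.

```python
def quality_weighted_loss(losses, reliability, is_augmented):
    """Weighted mean of per-sample losses.

    Assumption (not from the paper): original samples get weight 1.0,
    augmented samples get their reliability score clipped to [0, 1].
    """
    weights = []
    for score, augmented in zip(reliability, is_augmented):
        if augmented:
            weights.append(min(max(score, 0.0), 1.0))
        else:
            weights.append(1.0)

    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0  # nothing reliable to learn from in this batch

    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    return weighted_sum / total_weight


# Hypothetical batch: one original sample and one unreliable generated sample.
batch_loss = quality_weighted_loss(
    losses=[1.0, 3.0],
    reliability=[1.0, 0.0],
    is_augmented=[False, True],
)
```

Under these assumptions, a generated sample with a reliability score of 0 contributes nothing to the batch loss, which is one simple way low-quality data could be down-weighted during training.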

15.9 AI May 27

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

Jianheng Dai, Jiazhang Liang, Sijie Mai

Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities and inducing gradient norm conflicts that destabilize training. To address this, we propose a Conflict-aware Penalty (CP) that detects and penalizes gradient norm conflicts at each training step, and a Statistical Loss (SL) that aligns predicted distribution statistics with empirical input statistics. Crucially, CP prevents dominant modality gradients from interfering with the SL objective, enabling synergistic training within a unified framework incorporating adaptive modality encoding, gated cross-modal fusion, and unimodal auxiliary heads. Experiments on CMU-MOSI demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of each component.
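A rough sketch may help make the two loss terms concrete. The abstract does not give their formulas, so everything below is an assumption: `conflict_penalty` treats a "gradient norm conflict" as one modality's gradient norm exceeding another's by more than a hypothetical `max_ratio`, and `statistical_loss` compares only the mean and standard deviation of predictions against a reference batch. Neither should be read as the authors' definition.

```python
import statistics


def conflict_penalty(grad_norms, max_ratio=2.0, eps=1e-8):
    """Penalize modalities whose gradient norm is dwarfed by the largest one.

    Assumption: a conflict is a norm ratio above `max_ratio` (hypothetical).
    """
    largest = max(grad_norms.values())
    penalty = 0.0
    for norm in grad_norms.values():
        ratio = largest / (norm + eps)
        penalty += max(0.0, ratio - max_ratio)
    return penalty


def statistical_loss(predictions, references):
    """Squared gaps between the mean and standard deviation of two batches.

    Assumption: first- and second-order statistics are the ones aligned.
    """
    mean_gap = statistics.fmean(predictions) - statistics.fmean(references)
    std_gap = statistics.pstdev(predictions) - statistics.pstdev(references)
    return mean_gap ** 2 + std_gap ** 2


# Hypothetical per-modality gradient norms at one training step.
step_norms = {"text": 4.0, "audio": 1.0, "visual": 2.0}
step_penalty = conflict_penalty(step_norms)
```

In this toy setup only the audio branch is penalized, because the text gradient norm is four times larger than the audio norm while the visual norm stays within the allowed ratio.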

2 Papers