MM | CV | SD | AS | May 28, 2025

Mitigating Audiovisual Mismatch in Visual-Guide Audio Captioning

arXiv:2505.22045v1 | 1 citation | h-index: 6 | INTERSPEECH
Originality: Highly original
AI Analysis

This work addresses a critical issue in multimodal AI for real-world applications like dubbed content, offering a novel solution to enhance model robustness against mismatched data.

The paper tackles audiovisual misalignment in vision-guided audio captioning by introducing an entropy-aware gated fusion framework and a batch-wise audiovisual shuffling technique. It reports stronger performance than existing baselines on the AudioCaps benchmark, especially when the modalities are mismatched, and an approximately 6x improvement in inference speed over the baseline.

Current vision-guided audio captioning systems frequently fail to address audiovisual misalignment in real-world scenarios, such as dubbed content or off-screen sounds. To bridge this critical gap, we present an entropy-aware gated fusion framework that dynamically modulates visual information flow through cross-modal uncertainty quantification. Our novel approach employs attention entropy analysis in cross-attention layers to automatically identify and suppress misleading visual cues during modal fusion. Complementing this architecture, we develop a batch-wise audiovisual shuffling technique that generates synthetic mismatched training pairs, greatly enhancing model resilience against alignment noise. Evaluations on the AudioCaps benchmark demonstrate our system's superior performance over existing baselines, especially in mismatched modality scenarios. Furthermore, our solution demonstrates an approximately 6x improvement in inference speed compared to the baseline.
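The two ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the gate formula or the shuffling procedure, so everything below is an assumption made for illustration. The sketch assumes that a near-uniform (high-entropy) cross-attention distribution signals unreliable visual cues, and it uses a hypothetical gate equal to one minus the normalized entropy. The function names (`attention_entropy`, `entropy_gate`, `gated_fusion`, `shuffle_visual`) and the `mismatch_prob` parameter are likewise hypothetical.

```python
import math
import random


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def entropy_gate(weights):
    """Hypothetical gate in [0, 1] (an assumption, not the paper's formula).

    Returns 1 for a fully peaked distribution and 0 for a uniform one,
    so uncertain attention passes less visual information.
    """
    n = len(weights)
    if n < 2:
        return 1.0
    return 1.0 - attention_entropy(weights) / math.log(n)


def gated_fusion(audio_feat, visual_ctx, weights):
    """Add the attended visual context to the audio feature, scaled by the gate."""
    g = entropy_gate(weights)
    return [a + g * v for a, v in zip(audio_feat, visual_ctx)]


def shuffle_visual(batch, mismatch_prob, rng):
    """Build synthetic mismatched pairs inside one batch (assumed procedure).

    batch is a list of (audio, visual) pairs. With probability mismatch_prob
    a sample keeps its audio but takes the visual input of a different sample.
    Returns (audio, visual, is_matched) triples.
    """
    n = len(batch)
    out = []
    for i, (audio, visual) in enumerate(batch):
        if n > 1 and rng.random() < mismatch_prob:
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # skip index i so the donor is always another sample
            out.append((audio, batch[j][1], False))
        else:
            out.append((audio, visual, True))
    return out


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]
    uniform = [0.25, 0.25, 0.25, 0.25]
    print("gate (peaked): ", round(entropy_gate(peaked), 3))
    print("gate (uniform):", round(entropy_gate(uniform), 3))

    batch = [("audio0", "video0"), ("audio1", "video1"), ("audio2", "video2")]
    print(shuffle_visual(batch, mismatch_prob=0.5, rng=random.Random(0)))
```

In this sketch the gate is a fixed function of the entropy; a trained system would more plausibly learn how entropy maps to a gate value. The shuffled pairs carry an `is_matched` flag only so the effect is visible when printed.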

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
