CV · Feb 9

CAE-AV: Improving Audio-Visual Learning via Cross-modal Interactive Enrichment

arXiv:2602.08309v1 · 1.5 · h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses audio-visual misalignment, a problem relevant to multimedia and AI applications, and represents an incremental improvement over existing methods.

The paper tackles modality misalignment in audio-visual learning by proposing CAE-AV, a framework that uses cross-modal interactive enrichment to improve representation quality. It reports state-of-the-art performance on multiple benchmarks.

Audio-visual learning suffers from modality misalignment caused by off-screen sources and background clutter, and current methods usually amplify irrelevant regions or moments, leading to unstable training and degraded representation quality. To address this challenge, we propose a novel Caption-aligned and Agreement-guided Enhancement framework (CAE-AV) for audio-visual learning, which uses two complementary modules, Cross-modal Agreement-guided Spatio-Temporal Enrichment (CASTE) and Caption-Aligned Saliency-guided Enrichment (CASE), to relieve audio-visual misalignment. CASTE dynamically balances spatial and temporal relations by evaluating frame-level audio-visual agreement, ensuring that key information is captured from both preceding and subsequent frames under misalignment. CASE injects cross-modal semantic guidance into selected spatio-temporal positions, leveraging high-level semantic cues to further alleviate misalignment. In addition, we design three lightweight objectives (caption-to-modality InfoNCE, visual-audio consistency, and entropy regularization) to guide token selection and strengthen cross-modal semantic alignment. With frozen backbones, CAE-AV achieves state-of-the-art performance on the AVE, AVVP, AVS, and AVQA benchmarks, and qualitative analyses further validate its robustness against audio-visual misalignment.
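
The abstract names its building blocks without giving formulas. The sketch below illustrates, in general terms, three of them: a frame-level audio-visual agreement score, an InfoNCE contrastive loss, and an entropy term over selection weights. It is not the paper's implementation. The function names, the use of cosine similarity for agreement, the temperature of 0.07, and the plain-list feature format are all assumptions made for illustration, and the abstract does not say how the entropy term is weighted or in which direction it is optimized.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def frame_agreement(audio_frames, visual_frames):
    """Per-frame agreement in [0, 1]; cosine similarity is an assumed choice."""
    return [0.5 * (1.0 + cosine(a, v))
            for a, v in zip(audio_frames, visual_frames)]


def info_nce(anchors, positives, temperature=0.07):
    """Standard InfoNCE: positives[i] matches anchors[i]; other rows act as negatives."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


def entropy(probs, eps=1e-12):
    """Shannon entropy of a selection distribution (low means a peaked selection)."""
    return -sum(p * math.log(p + eps) for p in probs)


# Toy features: frame 0 is aligned, frame 1 is misaligned (orthogonal).
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [1.0, 0.0]]
agreement = frame_agreement(audio, visual)

# Caption-to-modality pairs: matched pairs versus deliberately swapped pairs.
captions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(captions, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(captions, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the aligned frame receives a higher agreement score than the misaligned one, and the InfoNCE loss is much lower for matched caption pairs than for swapped ones. A module like CASTE could use such per-frame scores to decide how much to rely on neighboring frames, but the exact gating rule is not given in the abstract.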

