CV | Dec 30, 2024

LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing

arXiv:2412.20872v2 | 5 citations | h-index: 22 | ICASSP
Originality: Incremental advance
AI Analysis

This work addresses modality misalignment in audio-visual video parsing. It is an incremental improvement, of interest mainly to researchers in multimodal learning.

The paper tackles audio-visual video parsing with weak labels, where misalignment between modalities introduces noise during their interaction. It introduces LINK, which dynamically adjusts the contribution of each modality and uses pseudo-labels to reduce noise. The authors report that it outperforms existing methods on the LLP dataset.

Audio-visual video parsing focuses on classifying videos through weak labels while identifying events as either visible, audible, or both, alongside their respective temporal boundaries. Many methods ignore that different modalities often lack alignment, thereby introducing extra noise during modal interaction. In this work, we introduce a Learning Interaction method for Non-aligned Knowledge (LINK), designed to equilibrate the contributions of distinct modalities by dynamically adjusting their input during event prediction. Additionally, we leverage the semantic information of pseudo-labels as a priori knowledge to mitigate noise from other modalities. Our experimental findings demonstrate that our model outperforms existing methods on the LLP dataset.
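
As a rough illustration of what "dynamically adjusting" a modality's input can mean, the sketch below mixes one modality's segment feature into the other's with a weight that shrinks when the two disagree. This is not the LINK architecture, which the abstract does not specify. The function names `cosine` and `gated_fusion`, the cosine-similarity gate, and the 0.5 cap are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gated_fusion(own, other):
    """Mix `other` into `own` with a gate in [0, 0.5] (hypothetical rule).

    The gate grows with agreement between the two modalities, so a
    misaligned modality contributes little and adds less noise.
    """
    gate = 0.5 * max(0.0, cosine(own, other))
    fused = [(1 - gate) * a + gate * b for a, b in zip(own, other)]
    return fused, gate


# Toy segment features: one audio vector, two candidate visual vectors.
audio = [1.0, 0.0]
aligned_visual = [1.0, 0.0]
misaligned_visual = [0.0, 1.0]

fused_a, gate_a = gated_fusion(audio, aligned_visual)
fused_m, gate_m = gated_fusion(audio, misaligned_visual)
print(gate_a, gate_m)
```

In this toy case the aligned visual feature receives the maximum gate, while the orthogonal one is gated out entirely and the audio feature passes through unchanged. A learned model would typically replace the fixed similarity rule with trainable parameters.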

