CV, MM · Oct 11, 2023

CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing

arXiv:2310.07517v1 · 12 citations · h-index: 24
Originality: Incremental advance
AI Analysis

This work addresses audio-visual video parsing for multimedia analysis, presenting an incremental improvement over existing methods.

The paper tackles audio-visual video parsing by addressing two gaps in prior methods: overlooked segment-level detail and reliance on a single modality. It reports improved parsing performance on the Look, Listen, and Parse dataset.

Audio-visual video parsing is the task of categorizing a video at the segment level with weak labels, and predicting them as audible or visible events. Recent methods for this task leverage the attention mechanism to capture the semantic correlations among the whole video across the audio-visual modalities. However, these approaches have overlooked the importance of individual segments within a video and the relationship among them, and tend to rely on a single modality when learning features. In this paper, we propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module. Furthermore, a cross-modal aggregation block is introduced to jointly optimize the semantic representation of audio and visual signals by enhancing inter-modal interactions. The experimental results show that our model offers improved parsing performance on the Look, Listen, and Parse dataset compared to other methods.
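
The two ideas named in the abstract, attention among the segments of one modality and interaction between the audio and visual streams, can be illustrated with plain scaled dot-product attention. The sketch below is not the CM-PIE architecture: it has no learned projections, and the function names (`attend`, `fuse`), the segment count, the feature size, and the residual-sum fusion are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over segments.

    Each argument is a list of per-segment feature vectors.
    Returns (outputs, weights), where weights[i][j] is how much
    query segment i attends to key segment j.
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


def fuse(own, cross):
    """Hypothetical fusion: element-wise sum of the two attended views."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(own, cross)]


random.seed(0)
T, D = 10, 4  # assumed: 10 segments per video, 4-dim toy features
audio = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
visual = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]

# Segment-level attention within each modality.
audio_self, w_aa = attend(audio, audio, audio)
visual_self, w_vv = attend(visual, visual, visual)

# Cross-modal attention: each stream queries the other one.
audio_cross, w_av = attend(audio, visual, visual)
visual_cross, w_va = attend(visual, audio, audio)

audio_fused = fuse(audio_self, audio_cross)
visual_fused = fuse(visual_self, visual_cross)
```

Each row of `w_av` shows which visual segments a given audio segment draws on, which is the kind of inter-modal interaction the abstract refers to. A real model would feed the fused per-segment features into a classifier trained with the weak video-level labels.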

