CV · MM · SD · AS · Jul 29, 2022

UAVM: Towards Unifying Audio and Visual Models

MIT
arXiv:2208.00061v2 · 31 citations · h-index: 22
Originality: Incremental advance
AI Analysis

This work addresses modality separation in audio-visual learning, where audio and video are handled by independent branches. The contribution appears incremental because it builds on existing models.

The paper tackles the problem of independent audio and video branches in conventional audio-visual models by proposing a Unified Audio-Visual Model (UAVM), which achieves a new state-of-the-art accuracy of 65.8% on VGGSound for audio-visual event classification.

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.

Code Implementations: 1 repo
