CV MM SD AS

Jul 29, 2022

UAVM: Towards Unifying Audio and Visual Models

Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, James Glass

MIT

arXiv:2208.00061v2

14.131 citations

h-index: 22

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the separation of the audio and visual modalities in audio-visual learning. It appears to be an incremental advance, since it builds on existing models.

The paper tackles the problem of independent audio and video branches in conventional audio-visual models by proposing a Unified Audio-Visual Model (UAVM), which achieves a new state-of-the-art accuracy of 65.8% on VGGSound for audio-visual event classification.

Conventional audio-visual models have independent audio and video branches. In this work, we unify the audio and visual branches by designing a Unified Audio-Visual Model (UAVM). The UAVM achieves a new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound. More interestingly, we also find a few intriguing properties of UAVM that the modality-independent counterparts do not have.
