CVMMSep 15, 2023

AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

arXiv:2309.08738v212 citationsh-index: 10
Originality Incremental advance
AI Analysis

This work addresses video representation learning for computer vision applications, showing incremental improvement by integrating audio to enhance visual features in challenging scenarios.

The paper tackles the challenge of learning high-quality video representations, especially in low-resolution or blurry videos, by proposing AV-MaskEnhancer, which combines audio and visual information through a masked autoencoder. The result achieves state-of-the-art performance on the UCF101 dataset with a top-1 accuracy of 98.8% and top-5 accuracy of 99.9%.

Learning high-quality video representation has shown significant applications in computer vision and remains challenging. Previous work based on mask autoencoders such as ImageMAE and VideoMAE has proven the effectiveness of learning representations in images and videos through reconstruction strategy in the visual modality. However, these models exhibit inherent limitations, particularly in scenarios where extracting features solely from the visual modality proves challenging, such as when dealing with low-resolution and blurry original videos. Based on this, we propose AV-MaskEnhancer for learning high-quality video representation by combining visual and audio information. Our approach addresses the challenge by demonstrating the complementary nature of audio and video features in cross-modality content. Moreover, our result of the video classification task on the UCF101 dataset outperforms the existing work and reaches the state-of-the-art, with a top-1 accuracy of 98.8% and a top-5 accuracy of 99.9%.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes