CV | MM | SD | AS | Dec 15, 2022

MAViL: Masked Audio-Video Learners

Meta AI, MIT
arXiv:2212.08071v2 | 86 citations | h-index: 63
AI Analysis

This work addresses self-supervised learning of audio-visual representations. According to the authors, it is the first self-supervised audio-visual model to outperform models that use external supervision on AudioSet and VGGSound.

The paper tackles the problem of learning audio-visual representations by introducing MAViL, which combines masked reconstruction, contrastive learning, and self-training, achieving state-of-the-art results on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy).

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and inter-modal contrastive learning with masking, and (3) self-training by reconstructing joint audio-video contextualized features learned from the first two objectives. Pre-training with MAViL not only enables the model to perform well in audio-visual classification and retrieval tasks but also improves representations of each modality in isolation, without using information from the other modality for fine-tuning or inference. Empirically, MAViL sets a new state-of-the-art on AudioSet (53.1 mAP) and VGGSound (67.1% accuracy). For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on these benchmarks.
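The three objectives named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 0.8 mask ratio, the 0.1 temperature, the equal loss weighting, and the random arrays standing in for encoder, decoder, and teacher outputs are all assumptions chosen for illustration only.

```python
import numpy as np


def random_mask(num_tokens, mask_ratio, rng):
    """Return a boolean mask; True marks tokens hidden from the encoder."""
    num_masked = int(round(num_tokens * mask_ratio))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.permutation(num_tokens)[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error computed over the masked tokens only."""
    per_token = ((pred - target) ** 2).mean(axis=-1)
    return float(per_token[mask].mean())


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE loss; row i of `a` and row i of `b` form a positive pair."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def diag_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.diag(log_prob).mean()

    return float(0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, tokens, dim = 4, 16, 8  # hypothetical sizes

    # (1) Masked reconstruction: random arrays stand in for patches and decoder output.
    mask = random_mask(tokens, mask_ratio=0.8, rng=rng)
    patches = rng.normal(size=(tokens, dim))
    reconstruction = rng.normal(size=(tokens, dim))
    loss_recon = masked_mse(reconstruction, patches, mask)

    # (2) Inter-modal contrast: one pooled embedding per clip and modality.
    audio_emb = rng.normal(size=(batch, dim))
    video_emb = rng.normal(size=(batch, dim))
    loss_contrast = info_nce(audio_emb, video_emb)

    # (3) Self-training: the target is a teacher's contextualized features
    # instead of raw input patches (placeholder values here).
    teacher_features = rng.normal(size=(tokens, dim))
    student_pred = rng.normal(size=(tokens, dim))
    loss_distill = masked_mse(student_pred, teacher_features, mask)

    # Equal weighting is an assumption, not taken from the paper.
    total = loss_recon + loss_contrast + loss_distill
    print(f"total loss: {total:.4f}")
```

In this sketch the reconstruction and self-training terms share the same masked-error form and differ only in the target, which mirrors how the abstract describes the third objective as reconstructing learned features rather than raw inputs.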
