CV | AI | LG | MM | SD | AS | Jan 8, 2024

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

arXiv:2401.04154v1 | 5 citations | h-index: 2
Originality: Incremental advance
AI Analysis

This work addresses the computational and accuracy challenges in multimodal video analysis for platforms like YouTube, though it is incremental as it builds on existing Transformer and self-supervised methods.

The paper tackles efficient multimodal learning for audio-video classification by proposing AVT, an audio-video bottleneck Transformer that reduces cross-modality complexity and integrates self-supervised objectives. Reported accuracy gains include 8% on Kinetics-Sounds over the previous state of the art and 3.8% on Epic-Kitchens-100 over MBT.

Audio and video are the two most common modalities on mainstream media platforms, e.g., YouTube. To learn from multimodal videos effectively, in this work we propose a novel audio-video recognition approach termed audio video Transformer, AVT, which leverages the effective spatio-temporal representation of the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources; instead, we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of the multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves the accuracy by 3.8% on Epic-Kitchens-100.
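To make the fusion-cost argument in the abstract concrete, the following is a minimal sketch that counts pairwise attention interactions per layer for two layouts: full self-attention over concatenated audio and video tokens, and a bottleneck layout in which each modality attends only to its own tokens plus a few shared bottleneck tokens. The bottleneck layout follows the general idea of fusion bottlenecks as used in MBT [32]; whether AVT uses exactly this layout is an assumption here. The function names and token counts are hypothetical and are not taken from the paper.

```python
# Illustrative count of pairwise attention interactions per layer.
# The token counts and the bottleneck layout are assumptions for
# illustration, not values or design details taken from the paper.


def concat_attention_pairs(n_audio: int, n_video: int) -> int:
    """Full self-attention over the concatenated audio and video tokens."""
    n = n_audio + n_video
    return n * n


def bottleneck_attention_pairs(n_audio: int, n_video: int, n_bottleneck: int) -> int:
    """Each modality attends to its own tokens plus shared bottleneck tokens."""
    audio_side = (n_audio + n_bottleneck) ** 2
    video_side = (n_video + n_bottleneck) ** 2
    return audio_side + video_side


if __name__ == "__main__":
    # Hypothetical sizes: 400 audio tokens, 1568 video tokens, 4 bottleneck tokens.
    n_audio, n_video, n_bottleneck = 400, 1568, 4
    full = concat_attention_pairs(n_audio, n_video)
    narrow = bottleneck_attention_pairs(n_audio, n_video, n_bottleneck)
    print(f"concatenated: {full} pairs")
    print(f"bottleneck:   {narrow} pairs")
    print(f"ratio:        {narrow / full:.2f}")
```

The saving in this count comes from dropping the direct audio-to-video cross terms, so it is largest when the two modalities have similar token counts and the bottleneck is small. The sketch counts attention pairs only and ignores feed-forward layers, so it is not comparable to the FLOPs figures quoted in the abstract.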

