Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup
This work addresses the high cost of video annotation for researchers and practitioners in video understanding, offering an incremental improvement by integrating audio into semi-supervised learning.
The paper tackles semi-supervised video action recognition by proposing an audio-visual framework with a novel audio source localization-guided mixup method, and reports superior performance on the UCF-51, Kinetics-400, and VGGSound datasets.
Video action recognition is a challenging but important task for understanding what happens in a video. However, annotating videos is costly, so semi-supervised learning (SSL) has been studied as a way to improve performance when only a small amount of labeled data is available. Prior work on semi-supervised video action recognition has mostly relied on a single modality, the visual stream. Video is inherently multi-modal, though, and using both visuals and audio should improve performance further, yet this direction has not been explored well. We therefore propose audio-visual SSL for video action recognition, which uses the visual and audio modalities together even when only a few labeled samples are available, a challenging setting. In addition, to make the most of the information in both audio and video, we propose a novel audio source localization-guided mixup method that considers inter-modal relations between the video and audio modalities. Experiments on the UCF-51, Kinetics-400, and VGGSound datasets show the superior performance of the proposed semi-supervised audio-visual action recognition framework and of audio source localization-guided mixup.
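The abstract does not give the mixup formulation, so the following is only a minimal, dependency-free sketch of one way a localization-guided mixup could work, not the authors' method. It assumes each sample comes with a per-pixel heat map in [0, 1] from some audio source localization model. The function name `localization_guided_mixup`, the flat-list data layout, and the rule of deriving the audio and label mixing ratio from the mean pixel weight are all hypothetical choices made for illustration.

```python
def localization_guided_mixup(frame_a, frame_b, heat_a, heat_b,
                              audio_a, audio_b, label_a, label_b, eps=1e-8):
    """Mix two audio-visual samples using localization heat maps.

    Hypothetical sketch: frames and heat maps are flat lists of floats of
    equal length, heat values lie in [0, 1], audio clips are equal-length
    lists of floats, and labels are one-hot (or soft) lists.
    """
    # Per-pixel weight for sample A. A pixel keeps more of A where A's
    # sound source is localized more strongly than B's. When both heat
    # values are zero the weight falls back to 0.5.
    weights = [(ha + eps) / (ha + hb + 2 * eps)
               for ha, hb in zip(heat_a, heat_b)]

    # Pixel-wise convex combination of the two frames.
    mixed_frame = [w * pa + (1 - w) * pb
                   for w, pa, pb in zip(weights, frame_a, frame_b)]

    # Assumption: one global ratio, the mean pixel weight, is reused for
    # the audio and the labels so that all three stay consistent.
    lam = sum(weights) / len(weights)
    mixed_audio = [lam * sa + (1 - lam) * sb
                   for sa, sb in zip(audio_a, audio_b)]
    mixed_label = [lam * ya + (1 - lam) * yb
                   for ya, yb in zip(label_a, label_b)]
    return mixed_frame, mixed_audio, mixed_label, lam


# Toy usage: A's source occupies the first two pixels, B's the last two.
frame, audio, label, lam = localization_guided_mixup(
    frame_a=[1.0, 1.0, 1.0, 1.0], frame_b=[0.0, 0.0, 0.0, 0.0],
    heat_a=[1.0, 1.0, 0.0, 0.0], heat_b=[0.0, 0.0, 1.0, 1.0],
    audio_a=[0.2, 0.4], audio_b=[0.6, 0.8],
    label_a=[1.0, 0.0], label_b=[0.0, 1.0],
)
```

In this toy case the mixed frame keeps A's pixels where A's source is localized and B's elsewhere, and the labels are blended in proportion to the area each source occupies. A real pipeline would operate on batched tensors and would need to follow the formulation given in the paper itself.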