Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
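This work targets researchers and practitioners in multimedia AI who need stronger audio and video analysis models. It is an incremental advance, obtained by combining curriculum learning, careful selection of negative examples, and a contrastive loss within self-supervised audio-video synchronization.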
The paper learns audio and video models from self-supervised synchronization. The resulting audio features match or exceed the state of the art on audio classification benchmarks (DCASE2014 and ESC-50), and the pretrained visual subnet improves action recognition accuracy over training from scratch by +19.9% on UCF101 and +17.7% on HMDB51.
There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern the temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
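To make the role of the contrastive loss concrete, the following is a minimal sketch of a standard margin-based contrastive loss applied to audio-video embedding pairs. It is an illustration under stated assumptions, not the exact formulation used in the paper: the function names, the Euclidean distance, the label convention (1 for in-sync, 0 for out-of-sync), and the default margin of 1.0 are all hypothetical choices.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(video_embs, audio_embs, labels, margin=1.0):
    """Mean margin-based contrastive loss over a batch of audio-video pairs.

    Assumed convention (not taken from the paper):
      label 1 = audio and video are in sync (positive pair)
      label 0 = audio and video are out of sync (negative pair)
    """
    total = 0.0
    for v, a, y in zip(video_embs, audio_embs, labels):
        d = euclidean_distance(v, a)
        if y == 1:
            # Positive pairs are pulled together: penalize any distance.
            total += d ** 2
        else:
            # Negative pairs are pushed apart, but only until they are
            # at least `margin` away from each other.
            total += max(margin - d, 0.0) ** 2
    return total / len(labels)


if __name__ == "__main__":
    video = [[0.0, 0.0], [0.0, 0.0]]
    audio = [[3.0, 4.0], [0.0, 0.0]]
    # First pair is labeled in-sync, second pair out-of-sync.
    print(contrastive_loss(video, audio, [1, 0]))
```

Under this formulation, which negatives are presented matters: a negative pair whose embeddings are already farther apart than the margin contributes zero loss and therefore no training signal, which is one way to see why the choice of negative examples and the order in which they are introduced can affect the learned representation.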