CV · LG · Nov 9, 2021

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

arXiv:2111.05329v5 · 16 citations
Originality: Incremental advance
AI Analysis

This work aims to improve audio-visual representation learning for downstream tasks such as action recognition and sound classification. Its approach is an incremental advance over existing self-supervised methods.

The paper introduces a self-supervised framework that relaxes the temporal synchronicity between audio and video, so that the model also learns asynchronous cross-modal relationships. It achieves state-of-the-art or competitive performance on action recognition, sound classification, and action retrieval, and it outperforms fully-supervised pretraining when pretrained on Kinetics-Sound.

We present CrissCross, a self-supervised framework for learning audio-visual representations. A novel notion is introduced in our framework whereby, in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We perform in-depth studies showing that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
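The three kinds of relations named in the abstract (intra-modal, synchronous cross-modal, and asynchronous cross-modal) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `max_gap` parameter, and the use of plain cosine similarity as the agreement measure are all assumptions made for illustration. The sketch samples two time windows from one clip and scores every pairing of the resulting video and audio embeddings.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_two_offsets(clip_len, window, max_gap, rng):
    """Pick start times t1, t2 for two windows inside one clip.

    max_gap (hypothetical parameter) bounds |t1 - t2|. Setting it to 0
    forces t1 == t2, i.e. strictly synchronous sampling; larger values
    relax the synchronicity.
    """
    latest_start = clip_len - window
    t1 = rng.uniform(0.0, latest_start)
    lo = max(0.0, t1 - max_gap)
    hi = min(latest_start, t1 + max_gap)
    t2 = rng.uniform(lo, hi)
    return t1, t2


def relaxed_objective(v1, v2, a1, a2):
    """Toy agreement objective over all pairings (assumed, not from the paper).

    v1, a1: video and audio embeddings taken at time t1.
    v2, a2: video and audio embeddings taken at time t2.
    Returns the negative mean cosine similarity, so lower is better.
    """
    intra = cosine(v1, v2) + cosine(a1, a2)          # same modality, different times
    synchronous = cosine(v1, a1) + cosine(v2, a2)    # across modalities, same time
    asynchronous = cosine(v1, a2) + cosine(v2, a1)   # across modalities, different times
    return -(intra + synchronous + asynchronous) / 6.0


if __name__ == "__main__":
    rng = random.Random(0)
    t1, t2 = sample_two_offsets(clip_len=10.0, window=2.0, max_gap=3.0, rng=rng)
    print(f"window starts: t1={t1:.2f}s, t2={t2:.2f}s")
    print(relaxed_objective([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]))
```

In this sketch, dropping the `asynchronous` term recovers an objective that only uses intra-modal and time-aligned cross-modal pairs, which is the contrast the abstract draws.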

Code Implementations: 1 repo
