CV · Apr 26, 2021

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

arXiv:2104.12671v3 · 99 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of self-supervised learning for multimodal data, enabling improved search and retrieval across modalities without human supervision, though its contribution appears incremental, since it extends contrastive learning with a clustering step.

The paper tackles the problem of learning a common multimodal embedding space from unlabeled videos to enable cross-modal retrieval and semantic grouping, achieving state-of-the-art results in zero-shot retrieval tasks on four datasets.

Multimodal self-supervised learning is attracting growing attention because it makes it possible not only to train large networks without human supervision but also to search and retrieve data across modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
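
The idea of pairing an instance-level contrastive term with a clustering term can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`cross_entropy`, `spherical_kmeans`, `combined_loss`), the temperature, the choice of spherical k-means on the mean of the three modality embeddings, and the weighting are all assumptions made for illustration, and the embeddings are plain Python lists rather than network outputs.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy(queries, keys, targets, temperature=0.1):
    """Mean softmax cross-entropy: queries[i] should score highest on keys[targets[i]]."""
    total = 0.0
    for q, t in zip(queries, targets):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[t]
    return total / len(queries)

def spherical_kmeans(points, k, iters=10, seed=0):
    """Toy k-means on unit vectors, using dot product as similarity."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def assign_all():
        return [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]

    for _ in range(iters):
        assign = assign_all()
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return centroids, assign_all()

def combined_loss(video, audio, text, k=2, cluster_weight=1.0):
    """Contrastive loss over modality pairs plus a centroid-based clustering loss."""
    mods = [[normalize(v) for v in m] for m in (video, audio, text)]
    n = len(mods[0])
    same_index = list(range(n))  # sample i in one modality matches sample i in another

    # Instance-level term: symmetric InfoNCE over every pair of modalities.
    contrastive = 0.0
    for i in range(3):
        for j in range(i + 1, 3):
            contrastive += cross_entropy(mods[i], mods[j], same_index)
            contrastive += cross_entropy(mods[j], mods[i], same_index)

    # Clustering term (assumed form): cluster the per-sample mean embedding,
    # then pull each modality embedding toward its sample's centroid.
    joint = [normalize([sum(xs) / 3 for xs in zip(*triple)]) for triple in zip(*mods)]
    centroids, assign = spherical_kmeans(joint, k)
    clustering = sum(cross_entropy(m, centroids, assign) for m in mods)

    return contrastive + cluster_weight * clustering
```

Calling `combined_loss` on three lists of equally sized embeddings returns a single scalar. In this toy setting, embeddings that are aligned across modalities yield a lower value than the same embeddings with one modality shuffled, which is the behavior a training objective of this kind is meant to reward.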

Code Implementations: 1 repo