CV · Nov 19, 2020

Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

arXiv:2011.09634v2
AI Analysis

This work matters to researchers in cross-modal understanding, particularly those working with noisy, unannotated real-world video, because it provides a method for learning mappings between language and video.

This paper addresses the challenge of mapping natural language sentences to noisy real-world video snippets without explicit annotations. The authors propose a self-supervised learning framework with an adversarial module to handle noise, achieving state-of-the-art performance on bidirectional retrieval tasks between sentences and videos.

In this paper, we teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations. First, we define a self-supervised learning framework that captures the cross-modal information. We then introduce a novel adversarial learning module that explicitly handles the noise in natural videos, where the subtitle sentences are not guaranteed to correspond strongly to the video snippets. For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles. We carry out experiments on bidirectional retrieval tasks between sentences and videos, and the results demonstrate that our proposed model achieves state-of-the-art performance on both retrieval tasks and exceeds several strong baselines. The dataset can be downloaded at https://github.com/zyj-13/WAL.
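
To make the bidirectional retrieval task concrete, the sketch below scores sentence-to-video and video-to-sentence retrieval with Recall@K over cosine similarity. This is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say which metric or similarity function the paper uses, and the function names (`cosine`, `recall_at_k`) and the toy 2-D embeddings are hypothetical. The second video embedding is deliberately skewed to mimic a subtitle that only weakly matches its snippet.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings: row i of each list forms a (sentence, video) pair.
sentence_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: the second video is "noisy" and drifts away from its subtitle.
video_emb = [[0.9, 0.1], [0.9, 0.2], [0.6, 0.5]]

s2v_r1 = recall_at_k(sentence_emb, video_emb, k=1)  # sentence -> video
s2v_r2 = recall_at_k(sentence_emb, video_emb, k=2)
v2s_r1 = recall_at_k(video_emb, sentence_emb, k=1)  # video -> sentence
print(s2v_r1, s2v_r2, v2s_r1)
```

In this toy setup the noisy pair is missed at K=1 in both directions, which is the kind of weak subtitle-to-snippet correspondence the abstract says the adversarial module is meant to handle.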
