CV · Dec 13, 2019

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

arXiv:1912.06430v4 · 791 citations
Originality: Highly original
AI Analysis

This work targets the cost and poor scalability of manual video annotation, a bottleneck for computer vision researchers and practitioners, and proposes a self-supervised approach that is novel rather than incremental.

The paper tackles the problem of learning video representations without manual annotation by addressing misalignments in narrated videos, resulting in a method that outperforms all published self-supervised approaches and several fully supervised baselines across eight datasets.

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
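The abstract names MIL-NCE but does not spell out the objective. As a rough illustration only, the sketch below combines the two ideas the name suggests: multiple-instance learning (each clip gets a bag of candidate narrations, any of which may be the correctly aligned one) and a noise-contrastive loss against the other narrations in the batch. The function names, tensor shapes, and the choice of negatives are assumptions made for this example and are not taken from the text above; the authors' implementation may differ, for instance in how negatives are drawn.

```python
import numpy as np


def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))


def mil_nce_loss(video_emb, text_emb):
    """Illustrative MIL-NCE-style loss (hypothetical helper, not the authors' code).

    video_emb: (B, D) array, one embedding per video clip.
    text_emb:  (B, K, D) array, K candidate narrations per clip
               (e.g. the temporally nearest captions), assumed here.
    """
    B, K, _ = text_emb.shape
    # scores[i, j, k] = dot product of clip i with the k-th candidate of clip j
    scores = np.einsum("id,jkd->ijk", video_emb, text_emb)
    # Positive bag: all K candidates belonging to the same clip
    pos = scores[np.arange(B), np.arange(B), :]            # shape (B, K)
    log_num = logsumexp(pos, axis=1)                       # sum over the bag
    # Denominator: positives plus every other clip's narrations (assumed negatives)
    log_den = logsumexp(scores.reshape(B, B * K), axis=1)
    # Negative log of the ratio, averaged over the batch
    return float(np.mean(log_den - log_num))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.normal(size=(4, 8))
    t = rng.normal(size=(4, 3, 8))
    print(mil_nce_loss(v, t))
```

Summing over the whole bag in the numerator means the loss does not need to know which of the K narrations actually matches the clip, which is how this kind of objective tolerates misaligned narration.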

Code Implementations: 4 repos