CV · Aug 4, 2017

Localizing Moments in Video with Natural Language

arXiv:1708.01641v1 · 1204 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of pinpointing when events occur in videos, with applications such as video search and analysis. The advance is incremental because it builds on existing video retrieval methods.

The paper tackles the retrieval of specific temporal segments in videos from natural language descriptions. Its result is the Moment Context Network (MCN), a model that outperforms baseline methods and is supported by the new DiDeMo dataset of over 10,000 videos.

We consider retrieving a specific temporal segment, or moment, from a video given a natural language text description. Methods designed to retrieve whole video clips with natural language determine what occurs in a video but not when. To address this issue, we propose the Moment Context Network (MCN) which effectively localizes natural language queries in videos by integrating local and global video features over time. A key obstacle to training our MCN model is that current video datasets do not include pairs of localized video segments and referring expressions, or text descriptions which uniquely identify a corresponding moment. Therefore, we collect the Distinct Describable Moments (DiDeMo) dataset which consists of over 10,000 unedited, personal videos in diverse visual settings with pairs of localized video segments and referring expressions. We demonstrate that MCN outperforms several baseline methods and believe that our initial results together with the release of DiDeMo will inspire further research on localizing video moments with natural language.
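The idea of combining local and global video features to score candidate moments can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration, not the authors' implementation: the video is taken to be split into a fixed number of segments with precomputed feature vectors, the local feature is a mean over the moment's segments, the global feature is a mean over the whole video, normalized endpoints stand in for temporal position, and `embed_video` is a hypothetical placeholder for a learned projection into the same space as the query embedding.

```python
from itertools import combinations_with_replacement


def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def candidate_moments(num_segments):
    """All contiguous (start, end) segment spans, with end inclusive."""
    return list(combinations_with_replacement(range(num_segments), 2))


def moment_feature(segment_feats, start, end):
    """Concatenate local, global, and endpoint features (assumed layout)."""
    local = mean_pool(segment_feats[start:end + 1])   # what happens in the moment
    global_ctx = mean_pool(segment_feats)             # context from the whole video
    n = len(segment_feats)
    endpoints = [start / n, (end + 1) / n]            # where the moment sits in time
    return local + global_ctx + endpoints


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_moments(segment_feats, query_embedding, embed_video):
    """Return candidate spans sorted from best to worst match.

    embed_video is a hypothetical stand-in for a learned network that maps
    a moment feature into the same space as query_embedding.
    """
    scored = []
    for start, end in candidate_moments(len(segment_feats)):
        feat = moment_feature(segment_feats, start, end)
        scored.append((squared_distance(embed_video(feat), query_embedding), (start, end)))
    scored.sort()
    return [span for _, span in scored]


if __name__ == "__main__":
    # Toy video: six segments with 2-d features; only segment 2 matches the query.
    feats = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    toy_embed = lambda f: f[:2]  # toy projection that keeps only the local part
    print(rank_moments(feats, [1.0, 0.0], toy_embed)[0])
```

In this toy setup the top-ranked span is the single segment whose feature equals the query. A trained model would replace the toy projection with learned video and language embeddings, so the global and endpoint features would also influence the ranking.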

Code Implementations: 2 repos
