CV · Nov 30, 2018

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

arXiv:1812.00087v2 · 340 citations
Originality: Highly original
AI Analysis

This work addresses the retrieval of specific video moments from natural language queries, with applications in video analysis. It offers incremental improvements in modeling temporal relations between moments.

The paper tackles natural language moment retrieval in untrimmed videos by addressing semantic and structural misalignment. It introduces the Moment Alignment Network (MAN), which unifies moment encoding and temporal reasoning, and reports significant improvements over state-of-the-art methods on the DiDeMo and Charades-STA benchmarks.

This research addresses natural language moment retrieval in long, untrimmed video streams. The problem is non-trivial, especially when a video contains multiple moments of interest and the language describes complex temporal dependencies, which often happens in real scenarios. We identify two crucial challenges: semantic misalignment and structural misalignment. Existing approaches, however, treat different moments separately and do not explicitly model complex moment-wise temporal relations. In this paper, we present the Moment Alignment Network (MAN), a novel framework that unifies candidate moment encoding and temporal structural reasoning in a single-shot feed-forward network. MAN naturally assigns candidate moment representations aligned with language semantics over different temporal locations and scales. Most importantly, we propose to explicitly model moment-wise temporal relations as a structured graph and devise an iterative graph adjustment network to jointly learn the best structure in an end-to-end manner. We evaluate the proposed approach on two challenging public benchmarks, DiDeMo and Charades-STA, where MAN outperforms the state of the art by a large margin.
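The idea of treating candidate moments as graph nodes and adjusting the graph iteratively can be illustrated with a small sketch. This is not the paper's formulation: the abstract gives no equations, so the update rule here (previous adjacency plus a row-wise softmax of feature similarity, followed by ReLU propagation, with no learned weights) is an assumption chosen for illustration. The names `adjust_graph`, `matmul`, and `softmax_rows` are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def adjust_graph(features, adjacency, steps=2):
    """Jointly refine moment features and the graph that connects them.

    Assumed update (illustrative only, no learned parameters):
      A <- A + softmax(X X^T)   # residual adjustment of the edges
      X <- ReLU(A X)            # propagate features along the edges
    """
    x, a = features, adjacency
    for _ in range(steps):
        # Pairwise dot-product similarity between moment features.
        sim = matmul(x, [list(col) for col in zip(*x)])
        # Keep the previous graph and add the normalized similarity.
        a = [[p + q for p, q in zip(ra, rs)] for ra, rs in zip(a, softmax_rows(sim))]
        # Aggregate neighbour features, then clip negatives.
        x = [[max(0.0, v) for v in row] for row in matmul(a, x)]
    return x, a

# Three candidate moments with toy 2-d features; moments 0 and 1 are similar.
moments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
refined, graph = adjust_graph(moments, identity, steps=2)
```

In this toy run the edge between the two similar moments ends up stronger than the edge to the dissimilar one, which is the intuition behind letting the graph structure adapt to the features instead of fixing it in advance.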

