CV · CL · Oct 12, 2021

Relation-aware Video Reading Comprehension for Temporal Language Grounding

arXiv:2110.05717v3670 citations
Originality: Highly original
AI Analysis

This work addresses the problem of localizing video moments from natural language queries and presents a method that it reports improves accuracy over previous approaches.

The paper tackles temporal language grounding in videos by formulating it as a video reading comprehension task. It proposes a Relation-aware Network (RaNet), reported to achieve state-of-the-art performance on the ActivityNet-Captions, TACoS, and Charades-STA datasets.

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it as either a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor matches visual and textual information simultaneously at the sentence-moment and token-moment levels, yielding a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor leverages graph convolution to capture the dependencies among video moment choices for best-choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. The code has been made available.
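
The multiple-choice formulation in the abstract can be illustrated with a toy sketch. This is not the authors' RaNet implementation: it has no learned parameters, and every name in it (`enumerate_choices`, `temporal_iou`, `select_moment`) is hypothetical. It assumes that the answer set is every contiguous clip span, that moment features are max-pooled clip features, that cosine similarity stands in for the learned choice-query interactor, and that a single IoU-weighted averaging step stands in for graph convolution over the choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def enumerate_choices(num_clips):
    """Assumed answer set: every (start, end) clip span, end inclusive."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]


def temporal_iou(a, b):
    """Temporal IoU of two inclusive clip spans."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    union = max(a[1], b[1]) - min(a[0], b[0]) + 1
    return max(inter, 0) / union


def select_moment(clip_feats, token_feats):
    """Score every moment choice against the query and return the best one."""
    choices = enumerate_choices(len(clip_feats))
    # Moment feature: element-wise max over the clips inside the span.
    moments = [[max(col) for col in zip(*clip_feats[s:e + 1])]
               for s, e in choices]

    # Coarse (sentence-moment) match: compare with the mean token vector.
    n_tok = len(token_feats)
    sentence = [sum(col) / n_tok for col in zip(*token_feats)]
    coarse = [cosine(m, sentence) for m in moments]

    # Fine (token-moment) match: average similarity to each individual token.
    fine = [sum(cosine(m, t) for t in token_feats) / n_tok for m in moments]
    scores = [c + f for c, f in zip(coarse, fine)]

    # Choice-choice relation: one propagation step on a graph whose edge
    # weights are temporal IoU (an assumption), normalized per row.
    refined = []
    for ci in choices:
        weights = [temporal_iou(ci, cj) for cj in choices]
        total = sum(weights)  # always >= 1 because of the self-loop
        refined.append(sum(w * s for w, s in zip(weights, scores)) / total)

    best = max(range(len(choices)), key=lambda i: refined[i])
    return choices[best], refined


# Toy input: the two query tokens match clips 1 and 2 respectively.
clips = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
tokens = [[1, 0, 0], [0, 1, 0]]
best_span, _ = select_moment(clips, tokens)
print(best_span)
```

In this toy setup the relation step only smooths scores across overlapping choices. A trained model would instead learn the interaction and graph-convolution weights from annotated video-query pairs.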

Code Implementations: 1 repo
