CV, IR · Aug 4, 2020

Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

arXiv:2008.01403v2 · 145 citations
AI Analysis

This work addresses localizing video segments from text queries, with applications in video retrieval and analysis. It is an incremental improvement built on a novel hybrid method.

The paper tackles query-based moment localization by proposing a Cross- and Self-Modal Graph Attention Network (CSMGAN). The model uses iterative message passing over a joint graph to capture high-order interactions between the video and sentence modalities, and is reported to significantly outperform state-of-the-art methods on four public datasets.

Query-based moment localization is a new task that localizes the best matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, CMG highlights relevant instances across video and sentence, and then SMG models the pairwise relations inside each modality for frame (word) correlation. With multiple layers of such a joint graph, our CSMGAN is able to effectively capture high-order interactions between the two modalities, thus enabling more precise localization. Besides, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder to enhance query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms state-of-the-art methods.
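The two-step message passing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes plain bilinear dot-product attention with a residual update in place of the paper's parametric message passing, and every name here (`attend`, `joint_graph_layer`, the `W_*` matrices, the feature sizes) is hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(targets, sources, W):
    """Each target node aggregates an attention-weighted message from all source nodes."""
    scores = targets @ W @ sources.T / np.sqrt(targets.shape[1])  # (n_targets, n_sources)
    weights = softmax(scores, axis=1)                              # each row sums to 1
    return weights @ sources, weights

def joint_graph_layer(frames, words, params):
    # Cross-modal step (CMG-like): frames gather from words, words gather from frames.
    msg_f, _ = attend(frames, words, params["W_cross_v"])
    msg_w, _ = attend(words, frames, params["W_cross_s"])
    frames, words = frames + msg_f, words + msg_w  # assumed residual update

    # Self-modal step (SMG-like): nodes gather from their own modality.
    msg_f, _ = attend(frames, frames, params["W_self_v"])
    msg_w, _ = attend(words, words, params["W_self_s"])
    return frames + msg_f, words + msg_w

rng = np.random.default_rng(0)
d = 8                                   # assumed feature dimension
frames = rng.standard_normal((5, d))    # 5 frame nodes (toy data)
words = rng.standard_normal((3, d))     # 3 word nodes (toy data)
names = ("W_cross_v", "W_cross_s", "W_self_v", "W_self_s")
params = {k: 0.1 * rng.standard_normal((d, d)) for k in names}

# Stacking layers lets information travel along longer frame-word-frame paths,
# which is the sense in which multiple layers capture higher-order interactions.
for _ in range(2):
    frames, words = joint_graph_layer(frames, words, params)

print(frames.shape, words.shape)  # (5, 8) (3, 8)
```

In this sketch both cross-modal messages are computed from the pre-update features, so the frame and word updates within a layer are symmetric. Sharing one set of weights across layers is a simplification made here for brevity.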

Code Implementations: 1 repo
