CL CV May 18, 2021

Parallel Attention Network with Sequence Matching for Video Grounding

Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh

arXiv:2105.08481v131.7715 citations

Originality: Incremental advance

AI Analysis

This work addresses video grounding, which supports applications such as video search and analysis. The advance is incremental: it builds on existing methods and adds targeted improvements.

The paper proposes SeqPAN, a video grounding method that combines a self-guided parallel attention module with a sequence matching strategy to retrieve the temporal moment in a video that matches a language query. The authors report that it outperforms state-of-the-art methods on three datasets.

Given a video, video grounding aims to retrieve a temporal moment that semantically corresponds to a language query. In this work, we propose a Parallel Attention Network with Sequence matching (SeqPAN) to address the challenges in this task: multi-modal representation learning, and target moment boundary prediction. We design a self-guided parallel attention module to effectively capture self-modal contexts and cross-modal attentive information between video and text. Inspired by sequence labeling tasks in natural language processing, we split the ground truth moment into begin, inside, and end regions. We then propose a sequence matching strategy to guide start/end boundary predictions using region labels. Experimental results on three datasets show that SeqPAN is superior to state-of-the-art methods. Furthermore, the effectiveness of the self-guided parallel attention module and the sequence matching module is verified.
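The begin/inside/end split described in the abstract can be illustrated with a small labeling sketch. This is not the authors' implementation: the function name `label_regions`, the tag letters, and the `boundary_width` parameter are all hypothetical, and the abstract does not say how wide the begin and end regions are, so a fixed width in time steps is assumed here.

```python
from typing import List

# Hypothetical tag names, borrowed from sequence-labeling conventions in NLP.
OUTSIDE, BEGIN, INSIDE, END = "O", "B", "I", "E"


def label_regions(num_steps: int, start: int, end: int,
                  boundary_width: int = 1) -> List[str]:
    """Tag each time step of a video as outside, begin, inside, or end.

    `start` and `end` are inclusive indices of the ground-truth moment.
    `boundary_width` (an assumption, not taken from the paper) sets how many
    steps at each edge of the moment count as the begin and end regions.
    """
    if not 0 <= start <= end < num_steps:
        raise ValueError("need 0 <= start <= end < num_steps")
    if boundary_width < 1:
        raise ValueError("boundary_width must be at least 1")

    labels = [OUTSIDE] * num_steps

    # Everything within the moment starts out as "inside".
    for t in range(start, end + 1):
        labels[t] = INSIDE

    # The first `boundary_width` steps of the moment form the begin region.
    for t in range(start, min(start + boundary_width, end + 1)):
        labels[t] = BEGIN

    # The last `boundary_width` steps form the end region. On very short
    # moments the begin region takes precedence where the two overlap.
    for t in range(max(end - boundary_width + 1, start), end + 1):
        if labels[t] == INSIDE:
            labels[t] = END

    return labels


if __name__ == "__main__":
    # A moment spanning steps 2..5 of an 8-step video.
    print(label_regions(8, 2, 5))
    # ['O', 'O', 'B', 'I', 'I', 'E', 'O', 'O']
```

Labels of this kind could then serve as auxiliary targets alongside the start/end boundary predictions. How SeqPAN actually combines the two is described in the paper itself, not in this sketch.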
