CV · Jun 20, 2021

Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding

Chaolei Tan, Zihang Lin, Jian-Fang Hu, Xiang Li, Wei-Shi Zheng

arXiv:2106.10634v28.715 citations

Originality: Incremental advance

AI Analysis

This work addresses video understanding for applications like surveillance or robotics, but it is incremental as it builds on existing methods like 2D-TAN and MDETR.

The paper tackles human-centric spatio-temporal video grounding with a two-stage approach: an augmented 2D-TAN improves temporal grounding, and a pretrained MDETR combined with hand-crafted rules handles spatial localization. No specific performance figures are given.

We propose an effective two-stage approach to tackle the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. Primarily, we improve the original 2D-TAN in two respects: First, a temporal context-aware Bi-LSTM Aggregation Module is developed to aggregate clip-level representations, replacing the original max-pooling. Second, we propose to employ a Random Concatenation Augmentation (RCA) mechanism during the training phase. In the second stage, we use a pretrained MDETR model to generate per-frame bounding boxes via language query, and design a set of hand-crafted rules to select the best-matching bounding box output by MDETR for each frame within the grounded moment.
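The abstract names Random Concatenation Augmentation but does not describe its mechanics. The following is a minimal sketch, assuming RCA splices the clip features of a second, unrelated video before or after the annotated one and shifts the ground-truth moment to match. The function name `random_concat_augment`, the 50/50 choice of side, and the inclusive clip-index representation of a moment are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import List, Sequence, Tuple

Clip = Sequence[float]  # one clip-level feature vector (hypothetical layout)


def random_concat_augment(
    clips: List[Clip],
    moment: Tuple[int, int],
    other_clips: List[Clip],
    rng: random.Random,
) -> Tuple[List[Clip], Tuple[int, int]]:
    """Splice an unrelated video before or after the annotated one.

    `moment` is an inclusive (start, end) pair of clip indices into `clips`.
    Returns the longer clip sequence and the moment re-indexed into it.
    Assumption: the side is chosen uniformly at random.
    """
    start, end = moment
    if rng.random() < 0.5:
        # Prepend: the target moment moves later by len(other_clips).
        offset = len(other_clips)
        return list(other_clips) + list(clips), (start + offset, end + offset)
    # Append: clip indices of the target moment are unchanged.
    return list(clips) + list(other_clips), moment


if __name__ == "__main__":
    rng = random.Random(0)
    video = [[float(i)] for i in range(4)]   # 4 clips, 1-d features
    distractor = [[-1.0], [-2.0]]            # 2 clips from another video
    augmented, (s, e) = random_concat_augment(video, (1, 2), distractor, rng)
    # The annotated clips are still found at the shifted indices.
    print(len(augmented), augmented[s:e + 1])
```

Under these assumptions the description stays paired with the same clips, while the model sees a longer sequence containing distractor content. A pipeline that feeds a fixed number of clips to the temporal grounding network would likely need to resample the concatenated sequence afterwards, which this sketch leaves out.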
