CV, Jun 20, 2021

Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding

arXiv:2106.10634v2, 15 citations
Originality: Incremental advance
AI Analysis

This work addresses video understanding for applications such as surveillance and robotics. It is an incremental advance because it builds on existing methods, namely 2D-TAN and MDETR.

The paper tackles human-centric spatio-temporal video grounding with a two-stage approach: an augmented 2D-TAN improves temporal grounding, and MDETR combined with hand-crafted rules handles spatial localization. The size of the resulting performance gain is not specified.

We propose an effective two-stage approach to tackle the language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. We improve the original 2D-TAN in two respects. First, a temporal context-aware Bi-LSTM Aggregation Module is developed to aggregate clip-level representations, replacing the original max-pooling. Second, we employ a Random Concatenation Augmentation (RCA) mechanism during the training phase. In the second stage, we use a pretrained MDETR model to generate per-frame bounding boxes via a language query, and design a set of hand-crafted rules to select the best-matching bounding box output by MDETR for each frame within the grounded moment.
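The abstract names Random Concatenation Augmentation but does not describe how it works. The sketch below is one plausible reading, not the authors' implementation: it assumes RCA joins the clip features of the annotated video with those of a randomly chosen second video and shifts the ground-truth moment to match. The function name `random_concat`, the inclusive clip-index span, and the 50/50 choice of side are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Clip = Sequence[float]  # one clip-level feature vector (hypothetical layout)


def random_concat(
    target_clips: List[Clip],
    target_span: Tuple[int, int],
    distractor_clips: List[Clip],
    rng: random.Random,
) -> Tuple[List[Clip], Tuple[int, int]]:
    """Join a distractor video before or after the annotated video.

    target_span is an inclusive (start, end) range of clip indices that
    marks the described moment inside target_clips. The returned span
    points at the same clips inside the concatenated sequence.
    """
    start, end = target_span
    if not (0 <= start <= end < len(target_clips)):
        raise ValueError("target_span must lie inside target_clips")

    if rng.random() < 0.5:
        # Distractor goes first, so the annotated span shifts right.
        offset = len(distractor_clips)
        joined = list(distractor_clips) + list(target_clips)
        return joined, (start + offset, end + offset)

    # Distractor goes last, so the span indices are unchanged.
    joined = list(target_clips) + list(distractor_clips)
    return joined, (start, end)


# Example: a 6-clip annotated video and a 4-clip distractor.
video = [[float(i)] for i in range(6)]
other = [[-1.0] for _ in range(4)]
clips, span = random_concat(video, (2, 4), other, random.Random(0))
```

Under this reading, the augmented sample keeps the original description while the model has to localize the moment inside a longer, partly unrelated sequence.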

