CV · MM · Aug 16, 2020

Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding

arXiv:2008.06941v2 · 43 citations
Originality: Incremental advance
AI Analysis

This work addresses a challenging video understanding problem for AI systems, but appears incremental because it builds on existing grounding methods.

The paper tackles spatio-temporal video grounding on unaligned data and multi-form sentences by proposing an object-aware multi-branch relation network that captures critical object relationships. The authors report that extensive experiments show the method is effective.

Spatio-temporal video grounding aims to retrieve the spatio-temporal tube of a queried object according to the given sentence. Currently, most existing grounding methods are restricted to well-aligned segment-sentence pairs. In this paper, we explore spatio-temporal video grounding on unaligned data and multi-form sentences. This challenging task requires capturing critical object relations to identify the queried target. However, existing approaches cannot distinguish notable objects and are left with ineffective relation modeling among unnecessary objects. Thus, we propose a novel object-aware multi-branch relation network for object-aware relation discovery. Concretely, we first devise multiple branches to develop object-aware region modeling, where each branch focuses on a crucial object mentioned in the sentence. We then propose multi-branch relation reasoning to capture critical object relationships between the main branch and auxiliary branches. Moreover, we apply a diversity loss so that each branch attends only to its corresponding object, which boosts multi-branch learning. Extensive experiments show the effectiveness of our proposed method.
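The abstract does not give the form of the diversity loss, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes each branch produces a softmax attention distribution over the same set of candidate regions, and it penalizes the mean pairwise overlap (dot product) between branches. The names `softmax` and `diversity_penalty` and the example values are hypothetical.

```python
import math
from itertools import combinations


def softmax(scores):
    """Turn raw region scores into an attention distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_penalty(branch_attention):
    """Mean pairwise overlap between branch attention distributions.

    branch_attention: list of per-branch distributions over the same regions.
    The value is 0 when branches attend to disjoint regions and grows as
    branches concentrate on the same regions (an assumed penalty form).
    """
    pairs = list(combinations(branch_attention, 2))
    if not pairs:
        return 0.0  # a single branch has nothing to overlap with
    overlap = sum(sum(a * b for a, b in zip(p, q)) for p, q in pairs)
    return overlap / len(pairs)


# Hypothetical attention maps over three regions for two branches.
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
identical = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

low = diversity_penalty(disjoint)    # branches look at different objects
high = diversity_penalty(identical)  # branches collapse onto one object
```

Under this assumed form, minimizing the penalty alongside the grounding objective would push each branch toward a different object, which is the behavior the abstract attributes to its diversity loss.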

