CV | Jul 18, 2023

In Defense of Clip-based Video Relation Detection

Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Roger Zimmermann

arXiv:2307.08984v15.97 citations | h-index: 34

Originality: Incremental advance

AI Analysis

Aimed at computer vision researchers, this work addresses video visual relation detection (VidVRD) and shows that clip-based methods can outperform most video-based ones when spatial and temporal context is modeled well. The contribution is incremental, since it builds on the existing clip-based paradigm.

The paper tackles video visual relation detection by revisiting clip-based approaches and proposing a Hierarchical Context Model that enriches spatial and temporal context, achieving state-of-the-art performance on two benchmarks.

Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries. Existing VidVRD methods can be broadly categorized into bottom-up and top-down paradigms, depending on their approach to classifying relations. Bottom-up methods follow a clip-based approach where they classify relations of short clip tubelet pairs and then merge them into long video relations. On the other hand, top-down methods directly classify long video tubelet pairs. While recent video-based methods utilizing video tubelets have shown promising results, we argue that the effective modeling of spatial and temporal context plays a more significant role than the choice between clip tubelets and video tubelets. This motivates us to revisit the clip-based paradigm and explore the key success factors in VidVRD. In this paper, we propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips. We demonstrate that using clip tubelets can achieve superior performance compared to most video-based methods. Additionally, using clip tubelets offers more flexibility in model designs and helps alleviate the limitations associated with video tubelets, such as the challenging long-term object tracking problem and the loss of temporal information in long-term tubelet feature compression. Extensive experiments conducted on two challenging VidVRD benchmarks validate that our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
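The bottom-up pipeline described in the abstract ends with a merging step, in which relations classified on short clips are joined into longer video-level relations. The sketch below illustrates one simple greedy way to do this. It is not the paper's HCM and not its association procedure. All names (`ClipRelation`, `VideoRelation`, `merge_clip_relations`, `max_gap`) are hypothetical, and the sketch assumes each clip relation already carries subject and object identity keys, standing in for the tubelet-overlap check a real system would need.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClipRelation:
    """A relation predicted on one short clip (hypothetical structure)."""
    triplet: Tuple[str, str, str]  # (subject, predicate, object)
    subj_id: int                   # assumed identity key for the subject tubelet
    obj_id: int                    # assumed identity key for the object tubelet
    start: int                     # first frame, inclusive
    end: int                       # last frame, exclusive
    score: float                   # classifier confidence for this clip


@dataclass(frozen=True)
class VideoRelation:
    """A merged relation spanning one or more clips."""
    triplet: Tuple[str, str, str]
    subj_id: int
    obj_id: int
    start: int
    end: int
    score: float                   # mean of the merged clip scores


def merge_clip_relations(
    clip_relations: List[ClipRelation], max_gap: int = 0
) -> List[VideoRelation]:
    """Greedily merge clip relations that share a triplet and entity pair
    and whose frame ranges overlap or lie within max_gap frames."""
    groups: Dict[tuple, List[dict]] = {}
    for rel in sorted(clip_relations, key=lambda r: r.start):
        key = (rel.triplet, rel.subj_id, rel.obj_id)
        segments = groups.setdefault(key, [])
        if segments and rel.start <= segments[-1]["end"] + max_gap:
            # Extend the most recent segment for this key.
            segments[-1]["end"] = max(segments[-1]["end"], rel.end)
            segments[-1]["scores"].append(rel.score)
        else:
            # Start a new segment: first occurrence, or a temporal gap.
            segments.append(
                {"start": rel.start, "end": rel.end, "scores": [rel.score]}
            )

    merged: List[VideoRelation] = []
    for (triplet, subj_id, obj_id), segments in groups.items():
        for seg in segments:
            mean_score = sum(seg["scores"]) / len(seg["scores"])
            merged.append(
                VideoRelation(triplet, subj_id, obj_id,
                              seg["start"], seg["end"], mean_score)
            )
    return merged
```

With 30-frame clips sampled every 15 frames, three consecutive detections of the same triplet covering frames 0 to 60 would collapse into one video relation, while a later detection at frames 90 to 120 would remain a separate instance. Averaging scores is only one possible choice; taking the maximum or the top-k mean are alternatives.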
