CV | Sep 19, 2024

End-to-end Open-vocabulary Video Visual Relationship Detection using Multi-modal Prompting

arXiv:2409.12499v2 | 4 citations | h-index: 4
AI Analysis

This work addresses a limitation of open-vocabulary video visual relationship detection: its reliance on trajectory detectors pre-trained on closed datasets. It offers an incremental improvement by reducing that dependence, so that novel object categories are handled better.

The paper tackles open-vocabulary video visual relationship detection, which means detecting relationships between objects in videos that fall outside the annotated categories. It proposes an end-to-end framework that unifies object trajectory detection and relationship classification. The resulting method outperforms existing approaches on the VidVRD and VidOR datasets and generalizes well in cross-dataset scenarios.

Open-vocabulary video visual relationship detection aims to expand video visual relationship detection beyond annotated categories by detecting unseen relationships between both seen and unseen objects in videos. Existing methods usually use trajectory detectors trained on closed datasets to detect object trajectories, and then feed these trajectories into large-scale pre-trained vision-language models to achieve open-vocabulary classification. Such heavy dependence on the pre-trained trajectory detectors limits their ability to generalize to novel object categories, leading to performance degradation. To address this challenge, we propose to unify object trajectory detection and relationship classification into an end-to-end open-vocabulary framework. Under this framework, we propose a relationship-aware open-vocabulary trajectory detector. It primarily consists of a query-based Transformer decoder, where the visual encoder of CLIP is distilled for frame-wise open-vocabulary object detection, and a trajectory associator. To exploit relationship context during trajectory detection, a relationship query is embedded into the Transformer decoder, and accordingly, an auxiliary relationship loss is designed to enable the decoder to perceive the relationships between objects explicitly. Moreover, we propose an open-vocabulary relationship classifier that leverages the rich semantic knowledge of CLIP to discover novel relationships. To adapt CLIP well to relationship classification, we design a multi-modal prompting method that employs spatio-temporal visual prompting for visual representation and vision-guided language prompting for language input. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our framework. Our framework is also applied to a more difficult cross-dataset scenario to further demonstrate its generalization ability.
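The open-vocabulary classification step described above can be illustrated with a minimal sketch. It assumes the usual CLIP-style scoring, where a visual embedding is compared against text embeddings of candidate labels by cosine similarity and the scores are turned into probabilities with a temperature-scaled softmax. The function names, the toy three-dimensional embeddings, the predicate labels, and the temperature value are all hypothetical; the sketch does not reproduce the paper's multi-modal prompting or its actual classifier.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_open_vocab(visual_feat, text_feats, temperature=0.07):
    """Score a visual feature against text embeddings of candidate labels.

    Returns a dict mapping each label to a probability, computed as a
    softmax over cosine similarities divided by `temperature`
    (an assumed value, not taken from the paper).
    """
    logits = {
        label: cosine(visual_feat, feat) / temperature
        for label, feat in text_feats.items()
    }
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits.values())
    exps = {label: math.exp(v - max_logit) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


# Toy stand-ins for text embeddings of relationship labels (hypothetical).
# New labels can be added at inference time without retraining, which is
# the point of the open-vocabulary setting.
text_feats = {
    "ride": [0.9, 0.1, 0.0],
    "chase": [0.1, 0.9, 0.1],
    "sit_next_to": [0.0, 0.2, 0.9],
}

# Toy stand-in for the visual embedding of a subject-object trajectory pair.
pair_feat = [0.8, 0.2, 0.1]

probs = classify_open_vocab(pair_feat, text_feats)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

In a real system the toy vectors would be replaced by embeddings from a vision-language model's image and text encoders; the scoring logic stays the same.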

