CV | AI | Oct 11, 2024

VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking

arXiv:2410.08529v1 | 5 citations | h-index: 15
Originality: Highly original
AI Analysis

This work addresses the challenge of detecting and tracking diverse object categories in videos, including unseen classes, which is critical for applications such as autonomous systems and video analysis.

The paper tackles open-vocabulary multi-object tracking by integrating object states and video-centric training, resulting in a state-of-the-art method that outperforms existing approaches.

Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens. In this paper, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate localization and classification (detection) of the time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object association (tracking). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.

