CV · Mar 10, 2025

Just Functioning as a Hook for Two-Stage Referring Multi-Object Tracking

arXiv:2503.07516v3 · 1 citation · h-index: 9
Originality: Incremental advance
AI Analysis

This work addresses a specific bottleneck in RMOT for video analysis applications, representing an incremental advancement in the field.

The paper tackles the problem of insufficient modeling of subtask interactions in two-stage Referring Multi-Object Tracking (RMOT) by proposing JustHook, a framework with a Hook module and Parallel Combined Decoder, which achieves state-of-the-art performance with a +6.9% improvement in HOTA on Refer-KITTI-V2.

Referring Multi-Object Tracking (RMOT) aims to localize target trajectories in videos specified by natural language expressions. Despite recent progress, the intrinsic relationship between the two subtasks of RMOT, tracking and referring, has not been fully studied. In this paper, we present a systematic analysis of their interdependence, revealing that current two-stage Referring-by-Tracking (RBT) frameworks remain fundamentally limited by insufficient modeling of subtask interactions and inflexible reliance on semantic alignment modules like CLIP. To this end, we propose JustHook, a novel two-stage RBT framework in which a Hook module is first designed to redefine the linkage between subtasks. The Hook is built around feature-level grid sampling and is used for context-aware target feature extraction. Moreover, we propose a Parallel Combined Decoder (PCD) that learns in a unified joint feature space rather than relying on pre-defined cross-modal embeddings. Our design not only enhances interpretability and modularity but also significantly improves generalization. Extensive experiments on Refer-KITTI, Refer-KITTI-V2, and Refer-Dance demonstrate that JustHook achieves state-of-the-art performance, improving HOTA by +6.9% on Refer-KITTI-V2 with superior efficiency. Code will be available soon.
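
To make "grid sampling at the feature level" concrete, here is a minimal sketch of the general technique: bilinearly sampling a fixed grid of points inside a tracked box on a feature map. It is not the authors' Hook implementation, which the abstract does not specify. The names `bilinear_sample`, `sample_box_grid`, `grid_size`, and `context_scale` are hypothetical, the feature map is single-channel for brevity, and enlarging the box to take in surrounding context is an assumption about what "context-aware" might involve.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]  # single-channel map, indexed as fmap[row][col]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> float:
    """Bilinearly interpolate fmap at continuous (x, y), clamped to the border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def sample_box_grid(
    fmap: FeatureMap,
    box: Tuple[float, float, float, float],
    grid_size: int = 3,
    context_scale: float = 1.0,
) -> List[List[float]]:
    """Sample a grid_size x grid_size grid of features inside a box.

    box is (x_min, y_min, x_max, y_max) in feature-map coordinates.
    context_scale > 1 enlarges the box around its center (hypothetical knob,
    not taken from the paper).
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half_w = (x_max - x_min) / 2 * context_scale
    half_h = (y_max - y_min) / 2 * context_scale
    grid = []
    for i in range(grid_size):
        # Fraction across the box height; a 1x1 grid samples the center.
        fy = i / (grid_size - 1) if grid_size > 1 else 0.5
        row = []
        for j in range(grid_size):
            fx = j / (grid_size - 1) if grid_size > 1 else 0.5
            px = cx - half_w + 2 * half_w * fx
            py = cy - half_h + 2 * half_h * fy
            row.append(bilinear_sample(fmap, px, py))
        grid.append(row)
    return grid


if __name__ == "__main__":
    toy_map = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    print(sample_box_grid(toy_map, (0, 0, 2, 2), grid_size=2))
```

In a deep-learning framework the same operation would typically run batched over multi-channel feature maps. The pure-Python version only shows the interpolation and grid layout, and every sampled box yields a fixed-size output regardless of the box's size.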

