CV | Mar 9, 2025

Dynamic Updates for Language Adaptation in Visual-Language Tracking

Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song

arXiv:2503.06621v1 | 20 citations | h-index: 15 | Has Code | CVPR

Originality: Highly original

AI Analysis

This work addresses robustness issues in visual-language tracking for applications like surveillance and autonomous systems, representing a novel method for a known bottleneck rather than an incremental improvement.

The paper tackles semantic discrepancies in visual-language tracking that arise when static multi-modal references are used to locate a changing target. It proposes DUTrack, a framework that dynamically updates language descriptions and templates to keep them consistent with the tracked object. DUTrack achieves state-of-the-art performance on six tracking benchmarks: LaSOT, LaSOT_ext, TNL2K, OTB99-Lang, GOT-10K, and UAV123.

The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at https://github.com/GXNU-ZhongLab/DUTrack.
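The abstract says the update strategy assesses changes in target displacement, scale, and other factors, but it does not give the rule itself. The following is a minimal sketch of what such a trigger could look like, not the authors' implementation. The `Box` type, the `should_update` function, the normalisation by box diagonal, the use of area as a proxy for scale, and both threshold values are assumptions made for illustration.

```python
from dataclasses import dataclass
import math


@dataclass
class Box:
    """Axis-aligned box: centre (cx, cy), width w, height h, in pixels."""
    cx: float
    cy: float
    w: float
    h: float


def should_update(ref: Box, cur: Box,
                  disp_thresh: float = 0.5,
                  scale_thresh: float = 0.3) -> bool:
    """Return True when the target has drifted enough from the reference state.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    # Centre displacement, normalised by the reference box diagonal so the
    # test does not depend on the absolute size of the target.
    diag = math.hypot(ref.w, ref.h)
    displacement = math.hypot(cur.cx - ref.cx, cur.cy - ref.cy) / diag

    # Relative change in box area, used here as a stand-in for scale change.
    ref_area = ref.w * ref.h
    scale_change = abs(cur.w * cur.h - ref_area) / ref_area

    return displacement > disp_thresh or scale_change > scale_thresh
```

Under these assumptions, a tracker would call `should_update` once per frame with the box stored at the last reference update and the current prediction, and would regenerate the language description and template only when it returns `True`. Gating the update this way keeps the large language model off the per-frame path, which is the efficiency concern the abstract raises.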
