CV · Feb 25

RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

arXiv:2602.22033v1 · 1 citation · h-index: 13
Originality: Incremental advance
AI Analysis

This work targets all-day tracking for applications such as surveillance by fusing RGB and thermal modalities. The advance is incremental, since it builds on existing referring tracking methods.

The authors address referring multi-object tracking in low-visibility conditions by defining a new RGB-thermal task, RT-RMOT. They introduce the RefRT dataset, with 166,147 Language-RGB-Thermal triplets, and the RTrack framework, which they report to be effective in experiments on RefRT.

Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
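The abstract names Clipped Advantage Scaling (CAS) without defining it. As a rough illustration only, the sketch below shows one common way to keep advantages bounded in group-based policy optimization: normalize the rewards of a sampled group, then clip the result to a fixed range. The function name, the `clip` value, and the normalize-then-clip formulation are all assumptions made for this example and may differ from what the paper actually does.

```python
from statistics import mean, pstdev


def clipped_group_advantages(rewards, clip=2.0, eps=1e-6):
    """Hypothetical helper: group-normalize rewards, then clip the advantages.

    rewards: scalar rewards for responses sampled from the same prompt.
    clip:    assumed bound on the advantage magnitude (not from the paper).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Clipping bounds the scale of each sample's contribution to the gradient.
    return [max(-clip, min(clip, a)) for a in advantages]


# One group of eight sampled responses; a single response earns a reward.
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
advantages = clipped_group_advantages(rewards, clip=2.0)
print(advantages)
```

In this group the single rewarded response would receive a normalized advantage of about 2.65, which the clip reduces to 2.0. Bounding outliers in this way is one plausible route to the stability the abstract describes.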

