CV · AI · Feb 4

DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

arXiv:2602.04692v1 · 2 citations · h-index: 13
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of accurately tracking objects from language descriptions in interactive AI systems such as robotics and autonomous driving. The advance is incremental: it extends existing RMOT with depth data.

The paper tackles referring multi-object tracking (RMOT) by introducing a new task, RGBD Referring Multi-Object Tracking (DRMOT), which fuses RGB, depth, and language modalities to improve 3D-aware tracking. The authors also propose a dataset (DRSet) and a framework (DRTrack), and report experiments on DRSet that support the framework's effectiveness.

Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., "the person closest to the camera") and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, an MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.
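
To make the depth-related descriptions concrete, here is a minimal sketch of how a phrase like "the person closest to the camera" could be resolved once a depth map is available. This is not DRTrack's method, which the abstract does not detail. The function names, the box format `(x1, y1, x2, y2)`, the use of a median, and the convention that a depth of 0 marks a missing reading are all assumptions made for illustration.

```python
from statistics import median


def box_depth(depth_map, box):
    """Median of the valid (positive) depth values inside box = (x1, y1, x2, y2).

    Hypothetical helper: a depth of 0 is assumed to mean "no reading".
    Returns infinity when the box contains no valid depth values.
    """
    x1, y1, x2, y2 = box
    values = [d for row in depth_map[y1:y2] for d in row[x1:x2] if d > 0]
    return median(values) if values else float("inf")


def closest_to_camera(depth_map, boxes):
    """Return the track id whose box has the smallest median depth."""
    return min(boxes, key=lambda track_id: box_depth(depth_map, boxes[track_id]))


# Toy 4x6 depth map in metres: a far target on the left, a near one on the right.
depth_map = [
    [5.0, 5.0, 5.0, 2.0, 2.0, 0.0],
    [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 2.0, 2.0, 2.0],
    [5.0, 5.0, 5.0, 2.0, 0.0, 2.0],
]
boxes = {1: (0, 0, 3, 4), 2: (3, 0, 6, 4)}  # track id -> bounding box

print(closest_to_camera(depth_map, boxes))  # prints 2
```

With RGB alone, the two boxes above have the same size and give no cue about which target is nearer. The depth map separates them directly, which is the kind of spatial semantics the abstract says 2D-only RMOT models struggle with.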

