Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos
This work addresses multi-object tracking in UAV videos for applications such as surveillance and monitoring. It is an incremental improvement that integrates appearance and motion cues jointly rather than modeling them separately.
The paper tackles multi-object tracking in UAV-captured videos, where viewpoint changes and motion dynamics cause unstable affinity measurements. It proposes AMOT, which jointly exploits appearance and motion cues through an Appearance-Motion Consistency matrix and a Motion-aware Track Continuation module, and reports state-of-the-art performance on the VisDrone2019, UAVDT, and VT-MOT-UAV benchmarks.
Multi-object tracking (MOT) aims to track multiple objects while maintaining consistent identities across the frames of a video. In videos recorded by unmanned aerial vehicles (UAVs), frequent viewpoint changes and complex UAV-ground relative motion pose significant challenges, often leading to unstable affinity measurement and ambiguous association. Existing methods typically model motion and appearance cues separately, overlooking their spatio-temporal interplay and yielding suboptimal tracking performance. In this work, we propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module. Specifically, the AMC matrix computes bi-directional spatial consistency under the guidance of appearance features, enabling more reliable and context-aware identity association. The MTC module complements AMC by reactivating unmatched tracks through appearance-guided predictions that align with Kalman-based predictions, thereby reducing broken trajectories caused by missed detections. Extensive experiments on three UAV benchmarks (VisDrone2019, UAVDT, and VT-MOT-UAV) demonstrate that AMOT outperforms current state-of-the-art methods and generalizes well in a plug-and-play, training-free manner.
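To make the idea of combining appearance and motion cues with a bi-directional check more concrete, the following is a minimal sketch only. It is not the AMC matrix as defined in the paper, whose exact formulation is not given in this abstract. The sketch assumes that each track carries a motion-predicted box (for example from a Kalman filter) and an appearance embedding, fuses IoU with cosine similarity by simple multiplication, and keeps a track-detection pair only when each is the other's best candidate. All function names, the fusion rule, and the `min_score` threshold are hypothetical choices made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two appearance embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def fused_affinity(track_boxes, track_feats, det_boxes, det_feats):
    """Tracks x detections affinity: spatial overlap weighted by appearance.

    Assumption: multiplicative fusion, chosen for simplicity in this sketch.
    """
    return [
        [iou(tb, db) * max(0.0, cosine(tf, df))
         for db, df in zip(det_boxes, det_feats)]
        for tb, tf in zip(track_boxes, track_feats)
    ]

def mutual_matches(aff, min_score=0.1):
    """Keep (track, detection) pairs that are each other's best candidate."""
    matches = []
    if not aff or not aff[0]:
        return matches
    n_trk, n_det = len(aff), len(aff[0])
    for i, row in enumerate(aff):
        j = max(range(n_det), key=lambda d: row[d])           # best detection for track i
        best_i = max(range(n_trk), key=lambda t: aff[t][j])   # best track for detection j
        if best_i == i and row[j] >= min_score:
            matches.append((i, j))
    return matches

# Toy example: two tracks, two detections listed in swapped order.
track_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]   # motion-predicted boxes
track_feats = [(1.0, 0.0), (0.0, 1.0)]
det_boxes = [(21, 20, 31, 30), (1, 0, 11, 10)]
det_feats = [(0.1, 0.9), (0.9, 0.1)]

affinity = fused_affinity(track_boxes, track_feats, det_boxes, det_feats)
matches = mutual_matches(affinity)
print(matches)  # [(0, 1), (1, 0)]
```

In this toy case each track is paired with the detection that agrees with it both spatially and in appearance. Tracks left unmatched after such a step are the ones a continuation module like MTC would try to reactivate.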