CV · Dec 2, 2025

From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

Yuqing Shao, Yuchen Yang, Rui Yu, Weilong Li, Xu Guo, Huaicheng Yan, Wei Wang, Xiao Sun

arXiv:2512.02392v16.21 citations · h-index: 3 · Has Code

Originality: Incremental advance

AI Analysis

This work targets the association bottleneck in multi-object tracking, which matters for applications such as video analysis. The advance is incremental because it builds on existing DETR-based architectures.

The paper tackles the problem of low association accuracy in end-to-end multi-object tracking methods by introducing FDTA, a feature refinement framework that enhances object embeddings through spatial, temporal, and identity adapters, achieving state-of-the-art performance on benchmarks like DanceTrack, SportsMOT, and BFT.

End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbb/FDTA.
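The two ideas at the core of the abstract, high inter-object similarity among embeddings and contrastive learning for instance-level separability, can be illustrated with a small sketch. This is not the FDTA implementation: the abstract does not specify the loss, so the code below assumes a generic InfoNCE objective with a scalar quality weight, and every function and parameter name (`mean_inter_object_similarity`, `weighted_info_nce`, `quality`, `temperature`) is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_inter_object_similarity(embeddings):
    """Average cosine similarity over all distinct object pairs in one frame.

    Values close to 1 mean the embeddings barely separate the objects,
    which is the symptom the abstract describes.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)


def weighted_info_nce(anchor, candidates, positive_index,
                      quality=1.0, temperature=0.1):
    """InfoNCE loss for one anchor, scaled by an assumed quality weight.

    `candidates` holds embeddings from another frame; the entry at
    `positive_index` shares the anchor's identity. `quality` is a
    hypothetical stand-in for "quality-aware" weighting (for example,
    a detection confidence in [0, 1]).
    """
    logits = [cosine(anchor, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return quality * (log_denominator - logits[positive_index])
```

With orthogonal embeddings `[[1.0, 0.0], [0.0, 1.0]]` the mean inter-object similarity is 0, while near-identical embeddings push it toward 1. The loss is small when the anchor is most similar to its positive candidate and grows when a different object is closer, so minimizing it pulls same-identity embeddings together across frames and pushes other objects apart.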
