CV
Dec 1, 2024

MCBLT: Multi-Camera Multi-Object 3D Tracking in Long Videos

Yizhou Wang, Tim Meinhardt, Orcun Cetintas, Cheng-Yen Yang, Sameer Satish Pusegaonkar, Benjamin Missaoui, Sujit Biswas, Zheng Tang, Laura Leal-Taixé

arXiv:2412.00692v37.65 citations
h-index: 6
2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)

Originality: Highly original

AI Analysis

MCBLT addresses the need for robust 3D tracking in indoor environments such as warehouses and hospitals, offering a novel approach that improves over traditional 2D-based methods.

The paper tackles multi-camera multi-object 3D tracking in long videos by proposing MCBLT, a framework that aggregates multi-view images into 3D detections and tracks them with hierarchical GNNs, achieving state-of-the-art results of 81.22 HOTA on AICity'24 and 95.6 IDF1 on WildTrack.

Object perception from multi-view cameras is crucial for intelligent systems, particularly in indoor environments, e.g., warehouses, retail stores, and hospitals. Most traditional multi-target multi-camera (MTMC) detection and tracking methods rely on 2D object detection, single-view multi-object tracking (MOT), and cross-view re-identification (ReID) techniques, without properly handling important 3D information by multi-view image aggregation. In this paper, we propose a 3D object detection and tracking framework, named MCBLT, which first aggregates multi-view images with necessary camera calibration parameters to obtain 3D object detections in bird's-eye view (BEV). Then, we introduce hierarchical graph neural networks (GNNs) to track these 3D detections in BEV for MTMC tracking results. Unlike existing methods, MCBLT has impressive generalizability across different scenes and diverse camera settings, with exceptional capability for long-term association handling. As a result, our proposed MCBLT establishes a new state-of-the-art on the AICity'24 dataset with $81.22$ HOTA, and on the WildTrack dataset with $95.6$ IDF1.
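The two headline numbers above are HOTA and IDF1 scores. As a rough illustration of what those metrics measure, the sketch below computes IDF1 from identity-level counts and combines detection and association accuracy into HOTA at a single localization threshold, using the standard definitions of both metrics. The function names and the example counts are hypothetical and chosen only to show the arithmetic; this is not the evaluation code used by either benchmark, and the full HOTA score additionally averages over a range of localization thresholds.

```python
import math


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), returned as a fraction in [0, 1].

    idtp, idfp, idfn are identity true positives, false positives and
    false negatives after the best one-to-one matching of predicted
    tracks to ground-truth identities.
    """
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def hota_at_threshold(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


# Hypothetical counts, not taken from the paper.
score = idf1(idtp=90, idfp=10, idfn=10)
print(f"IDF1 = {100 * score:.1f}")
```

Benchmarks usually report both metrics scaled to 0-100, which is how figures such as 81.22 HOTA and 95.6 IDF1 should be read.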
