So you think you can track?
This work provides a new benchmark for traffic scene understanding, targeting long-term tracking in dense, real-world highway environments.
The authors tackle multi-camera tracking in traffic scenes by introducing a large-scale dataset of 234 hours of video from 234 overlapping cameras. They find that the benchmarked tracking-by-detection algorithms perform poorly, reaching at best 9.5% HOTA and 75.9% recall at an IOU threshold of 0.1.
This work introduces a multi-camera tracking dataset consisting of 234 hours of video data recorded concurrently from 234 overlapping HD cameras covering a 4.2 mile stretch of 8-10 lane interstate highway near Nashville, TN. The video is recorded during a period of high traffic density with 500+ objects typically visible within the scene and typical object longevities of 3-15 minutes. GPS trajectories from 270 vehicle passes through the scene are manually corrected in the video data to provide a set of ground-truth trajectories for recall-oriented tracking metrics, and object detections are provided for each camera in the scene (159 million total before cross-camera fusion). Initial benchmarking of tracking-by-detection algorithms is performed against the GPS trajectories, and a best HOTA of only 9.5% is obtained (best recall 75.9% at IOU 0.1, 47.9 average IDs per ground truth object), indicating the benchmarked trackers do not perform sufficiently well at the long temporal and spatial durations required for traffic scene understanding.
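To make the two recall-oriented figures above more concrete (recall at an IOU threshold, and average IDs per ground-truth object), here is a minimal sketch of how such numbers could be computed. It is not the authors' evaluation code. It assumes 2D axis-aligned boxes given as (x1, y1, x2, y2), per-frame dictionaries of boxes, and a simple best-IOU match per ground-truth box rather than a one-to-one assignment. The names `iou` and `recall_and_ids` are hypothetical.

```python
from collections import defaultdict


def iou(a, b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_and_ids(gt, pred, thresh=0.1):
    """Return (recall, average number of distinct track IDs per GT object).

    gt:   {frame: {gt_id: box}}     ground-truth boxes (assumed layout)
    pred: {frame: {track_id: box}}  tracker output (assumed layout)

    Simplification: each ground-truth box independently takes the
    best-overlapping prediction, so one prediction may match several
    ground-truth boxes. A full evaluation would use one-to-one matching.
    """
    matched, total = 0, 0
    ids_per_gt = defaultdict(set)
    for frame, gt_boxes in gt.items():
        pred_boxes = pred.get(frame, {})
        for gt_id, g in gt_boxes.items():
            total += 1
            ids_per_gt[gt_id]  # register the object even if never matched
            best_id, best_iou = None, thresh
            for track_id, p in pred_boxes.items():
                overlap = iou(g, p)
                if overlap >= best_iou:
                    best_id, best_iou = track_id, overlap
            if best_id is not None:
                matched += 1
                ids_per_gt[gt_id].add(best_id)
    recall = matched / total if total else 0.0
    n_objects = len(ids_per_gt)
    avg_ids = sum(len(s) for s in ids_per_gt.values()) / n_objects if n_objects else 0.0
    return recall, avg_ids
```

Under this reading, a tracker can reach fairly high recall while still fragmenting each vehicle into many short tracks. That is the pattern the reported numbers suggest: 75.9% recall alongside 47.9 average IDs per ground-truth object.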