CV · Sep 13, 2023

So you think you can track?

arXiv:2309.07268v1 · 23 citations · h-index: 53
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for traffic scene understanding that targets long-term tracking in dense, real-world highway environments.

The authors tackle multi-camera tracking in traffic scenes by introducing a large-scale dataset with 234 hours of video from 234 overlapping cameras. Benchmarked tracking-by-detection algorithms perform poorly against it: the best HOTA is only 9.5%, and the best recall is 75.9% at an IoU threshold of 0.1.

This work introduces a multi-camera tracking dataset consisting of 234 hours of video data recorded concurrently from 234 overlapping HD cameras covering a 4.2 mile stretch of 8-10 lane interstate highway near Nashville, TN. The video is recorded during a period of high traffic density with 500+ objects typically visible within the scene and typical object longevities of 3-15 minutes. GPS trajectories from 270 vehicle passes through the scene are manually corrected in the video data to provide a set of ground-truth trajectories for recall-oriented tracking metrics, and object detections are provided for each camera in the scene (159 million total before cross-camera fusion). Initial benchmarking of tracking-by-detection algorithms is performed against the GPS trajectories, and a best HOTA of only 9.5% is obtained (best recall 75.9% at IOU 0.1, 47.9 average IDs per ground truth object), indicating the benchmarked trackers do not perform sufficiently well at the long temporal and spatial durations required for traffic scene understanding.
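
To make the recall-oriented numbers above concrete, the following is a minimal sketch of how recall at an IoU threshold and the number of distinct track IDs per ground-truth object could be computed for a single ground-truth trajectory. It is an illustration only, not the paper's evaluation code: the function names (`iou`, `recall_and_ids`), the axis-aligned box format, and the per-frame best-match rule are all assumptions made for this example.

```python
from typing import Dict, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_and_ids(
    gt: Dict[int, Box],
    preds: Dict[int, Dict[int, Box]],
    thresh: float = 0.1,
) -> Tuple[float, int]:
    """Score one ground-truth object against tracker output.

    gt:    frame index -> ground-truth box for this object.
    preds: frame index -> {track_id: predicted box}.
    Returns (recall, number of distinct track IDs matched to the object).
    """
    hits = 0
    matched_ids = set()
    for frame, gt_box in gt.items():
        best_id, best_iou = None, thresh
        # Hypothetical matching rule: take the prediction with the highest
        # IoU in this frame, provided it reaches the threshold.
        for track_id, box in preds.get(frame, {}).items():
            overlap = iou(gt_box, box)
            if overlap >= best_iou:
                best_id, best_iou = track_id, overlap
        if best_id is not None:
            hits += 1
            matched_ids.add(best_id)
    recall = hits / len(gt) if gt else 0.0
    return recall, len(matched_ids)
```

Under these assumptions, a ground-truth object that is covered in most frames but by many different track IDs scores high recall and a high ID count at the same time, which is the pattern the abstract reports (75.9% recall alongside 47.9 average IDs per ground-truth object).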

