Multiple People Tracking Using Hierarchical Deep Tracklet Re-identification
This work addresses the challenge of long-term occlusions in crowded scenes for video surveillance or autonomous systems, representing an incremental improvement over existing tracking-by-detection approaches.
The paper tackles multiple people tracking in monocular videos by proposing a hierarchical clustering framework that uses a novel multi-stage deep network for tracklet re-identification; the method is reported to significantly outperform state-of-the-art methods on the MOT16 and MOT17 benchmarks.
Multiple people tracking in monocular videos is challenging because of the many difficulties involved: occlusions, varying environments, crowded scenes, and camera parameters and motion. In the tracking-by-detection paradigm, most approaches adopt person re-identification techniques that compute the pairwise similarity between detections. However, these techniques are less effective at handling long-term occlusions. By contrast, re-identification of tracklets (sequences of detections) can improve association accuracy, since tracklets offer a richer set of visual appearance and spatio-temporal cues. In this paper, we propose a tracking framework that employs a hierarchical clustering mechanism for merging tracklets. To this end, tracklet re-identification is performed by a novel multi-stage deep network that jointly reasons about the visual appearance and spatio-temporal properties of a pair of tracklets, thereby providing a robust measure of affinity. Experimental results on the challenging MOT16 and MOT17 benchmarks show that our method significantly outperforms state-of-the-art methods.
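To make the idea of hierarchically merging tracklets by pairwise affinity concrete, the following is a minimal sketch and not the paper's implementation. It assumes a greedy agglomerative loop with a fixed stopping threshold, and it replaces the multi-stage deep network with a hypothetical hand-written function, `toy_affinity`, that scores two tracklets by the distance between placeholder appearance vectors and rejects pairs that share a frame. The names `Tracklet`, `toy_affinity`, `merge`, `cluster_tracklets`, and the value `threshold=0.5` are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tracklet:
    """A sequence of detections, reduced here to frame indices and one
    placeholder appearance vector (an assumption for illustration)."""
    frames: List[int]
    features: List[float]


def toy_affinity(a: Tracklet, b: Tracklet) -> float:
    """Hypothetical stand-in for a learned affinity model.

    Returns a score in [0, 1]. Tracklets that share a frame cannot belong
    to the same person, so they receive an affinity of 0.
    """
    if set(a.frames) & set(b.frames):
        return 0.0
    dist = sum((x - y) ** 2 for x, y in zip(a.features, b.features)) ** 0.5
    return 1.0 / (1.0 + dist)


def merge(a: Tracklet, b: Tracklet) -> Tracklet:
    """Merge two tracklets; features are averaged, weighted by length."""
    na, nb = len(a.frames), len(b.frames)
    feats = [(na * x + nb * y) / (na + nb)
             for x, y in zip(a.features, b.features)]
    return Tracklet(sorted(a.frames + b.frames), feats)


def cluster_tracklets(
    tracklets: List[Tracklet],
    affinity: Callable[[Tracklet, Tracklet], float] = toy_affinity,
    threshold: float = 0.5,  # assumed stopping criterion
) -> List[Tracklet]:
    """Greedy agglomerative merging: repeatedly merge the pair with the
    highest affinity until no pair reaches the threshold."""
    clusters = list(tracklets)
    while len(clusters) > 1:
        best = None  # (score, i, j) of the best pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = affinity(clusters[i], clusters[j])
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score < threshold:
            break
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Calling `cluster_tracklets` on a list of `Tracklet` objects returns the merged trajectories. In a real system the `affinity` argument would be a learned model over both appearance and spatio-temporal cues, which is where a tracklet-level network such as the one described above would plug in.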