Multiple Object Tracking from appearance by hierarchically clustering tracklets
This work addresses the problem of tracking multiple objects in videos, for applications such as surveillance and sports analysis, by making object appearance the primary cue for association.
The paper tackles multiple object tracking by using object appearance as the primary cue for association, with spatial and temporal priors as weights. It achieves competitive results on the MOT17 and MOT20 benchmarks and state-of-the-art results on DanceTrack.
Current approaches to Multiple Object Tracking (MOT) rely on the spatio-temporal coherence between detections, combined with object appearance, to match objects across consecutive frames. In this work, we explore MOT using object appearance as the main source of association between objects in a video, with spatial and temporal priors as weighting factors. We form initial tracklets by leveraging the idea that instances of an object that are close in time should be similar in appearance, and we build the final object tracks by fusing the tracklets in a hierarchical fashion. We conduct extensive experiments that show the effectiveness of our method on three MOT benchmarks: MOT17, MOT20, and DanceTrack. The method is competitive on MOT17 and MOT20 and establishes state-of-the-art results on DanceTrack.
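To make the idea of fusing tracklets hierarchically more concrete, the following is a minimal sketch of greedy agglomerative merging driven by appearance. It is not the authors' implementation: the function names, the dictionary layout, the cosine distance, the multiplicative temporal weight `1 + gap / gap_scale`, and the threshold `max_cost` are all illustrative assumptions. The only ideas taken from the abstract are that appearance is the main association cue, that a temporal prior acts as a weight, and that tracklets are merged step by step.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two 1-D feature vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)


def merge_tracklets(tracklets, max_cost=0.3, gap_scale=100.0):
    """Greedily merge tracklets by appearance (hypothetical sketch).

    tracklets: list of dicts with
        'frames': iterable of frame indices covered by the tracklet
        'feat':   1-D appearance embedding (e.g. mean over detections)
    Returns a list of tracks, each a sorted list of input tracklet indices.
    """
    clusters = [
        {"frames": set(t["frames"]),
         "feat": np.asarray(t["feat"], dtype=float),
         "members": [i]}
        for i, t in enumerate(tracklets)
    ]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                # Assumption: one object cannot occupy the same frame twice,
                # so tracklets that share a frame are never merged.
                if a["frames"] & b["frames"]:
                    continue
                # Temporal gap in frames between the two tracklets.
                gap = max(min(b["frames"]) - max(a["frames"]),
                          min(a["frames"]) - max(b["frames"]), 0)
                # Appearance distance, weighted up for distant tracklets.
                cost = cosine_distance(a["feat"], b["feat"]) * (1.0 + gap / gap_scale)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        # Stop when no admissible pair is similar enough.
        if best is None or best[0] > max_cost:
            break
        _, i, j = best
        a, b = clusters[i], clusters[j]
        na, nb = len(a["frames"]), len(b["frames"])
        merged = {
            "frames": a["frames"] | b["frames"],
            # Length-weighted mean keeps the merged embedding representative.
            "feat": (na * a["feat"] + nb * b["feat"]) / (na + nb),
            "members": a["members"] + b["members"],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c["members"]) for c in clusters]
```

Calling `merge_tracklets` with a list of tracklet dictionaries returns groups of tracklet indices, one group per final track. Merging the cheapest pair first means the most confident associations are fixed early, and their pooled appearance then informs the harder decisions. A spatial prior could enter the cost in the same multiplicative way, but that is left out here to keep the sketch short.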