Spatial-Temporal Relation Networks for Multi-Object Tracking
This work addresses the challenge of integrating multiple cues in a unified network for MOT. The contribution is incremental in that it builds on existing methods, but it simplifies training and improves performance.
The paper tackles the problem of combining heterogeneous cues such as appearance, location, and topology into a robust similarity score for multiple object tracking, and reports state-of-the-art accuracy on the MOT15-17 benchmarks under public-detection, online settings.
Recent progress in multiple object tracking (MOT) has shown that a robust similarity score is key to the success of trackers. A good similarity score is expected to reflect multiple cues, e.g. appearance, location, and topology, over a long period of time. However, these cues are heterogeneous, which makes them hard to combine in a unified network. As a result, existing methods usually encode them in separate networks or require a complex training approach. In this paper, we present a unified framework for similarity measurement that simultaneously encodes various cues and performs reasoning across both the spatial and temporal domains. We also study the feature representation of a tracklet-object pair in depth, showing that a proper design of the pair features can substantially strengthen the tracker. The resulting approach is named spatial-temporal relation networks (STRN). It runs in a feed-forward way and can be trained in an end-to-end manner. State-of-the-art accuracy is achieved on all of the MOT15-17 benchmarks using public detection and online settings.
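To make the notion of a tracklet-object similarity score concrete, the following is a minimal hand-crafted sketch, not the STRN model itself. It assumes a tracklet is a dictionary holding per-frame appearance vectors and boxes, averages appearance similarity over the tracklet's history, and mixes it with the overlap between the latest tracklet box and the detection box. The function names, the dictionary layout, and the fixed weights `w_app` and `w_loc` are all illustrative assumptions; STRN learns the combination end to end and also uses a topology cue, which is omitted here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pair_similarity(tracklet, detection, w_app=0.5, w_loc=0.5):
    """Hypothetical hand-weighted score for one tracklet-object pair.

    Appearance cue: mean cosine similarity over the tracklet's history
    (a crude stand-in for temporal reasoning).
    Location cue: IoU between the tracklet's latest box and the detection.
    The weights are illustrative constants, not learned parameters.
    """
    sims = [cosine_similarity(f, detection["feature"])
            for f in tracklet["features"]]
    appearance = sum(sims) / len(sims) if sims else 0.0
    location = iou(tracklet["boxes"][-1], detection["box"])
    return w_app * appearance + w_loc * location


# Toy data: one tracklet observed over two frames, and two candidate detections.
tracklet = {
    "features": [[1.0, 0.0], [1.0, 0.0]],
    "boxes": [(0, 0, 10, 10), (0, 0, 10, 10)],
}
same_object = {"feature": [1.0, 0.0], "box": (0, 0, 10, 10)}
other_object = {"feature": [0.0, 1.0], "box": (50, 50, 60, 60)}

score_same = pair_similarity(tracklet, same_object)
score_other = pair_similarity(tracklet, other_object)
```

In an online tracker, scores like these would typically fill a tracklet-by-detection matrix that is then passed to an assignment step. A fixed weighted sum such as this one does not adapt to context, which is the kind of limitation a learned, unified similarity network is meant to address.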