MOT FCG++: Enhanced Representation of Spatio-temporal Motion and Appearance Features
This work addresses robust feature representation for tracking objects across frames in computer vision. Its contribution is incremental, since it builds on an existing baseline model.
The paper tackles the representation of spatio-temporal motion and appearance features in multi-object tracking by proposing enhancements to the MOT FCG method, reaching 63.1 HOTA, 76.9 MOTA, and 78.2 IDF1 on the MOT17 test set.
The goal of multi-object tracking (MOT) is to detect and track all objects in a scene across frames while maintaining a unique identity for each object. Most existing methods rely on the spatial-temporal motion features and appearance embedding features of the detected objects in consecutive frames. Effectively and robustly representing the spatial-temporal motion and appearance features of long trajectories has become a critical factor affecting the performance of MOT. We propose a novel approach for appearance and spatial-temporal motion feature representation, improving upon the hierarchical clustering association method MOT FCG. For spatial-temporal motion features, we first propose Diagonal Modulated GIoU, which more accurately represents the relationship between the position and shape of the objects. Second, we propose Mean Constant Velocity Modeling to reduce the effect of observation noise on target motion state estimation. For appearance features, we utilize a dynamic appearance representation that incorporates confidence information, making the trajectory appearance features more robust and global. Building on the baseline model MOT FCG, we realize further performance improvements: we achieve 63.1 HOTA, 76.9 MOTA, and 78.2 IDF1 on the MOT17 test set, and also achieve competitive performance on the MOT20 and DanceTrack sets.
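The abstract does not give formulas for its three components, so the following Python sketch only illustrates the general ideas they build on and is not the authors' formulation. It shows the standard GIoU (without the diagonal modulation term, which is not specified here), a constant-velocity estimate averaged over a short window of past positions, and a confidence-weighted moving average of appearance embeddings. All function names, the `window` and `base_alpha` parameters, and the exact weighting rule are assumptions made for illustration.

```python
import numpy as np


def giou(box_a, box_b):
    """Standard GIoU for two boxes in (x1, y1, x2, y2) format.

    This is the plain GIoU, not the paper's Diagonal Modulated variant.
    Both boxes are assumed to have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def mean_velocity(centers, window=5):
    """Average per-frame displacement over the last `window` steps.

    Averaging several frame-to-frame displacements damps the noise of any
    single observation (hypothetical stand-in for the paper's modeling).
    """
    pts = np.asarray(centers, dtype=float)[-(window + 1):]
    if len(pts) < 2:
        return np.zeros(2)
    return np.diff(pts, axis=0).mean(axis=0)


def update_appearance(track_emb, det_emb, det_conf, base_alpha=0.9):
    """Confidence-weighted moving average of appearance embeddings.

    Low-confidence detections push alpha toward 1, so they change the
    track embedding less (assumed weighting rule, for illustration only).
    """
    alpha = base_alpha + (1.0 - base_alpha) * (1.0 - det_conf)
    emb = alpha * np.asarray(track_emb, dtype=float) \
        + (1.0 - alpha) * np.asarray(det_emb, dtype=float)
    return emb / np.linalg.norm(emb)
```

In this sketch, a detection with confidence 0 leaves the track embedding unchanged, while a detection with confidence 1 contributes with weight `1 - base_alpha`.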