Multi-Object Tracking as Attention Mechanism
This work addresses the need for efficient real-time tracking in computer vision applications, offering a novel approach that reduces overhead and maintains robustness as instance count increases.
The paper tackles the problem of high computational cost in multi-object tracking (MOT) by proposing TicrossNet, a simple and fast end-to-end model that eliminates traditional modules like Kalman filters and achieves real-time performance with 32.6 FPS on MOT17 and 31.0 FPS on MOT20, handling over 100 instances per frame.
We propose a conceptually simple, and therefore fast, multi-object tracking (MOT) model that requires no attached modules such as a Kalman filter, the Hungarian algorithm, transformer blocks, or graph networks. Conventional MOT models are built from these multi-step modules, so their computational cost is high. Our proposed end-to-end MOT model, \textit{TicrossNet}, consists only of a base detector and a cross-attention module. As a result, the tracking overhead does not increase significantly even as the number of instances ($N_t$) grows. We show that TicrossNet runs \textit{in real time}: it achieves 32.6 FPS on MOT17 and 31.0 FPS on MOT20 (Tesla V100), the latter of which contains more than 100 instances per frame. We also demonstrate that TicrossNet is robust to $N_t$; thus, it does not need to change the size of its base detector according to $N_t$, as other models often do to maintain real-time processing.
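To make the idea of tracking by cross-attention concrete, the following is a minimal sketch of how instances in the current frame could be associated with instances in the previous frame using a single attention-style affinity matrix. It is an illustration under stated assumptions, not the TicrossNet implementation: the function name `cross_attention_association`, the cosine-similarity logits, the two-directional softmax, and the parameters `temperature` and `min_score` are all hypothetical choices made for this example. The per-instance embeddings are assumed to come from the base detector.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_association(curr_emb, prev_emb, temperature=0.1, min_score=0.5):
    """Associate current-frame instances with previous-frame instances.

    Hypothetical sketch (not the authors' method):
      curr_emb: (N_t, D) embeddings of instances in frame t
      prev_emb: (N_prev, D) embeddings of instances in frame t-1
    Returns the (N_t, N_prev) affinity matrix and, for each current
    instance, the index of its matched previous instance or -1 if no
    match clears `min_score` (treated as a new track).
    """
    n_curr, n_prev = len(curr_emb), len(prev_emb)
    if n_curr == 0 or n_prev == 0:
        return np.zeros((n_curr, n_prev)), np.full(n_curr, -1, dtype=int)

    # Queries come from frame t, keys from frame t-1 (cosine similarity).
    q = curr_emb / np.linalg.norm(curr_emb, axis=1, keepdims=True)
    k = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    logits = (q @ k.T) / temperature

    # Assumption: normalize over both rows and columns so that a pair
    # scores highly only if each instance prefers the other.
    affinity = softmax(logits, axis=1) * softmax(logits, axis=0)

    # Row-wise argmax replaces an explicit assignment solver in this sketch.
    best_prev = affinity.argmax(axis=1)
    best_score = affinity[np.arange(n_curr), best_prev]
    matches = np.where(best_score >= min_score, best_prev, -1)
    return affinity, matches
```

In this sketch the association step reduces to one $N_t \times N_{t-1}$ matrix product followed by elementwise operations, which is the kind of structure that keeps tracking overhead small relative to the detector as $N_t$ grows. A current-frame instance whose best affinity falls below `min_score` is left unmatched and would start a new track.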