DecoderTracker: Decoder-Only Method for Multiple-Object Tracking
This work addresses the problem of slow inference and training times in multi-object tracking for applications like video analysis, though it is incremental as it builds on existing decoder-only and transformer methods.
The paper tackled the computational inefficiency and optimization challenges of existing transformer-based multi-object tracking methods by proposing DecoderTracker, which achieved 2 to 3 times faster inference than MOTR while outperforming it on multiple benchmarks.
Decoder-only methods, such as GPT, have demonstrated superior performance in many areas compared to traditional encoder-decoder transformer methods. Over the years, end-to-end methods based on the traditional transformer structure, like MOTR, have achieved remarkable performance in multi-object tracking. However, the substantial computational resource consumption of these methods, coupled with the optimization challenges posed by dynamic data, results in slow inference and long training times. To address these issues, this paper optimizes the network architecture and proposes an effective training strategy to mitigate prolonged training times, yielding DecoderTracker, a novel end-to-end tracking method. To tackle the optimization challenges arising from dynamic data, this paper further introduces DecoderTracker+, which incorporates a Fixed-Size Query Memory and refines certain attention layers. Our methods, without any bells and whistles, outperform MOTR on multiple benchmarks while running inference 2 to 3 times faster than MOTR. The code is open source and available at https://github.com/liaopan-lp/MO-YOLO.