Efficient Training for Visual Tracking with Deformable Transformer
This work addresses efficiency issues in visual object tracking for real-world applications, representing an incremental improvement over existing methods.
The paper tackles resource-intensive training and inference in Transformer-based visual tracking models by introducing DETRack, which pairs a deformable transformer decoder with novel training techniques to reduce GFLOPs and training epochs. It achieves 72.9% AO on GOT-10k with only 20% of the baseline's training epochs.
Recent Transformer-based visual tracking models have showcased superior performance. Nevertheless, prior works have been resource-intensive, requiring long GPU training hours and incurring high GFLOPs during inference, owing to inefficient training methods and convolution-based target heads. This intensive resource use renders them unsuitable for real-world applications. In this paper, we present DETRack, a streamlined end-to-end visual object tracking framework. Our framework uses an efficient encoder-decoder structure in which the deformable transformer decoder, acting as a target head, achieves higher sparsity than traditional convolution heads, resulting in decreased GFLOPs. For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, which significantly accelerate the model's convergence. Comprehensive experiments affirm the effectiveness and efficiency of our proposed method. For instance, DETRack achieves 72.9% AO on the challenging GOT-10k benchmark using only 20% of the training epochs required by the baseline, and runs with lower GFLOPs than all other transformer-based trackers.
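The abstract names two training ideas, one-to-many label assignment and auxiliary denoising, without giving their details. The sketch below illustrates the general flavor of both under stated assumptions; it is not DETRack's actual implementation. It assumes boxes in (x1, y1, x2, y2) format, a top-k-by-IoU rule for choosing positive predictions, and uniform coordinate jitter for the denoising inputs. The function names `one_to_many_assign` and `make_denoising_boxes` and the parameters `k`, `num`, and `scale` are hypothetical.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def one_to_many_assign(pred_boxes, gt_box, k=3):
    """Assumed rule: mark the k predictions with the highest IoU to the
    single tracked target as positives, instead of only the best one."""
    ranked = sorted(
        range(len(pred_boxes)),
        key=lambda i: iou(pred_boxes[i], gt_box),
        reverse=True,
    )
    return ranked[:k]


def make_denoising_boxes(gt_box, num=4, scale=0.2, rng=None):
    """Assumed scheme: jitter each ground-truth coordinate by up to
    `scale` times the box width or height. Each noised box would be fed
    to the decoder as an extra query whose regression target is gt_box.
    Keeping scale below 0.5 guarantees x1 < x2 and y1 < y2."""
    rng = rng or random.Random(0)
    w = gt_box[2] - gt_box[0]
    h = gt_box[3] - gt_box[1]
    noised = []
    for _ in range(num):
        x1 = gt_box[0] + rng.uniform(-scale, scale) * w
        y1 = gt_box[1] + rng.uniform(-scale, scale) * h
        x2 = gt_box[2] + rng.uniform(-scale, scale) * w
        y2 = gt_box[3] + rng.uniform(-scale, scale) * h
        noised.append((x1, y1, x2, y2))
    return noised
```

With a target box of `(0, 0, 10, 10)` and predictions `[(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 9, 9)]`, calling `one_to_many_assign(..., k=2)` selects indices 0 and 3, the two predictions that overlap the target most. Supervising several good predictions per target, together with the easier reconstruct-the-box task posed by the noised inputs, is the kind of denser training signal that such techniques rely on to speed up convergence.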