CV Sep 6, 2023

Efficient Training for Visual Tracking with Deformable Transformer

arXiv:2309.02676v12.84 citations h-index: 4

Originality: Incremental advance

AI Analysis

This work addresses efficiency issues in visual object tracking for real-world applications, representing an incremental improvement over existing methods.

The paper tackles resource-intensive training and inference in Transformer-based visual tracking models. It introduces DETRack, which uses a deformable transformer decoder and new training techniques to reduce GFLOPs and training epochs, achieving 72.9% AO on GOT-10k with only 20% of the baseline's training epochs.

Recent Transformer-based visual tracking models have showcased superior performance. Nevertheless, prior works have been resource-intensive, requiring prolonged GPU training hours and incurring high GFLOPs during inference due to inefficient training methods and convolution-based target heads. This intensive resource use renders them unsuitable for real-world applications. In this paper, we present DETRack, a streamlined end-to-end visual object tracking framework. Our framework utilizes an efficient encoder-decoder structure in which the deformable transformer decoder, acting as a target head, achieves higher sparsity than traditional convolution heads, resulting in decreased GFLOPs. For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique, significantly accelerating the model's convergence. Comprehensive experiments affirm the effectiveness and efficiency of our proposed method. For instance, DETRack achieves 72.9% AO on the challenging GOT-10k benchmark using only 20% of the training epochs required by the baseline, and runs with lower GFLOPs than all other Transformer-based trackers.
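For readers unfamiliar with the AO (average overlap) figure quoted above, the following is a minimal sketch of how such a score can be computed for one sequence, as the mean intersection-over-union between predicted and ground-truth boxes. It is not taken from the paper or from the GOT-10k toolkit: the function names `iou_xywh` and `average_overlap`, and the `(x, y, w, h)` box layout, are assumptions made for illustration, and the official benchmark may aggregate across sequences differently.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)


def iou_xywh(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes in (x, y, w, h) form."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean per-frame IoU over one sequence (simplified, hypothetical AO)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same number of frames")
    if not preds:
        return 0.0
    return sum(iou_xywh(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 2.0, 2.0)]
    print(f"AO = {average_overlap(predictions, ground_truth):.3f}")
```

Under this simplified definition, an AO of 72.9% would mean the predicted boxes overlap the ground truth with an average IoU of about 0.729.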
