CV · Feb 27, 2024

ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking

arXiv:2403.07914

Originality: Incremental advance

AI Analysis

This work aims to improve spatio-temporal modeling in visual object tracking. The advance is incremental: it builds on existing pre-trained models and adds lightweight additive components on top of them.

The paper tackles efficient modeling of spatio-temporal relations in visual object tracking. It introduces ACTrack, a framework that pairs a frozen pre-trained Transformer backbone with a trainable lightweight additive net, and reports a balance between training efficiency and tracking performance on several benchmarks.

Efficiently modeling spatio-temporal relations of objects is a key challenge in visual object tracking (VOT). Existing methods track by appearance-based similarity or long-term relation modeling, so the rich temporal context between consecutive frames is easily overlooked. Moreover, training trackers from scratch or fine-tuning large pre-trained models requires more time and memory. In this paper, we present ACTrack, a new tracking framework with additive spatio-temporal conditions. It preserves the quality and capabilities of the pre-trained Transformer backbone by freezing its parameters, and adds a trainable lightweight additive net to model spatio-temporal relations in tracking. We design an additive siamese convolutional network to ensure the integrity of spatial features and perform temporal sequence modeling to simplify the tracking pipeline. Experimental results on several benchmarks show that ACTrack balances training efficiency and tracking performance.
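The core idea in the abstract, a frozen backbone whose output is adjusted by a small trainable additive branch, can be illustrated with a toy sketch. This is not ACTrack's architecture: the paper uses a pre-trained Transformer backbone and an additive siamese convolutional network, whereas the example below substitutes plain linear maps in pure Python. The class name `AdditiveConditionedModel`, the zero initialization of the additive branch, and the squared-error loss are all assumptions made for illustration.

```python
def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class AdditiveConditionedModel:
    """Toy model (hypothetical): output = frozen_backbone(x) + additive_branch(x)."""

    def __init__(self, frozen_weights):
        # Stand-in for the pre-trained backbone: copied once, never updated.
        self.frozen = [row[:] for row in frozen_weights]
        rows, cols = len(frozen_weights), len(frozen_weights[0])
        # Assumption: the additive branch starts at zero, so the initial
        # output is exactly the backbone's output.
        self.additive = [[0.0] * cols for _ in range(rows)]

    def forward(self, x):
        base = matvec(self.frozen, x)
        delta = matvec(self.additive, x)
        return [b + d for b, d in zip(base, delta)]

    def train_step(self, x, target, lr=0.1):
        """One gradient step on 0.5 * ||forward(x) - target||^2.

        Only the additive weights are updated; the frozen weights are untouched.
        Returns the loss measured before the update.
        """
        err = [p - t for p, t in zip(self.forward(x), target)]
        for i, e in enumerate(err):
            for j, xj in enumerate(x):
                # d(loss)/d(additive[i][j]) = err[i] * x[j]
                self.additive[i][j] -= lr * e * xj
        return 0.5 * sum(e * e for e in err)


if __name__ == "__main__":
    model = AdditiveConditionedModel([[1.0, 0.0], [0.0, 1.0]])
    features, target = [1.0, 2.0], [2.0, 0.0]
    for step in range(5):
        print(step, model.train_step(features, target))
```

Because gradients only flow into the additive weights, the trainable parameter count and optimizer state are limited to the small branch, which is the kind of training-cost saving the abstract describes.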
