CV
Mar 10, 2022

Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking

arXiv:2203.05328v2
307 citations
h-index: 98
Originality: Incremental advance
AI Analysis

This work addresses the need for more general and efficient tracking systems by reducing architectural complexity, though it is incremental as it builds on existing transformer-based approaches.

The paper simplifies visual object tracking architectures by proposing SimTrack, which uses a single transformer backbone for joint feature extraction and interaction, eliminating the need for customized interaction modules. It improves its baseline by 2.5% AUC on LaSOT and 2.6% AUC on TNL2K, and its results are competitive with specialized tracking methods.

Exploiting a general-purpose neural architecture to replace hand-wired designs or inductive biases has recently drawn extensive interest. However, existing tracking approaches rely on customized sub-modules and need prior knowledge for architecture selection, hindering the tracking development in a more general system. This paper presents a Simplified Tracking architecture (SimTrack) by leveraging a transformer backbone for joint feature extraction and interaction. Unlike existing Siamese trackers, we serialize the input images and concatenate them directly before the one-branch backbone. Feature interaction in the backbone helps to remove well-designed interaction modules and produce a more efficient and effective framework. To reduce the information loss from down-sampling in vision transformers, we further propose a foveal window strategy, providing more diverse input patches with acceptable computational costs. Our SimTrack improves the baseline with 2.5%/2.6% AUC gains on LaSOT/TNL2K and gets results competitive with other specialized tracking algorithms without bells and whistles.
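
The input construction described in the abstract (serializing both images into patches, adding a foveal window, and concatenating everything before a one-branch backbone) might be sketched roughly as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names (`patchify`, `center_crop`, `build_joint_sequence`), the image sizes, the patch size, and the foveal crop size are all hypothetical, and the patch embedding, positional encoding, and transformer backbone are omitted.

```python
import numpy as np


def patchify(image, patch):
    """Serialize an (H, W, C) image into non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c)


def center_crop(image, size):
    """Crop a size x size window from the center of an (H, W, C) image."""
    h, w, _ = image.shape
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]


def build_joint_sequence(template, search, patch=16, foveal_size=64):
    """Concatenate template, foveal, and search tokens into one sequence.

    Assumption: the foveal window is a smaller center crop of the template,
    serialized with the same patch size. Because the crop origin need not
    align with the template's patch grid, its patches can differ from the
    regular template patches.
    """
    template_tokens = patchify(template, patch)
    foveal_tokens = patchify(center_crop(template, foveal_size), patch)
    search_tokens = patchify(search, patch)
    # A single (one-branch) backbone would consume this joint sequence, so
    # attention can mix template and search tokens at every layer.
    return np.concatenate([template_tokens, foveal_tokens, search_tokens], axis=0)


rng = np.random.default_rng(0)
template = rng.random((112, 112, 3))  # hypothetical template size
search = rng.random((224, 224, 3))    # hypothetical search-region size
tokens = build_joint_sequence(template, search)
print(tokens.shape)  # (261, 768): 49 template + 16 foveal + 196 search tokens
```

With these assumed sizes the foveal crop starts at pixel 24, which is offset by half a patch from the template's 16-pixel grid, so the extra tokens cover the target center with different patch boundaries at the cost of only 16 additional tokens.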

Code Implementations: 1 repo