CV | Jul 29, 2025

Decoupled Spatio-Temporal Consistency Learning for Self-Supervised Tracking

arXiv:2507.21606v1 | 47 citations | h-index: 15 | Has Code | AAAI
Originality: Highly original
AI Analysis

This work addresses the limited scale and diversity of tracking datasets by offering researchers and practitioners a self-supervised alternative to manual annotation.

The paper reduces reliance on manual box annotations in visual tracking by proposing a self-supervised framework that learns tracking representations through decoupled spatio-temporal consistency training and an instance contrastive loss. It reports improvements of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on GOT10K, LaSOT, and TrackingNet, respectively, over state-of-the-art self-supervised tracking methods.
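For readers unfamiliar with the AUC and AO scores cited for these benchmarks, the sketch below shows how such overlap-based metrics are commonly computed. By convention, GOT10K reports average overlap (AO), while LaSOT and TrackingNet report the area under the success curve (AUC). The function names, the `(x, y, w, h)` box format, and the 21 evenly spaced thresholds are assumptions made for illustration; they are not taken from the paper or from any benchmark toolkit.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred_boxes, gt_boxes):
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(overlaps) / len(overlaps)


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """AUC of the success plot.

    For each IoU threshold in [0, 1], the success rate is the fraction of
    frames whose IoU exceeds that threshold; the AUC is the mean success
    rate over the thresholds (assumed here: 21 evenly spaced values).
    """
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy two-frame sequence: one perfect prediction, one shifted by half a box.
    gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
    pred = [(0, 0, 10, 10), (5, 0, 10, 10)]
    print("AO: ", average_overlap(pred, gt))
    print("AUC:", success_auc(pred, gt))
```

Both metrics need ground-truth boxes only at evaluation time, so a tracker trained without box annotations can still be scored this way.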

The success of visual tracking has been largely driven by datasets with manual box annotations. However, these box annotations require tremendous human effort, limiting the scale and diversity of existing tracking datasets. In this work, we present a novel Self-Supervised Tracking framework named SSTrack, designed to eliminate the need for box annotations. Specifically, a decoupled spatio-temporal consistency training framework is proposed to learn rich target information across timestamps through global spatial localization and local temporal association. This allows for the simulation of appearance and motion variations of instances in real-world scenarios. Furthermore, an instance contrastive loss is designed to learn instance-level correspondences from a multi-view perspective, offering robust instance supervision without additional labels. This new design paradigm enables SSTrack to effectively learn generic tracking representations in a self-supervised manner, while reducing reliance on extensive box annotations. Extensive experiments on nine benchmark datasets demonstrate that SSTrack surpasses SOTA self-supervised tracking methods, achieving an improvement of more than 25.3%, 20.4%, and 14.8% in AUC (AO) score on the GOT10K, LaSOT, and TrackingNet datasets, respectively. Code: https://github.com/GXNU-ZhongLab/SSTrack.
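The abstract does not give the exact form of the instance contrastive loss. As a rough illustration of the general idea (pulling together embeddings of the same instance seen from two views and pushing apart embeddings of different instances), the sketch below implements a generic InfoNCE-style loss in plain Python. It is a stand-in technique, not the paper's formulation; the function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss over two views of the same set of instances.

    view_a[i] and view_b[i] are embeddings of instance i from two different
    views (the positive pair); every view_b[j] with j != i is a negative.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature
                  for cand in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching instance as the target class.
        total += log_denom - logits[i]
    return total / len(a)


if __name__ == "__main__":
    # Toy embeddings: two instances, each seen from two views.
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[0.9, 0.1], [0.1, 0.9]]      # same instance order as view_a
    mismatched = [[0.1, 0.9], [0.9, 0.1]]   # instance order swapped
    print("matched views:   ", info_nce(view_a, matched))
    print("mismatched views:", info_nce(view_a, mismatched))
```

The loss is small when each embedding is most similar to its own counterpart in the other view and large when the correspondences are wrong, which is the sense in which such a loss can supervise instance-level matching without box labels.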

