CV · Sep 6, 2023

Towards Efficient Training with Negative Samples in Visual Tracking

arXiv:2309.02903v1 · 1 citation · h-index: 4
Originality: Incremental advance
AI Analysis

It addresses efficiency and overfitting issues for visual tracking researchers and practitioners, offering an incremental improvement to training strategies.

This paper tackles the problem of overfitting and high computational costs in visual object tracking by introducing a training strategy that balances negative and positive samples from the start, achieving 75.8% AO on GOT-10k and 84.1% AUC on TrackingNet with half the training data of previous methods.

Current state-of-the-art (SOTA) methods in visual object tracking often require extensive computational resources and vast amounts of training data, leading to a risk of overfitting. This study introduces a more efficient training strategy to mitigate overfitting and reduce computational requirements. We balance the training process with a mix of negative and positive samples from the outset, a strategy we call Joint learning with Negative samples (JN). Negative samples are cases where the object from the template is not present in the search region. They prevent the model from simply memorizing the target and instead encourage it to use the template to locate the object. To handle the negative samples effectively, we adopt a distribution-based head, which models the bounding box as a distribution of distances. This lets the model express uncertainty about the target's location in the presence of negative samples and offers an efficient way to manage mixed-sample training. Furthermore, our approach introduces a target-indicating token that encapsulates the target's precise location within the template image. This token provides exact boundary details and improves performance at negligible computational cost. Our model, JN-256, exhibits superior performance on challenging benchmarks, achieving 75.8% AO on GOT-10k and 84.1% AUC on TrackingNet. Notably, JN-256 outperforms previous SOTA trackers that use larger models and higher input resolutions, even though it is trained with only half as many data samples as those works use.
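The two ideas in the abstract, mixing negative pairs into training and reading a box side as a distribution of distances, can be illustrated with a small sketch. This is not the authors' implementation. The function names (`sample_pair`, `decode_side`), the 0.5 negative ratio, the choice of drawing negatives from a different sequence, and the discretized softmax over evenly spaced distance bins are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def sample_pair(sequences, neg_ratio=0.5, rng=random):
    """Pick a (template, search, is_positive) training pair.

    Assumption: a negative pair takes the search region from a different
    sequence, so the template object is absent from it.
    """
    template_seq = rng.choice(sequences)
    if len(sequences) > 1 and rng.random() < neg_ratio:
        others = [s for s in sequences if s != template_seq]
        return template_seq, rng.choice(others), False
    return template_seq, template_seq, True


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_side(logits, max_dist=1.0):
    """Decode one box side from per-bin logits.

    Assumption: distances are discretized into evenly spaced bins in
    [0, max_dist]. Returns the expected distance and the entropy of the
    distribution, where high entropy signals an uncertain location.
    """
    probs = softmax(logits)
    n = len(probs)
    bins = [max_dist * i / (n - 1) for i in range(n)]
    expected = sum(p * b for p, b in zip(probs, bins))
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return expected, entropy
```

In this sketch a sharply peaked set of logits gives a confident distance with near-zero entropy, while flat logits, as one might hope to see on a negative pair, give maximal entropy. How the actual head is supervised on negative samples is not stated in the abstract and is left open here.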

