CVFeb 18

Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

arXiv:2602.16160v11 citationsh-index: 15
Originality Incremental advance
AI Analysis

This work addresses computational cost for video tracking applications, offering an incremental improvement by enabling dynamic depth adaptation without modifying the underlying network architecture.

The paper tackles the computational inefficiency of transformer-based visual trackers by proposing UncL-STARK, which dynamically adapts encoder and decoder depths at inference time based on uncertainty estimates, achieving up to 12% GFLOPs reduction, 8.9% latency reduction, and 10.8% energy savings while maintaining tracking accuracy within 0.2% of the baseline.

Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder--decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12\% GFLOPs reduction, 8.9\% latency reduction, and 10.8\% energy savings while maintaining tracking accuracy within 0.2\% of the full-depth baseline across both short-term and long-term sequences.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes