CV | CL | Feb 27

UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen

arXiv:2602.23734v1 | 2.82 citations | h-index: 24 | Has Code

Originality: Highly original

AI Analysis

This work targets real-time deployment of visual object trackers, offering researchers and practitioners a more efficient method that preserves near-baseline accuracy. The contribution is incremental in that it builds on existing token pruning approaches.

The paper tackles the computational inefficiency of one-stream Transformer-based visual trackers by introducing UTPTrack, a unified token pruning framework that jointly compresses all three components (search region, dynamic template, and static template). It prunes 65.4% of vision tokens in RGB tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively.

One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT-NLP/UTPTrack.
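The abstract describes an attention-guided, token type-aware pruning strategy but does not give the scoring rule or the per-component budgets. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes each token is scored by the mean attention it receives from all tokens (so search and template tokens influence one another's scores), and that each token type has its own keep ratio. The function name `prune_tokens_jointly`, the type labels, and the ratios are all hypothetical.

```python
import numpy as np

# Hypothetical type labels for the three components named in the abstract.
STATIC_TEMPLATE, DYNAMIC_TEMPLATE, SEARCH = 0, 1, 2


def prune_tokens_jointly(attn, token_types, keep_ratio):
    """Select tokens to keep using attention scores computed over all components.

    attn:        (N, N) attention matrix, rows are queries, columns are keys.
    token_types: (N,) integer array with one type label per token.
    keep_ratio:  dict mapping a type label to the fraction of tokens to keep
                 (assumed values; the paper's budgets are not stated here).
    Returns the sorted indices of the tokens that are kept.
    """
    # Assumption: importance of a token = mean attention it receives from
    # every token, regardless of which component the query belongs to.
    scores = attn.mean(axis=0)

    kept = []
    for label, ratio in keep_ratio.items():
        idx = np.flatnonzero(token_types == label)
        # Keep at least one token per type so no component disappears.
        k = max(1, int(round(ratio * idx.size)))
        # Highest-scoring tokens within this type.
        order = np.argsort(scores[idx])[::-1]
        kept.append(idx[order[:k]])
    return np.sort(np.concatenate(kept))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    types = np.array([STATIC_TEMPLATE] * 4 + [DYNAMIC_TEMPLATE] * 4 + [SEARCH] * 16)
    logits = rng.normal(size=(types.size, types.size))
    # Row-wise softmax to get a valid attention matrix.
    attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    keep = prune_tokens_jointly(
        attn, types, {STATIC_TEMPLATE: 0.5, DYNAMIC_TEMPLATE: 0.5, SEARCH: 0.25}
    )
    print(f"kept {keep.size} of {types.size} tokens:", keep)
```

Scoring over the full attention matrix, rather than within each component separately, is what makes the selection joint in this sketch. The per-type budgets are one simple way to make it type-aware.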

