Learning an Adaptive and View-Invariant Vision Transformer for Real-Time UAV Tracking
This work addresses real-time tracking for unmanned aerial vehicles; its contribution is an incremental improvement in efficiency for a domain-specific application.
The authors tackle real-time UAV tracking by proposing AVTrack, an adaptive computation framework that selectively activates transformer blocks and learns view-invariant representations. They also propose AVTrack-MD, an enhanced version trained with multi-teacher knowledge distillation, which boosts average tracking speed by over 17% while maintaining performance comparable to AVTrack.
Transformer-based models have improved visual tracking, but most still cannot run in real time on resource-limited devices, which is a particular obstacle for unmanned aerial vehicle (UAV) tracking. To achieve a better balance between performance and efficiency, we propose AVTrack, an adaptive computation tracking framework that selectively activates transformer blocks through an Activation Module (AM), dynamically adapting the ViT architecture by engaging only the relevant components. To address extreme viewpoint variations, we learn view-invariant representations via mutual information (MI) maximization. In addition, we propose AVTrack-MD, an enhanced tracker incorporating a novel MI maximization-based multi-teacher knowledge distillation framework. Leveraging multiple off-the-shelf AVTrack models as teachers, we maximize the MI between their aggregated softened features and the corresponding softened feature of the student model, improving the generalization and performance of the student, especially under noisy conditions. Extensive experiments show that AVTrack-MD achieves performance comparable to AVTrack while reducing model complexity and boosting average tracking speed by over 17\%. Code is available at: https://github.com/wuyou3474/AVTrack.
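The idea of selectively activating transformer blocks can be illustrated with a minimal sketch. The abstract does not specify how the Activation Module makes its decision, so everything below is an assumption made for illustration: the gate is a hypothetical linear layer followed by a sigmoid, a block runs only when its gate probability reaches a fixed threshold, and the "blocks" are toy functions on a plain feature vector rather than real ViT layers. The names `activation_probability` and `run_adaptive_blocks` are hypothetical and do not come from the AVTrack code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]
Gate = Tuple[Sequence[float], float]  # (weights, bias) of a hypothetical linear gate


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def activation_probability(features: Vector, weights: Sequence[float], bias: float) -> float:
    """Assumed gate: linear projection of the features followed by a sigmoid."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(score)


def run_adaptive_blocks(
    features: Vector,
    blocks: Sequence[Block],
    gates: Sequence[Gate],
    threshold: float = 0.5,
) -> Tuple[Vector, List[int]]:
    """Apply each block only if its gate probability reaches the threshold.

    Skipped blocks act as the identity, so their computation is saved.
    Returns the final features and the indices of the blocks that ran.
    """
    executed: List[int] = []
    for index, (block, (weights, bias)) in enumerate(zip(blocks, gates)):
        if activation_probability(features, weights, bias) >= threshold:
            features = block(features)
            executed.append(index)
    return features, executed


# Toy stand-ins for transformer blocks (illustrative only).
blocks: List[Block] = [
    lambda v: [a + 1.0 for a in v],
    lambda v: [a * 2.0 for a in v],
    lambda v: [a - 3.0 for a in v],
]
# Gates with zero weights, so the bias alone decides: on, off, on.
gates: List[Gate] = [([0.0, 0.0], 5.0), ([0.0, 0.0], -5.0), ([0.0, 0.0], 5.0)]

output, executed = run_adaptive_blocks([1.0, 2.0], blocks, gates)
print(output, executed)
```

In this toy run the second block is skipped because its gate probability falls below the threshold, so only two of the three blocks contribute to the cost. In a trained tracker the gate parameters would be learned jointly with the backbone, and the hard threshold would typically need a differentiable relaxation during training; neither detail is described in the abstract.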