Local Perception-Aware Transformer for Aerial Tracking
This work addresses the challenge of improving tracking precision for aerial robots, though it appears incremental by modifying existing Transformer methods for a specific domain.
The paper addresses two weaknesses of Transformer-based visual object tracking on aerial robots, insufficient inductive bias and poor modeling of local detail, by introducing a local-recognition encoder with attention and correction mechanisms. The method achieves competitive accuracy and robustness on aerial benchmarks totaling 316 sequences and is validated in real-world tests.
Transformer-based visual object tracking has been widely adopted. However, the Transformer structure lacks sufficient inductive bias. In addition, focusing only on encoding global features harms the modeling of local details, which limits tracking capability on aerial robots. Specifically, with a local-modeling to global-search mechanism, the proposed tracker replaces the global encoder with a novel local-recognition encoder. In this encoder, a local-recognition attention and a local element correction network are carefully designed to reduce interference from redundant global information and to increase local inductive bias. Meanwhile, the latter can precisely model local object details under the aerial view through a detail-inquiry net. The proposed method achieves competitive accuracy and robustness on several authoritative aerial benchmarks with 316 sequences in total. The practicability and efficiency of the proposed tracker have been validated in real-world tests.
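To illustrate the general idea of restricting attention to a local neighbourhood, the following is a minimal sketch of generic windowed self-attention over a feature grid. It is not the paper's local-recognition attention or its local element correction network, whose details are not given here. The function name `local_attention`, the `radius` parameter, and the use of the raw features as queries, keys, and values (no learned projections) are all assumptions made for illustration.

```python
import math


def local_attention(features, height, width, radius=1):
    """Hypothetical helper: self-attention in which each token of a
    height x width grid attends only to tokens within `radius` cells.

    `features` is a row-major list of height * width vectors. The raw
    features serve as queries, keys and values (an assumption; a real
    encoder would use learned projections).
    """
    assert len(features) == height * width
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for i in range(height * width):
        row, col = divmod(i, width)
        # Indices of the tokens inside the local window around token i.
        neighbours = [
            r * width + c
            for r in range(max(0, row - radius), min(height, row + radius + 1))
            for c in range(max(0, col - radius), min(width, col + radius + 1))
        ]
        # Scaled dot-product scores, computed over the window only.
        scores = [
            scale * sum(a * b for a, b in zip(features[i], features[j]))
            for j in neighbours
        ]
        # Numerically stable softmax restricted to the window.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the neighbouring value vectors.
        output.append([
            sum(w * features[j][d] for w, j in zip(weights, neighbours))
            for d in range(dim)
        ])
    return output


# A 1 x 5 grid in which only the last token is non-zero.
feats = [[0.0], [0.0], [0.0], [0.0], [10.0]]
out = local_attention(feats, height=1, width=5, radius=1)
```

With `radius=1`, the first token never sees the distant non-zero token, whereas global attention would mix it in. This is the sense in which a local window suppresses interference from far-away regions, at the cost of a smaller receptive field per layer.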