Architecture and evaluation protocol for transformer-based visual object tracking in UAV applications
This work addresses the problem of real-time visual object tracking for UAV applications, which is crucial for tasks such as surveillance and navigation.
The authors tackled the problem of visual object tracking from Unmanned Aerial Vehicles (UAVs) and achieved consistent improvements in Success and Normalized time to Failure (NT2F) metrics. Their Modular Asynchronous Tracking Architecture (MATA) showed better performance across multiple tracking processing frequencies.
Object tracking from Unmanned Aerial Vehicles (UAVs) is challenged by platform dynamics, camera motion, and limited onboard resources. Existing visual trackers either lack robustness in complex scenarios or are too computationally demanding for real-time embedded use. We propose a Modular Asynchronous Tracking Architecture (MATA) that combines a transformer-based tracker with an Extended Kalman Filter, integrating ego-motion compensation from sparse optical flow and an object trajectory model. We further introduce a hardware-independent, embedded-oriented evaluation protocol and a new metric, Normalized time to Failure (NT2F), which quantifies how long a tracker can sustain a tracking sequence without external help. Experiments on UAV benchmarks, including an augmented UAV123 dataset with synthetic occlusions, show consistent improvements in Success and NT2F metrics across multiple tracking processing frequencies. A ROS 2 implementation on an NVIDIA Jetson AGX Orin confirms that the evaluation protocol more closely matches real-time performance on embedded systems.
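To make the idea of pairing a slow tracker with a filter more concrete, the following is a minimal illustrative sketch, not the MATA implementation. It assumes a linear constant-velocity Kalman filter on a single image axis (a simplification of the Extended Kalman Filter named above), models ego-motion as a plain pixel shift that would in practice be estimated from sparse optical flow, and lets the tracker deliver a measurement only every few frames. The names `AxisKalman`, `ego_shift`, `simulate`, and all noise values are hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis (state: position, velocity).

    Illustrative assumption: a linear filter stands in for the EKF of the abstract.
    """
    pos: float
    vel: float = 0.0
    # Entries of the symmetric 2x2 covariance [[p_pp, p_pv], [p_pv, p_vv]].
    p_pp: float = 10.0
    p_pv: float = 0.0
    p_vv: float = 10.0
    q: float = 1.0  # process noise intensity (assumed value)
    r: float = 4.0  # measurement noise variance in px^2 (assumed value)

    def predict(self, dt: float, ego_shift: float = 0.0) -> None:
        """Propagate the state by dt and compensate for camera-induced image motion."""
        # Object trajectory model plus an ego-motion shift (assumed pure translation).
        self.pos += self.vel * dt + ego_shift
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and white-noise-acceleration Q.
        p_pp = self.p_pp + 2.0 * dt * self.p_pv + dt * dt * self.p_vv + self.q * dt**3 / 3.0
        p_pv = self.p_pv + dt * self.p_vv + self.q * dt**2 / 2.0
        p_vv = self.p_vv + self.q * dt
        self.p_pp, self.p_pv, self.p_vv = p_pp, p_pv, p_vv

    def update(self, z: float) -> None:
        """Fuse a position measurement z, e.g. a box centre reported by the tracker."""
        s = self.p_pp + self.r          # innovation variance
        k_p = self.p_pp / s             # Kalman gain for position
        k_v = self.p_pv / s             # Kalman gain for velocity
        innovation = z - self.pos
        self.pos += k_p * innovation
        self.vel += k_v * innovation
        # P <- (I - K H) P with H = [1, 0].
        p_pp = (1.0 - k_p) * self.p_pp
        p_pv = (1.0 - k_p) * self.p_pv
        p_vv = self.p_vv - k_v * self.p_pv
        self.p_pp, self.p_pv, self.p_vv = p_pp, p_pv, p_vv


def simulate(frames: int = 30, update_every: int = 3, speed: float = 2.0):
    """Follow a target moving at constant speed while the tracker reports only
    every `update_every` frames; the filter predicts on every frame."""
    kf = AxisKalman(pos=0.0)
    truth = 0.0
    for t in range(1, frames + 1):
        truth = speed * t
        kf.predict(dt=1.0)              # runs at the camera frame rate
        if t % update_every == 0:
            kf.update(truth)            # noiseless measurement, for illustration only
    return kf, truth


if __name__ == "__main__":
    kf, truth = simulate()
    print(f"estimate={kf.pos:.2f} truth={truth:.2f} velocity={kf.vel:.2f}")
```

In this sketch the prediction step runs on every frame while the measurement update runs at a lower rate, which is one simple way to picture tracking at multiple processing frequencies. A full system would track both image axes and the box size, and would replace the scalar `ego_shift` with a motion estimate derived from optical flow.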