RT-OVAD: Real-Time Open-Vocabulary Aerial Object Detection via Image-Text Collaboration
This addresses the need for real-time, open-vocabulary detection in aerial scenes, enabling broader applications in surveillance and monitoring, though it builds incrementally on existing open-vocabulary detection approaches.
The paper tackles the problem of limited applicability in aerial object detection by extending it to open scenarios through image-text collaboration, proposing RT-OVAD, which achieves state-of-the-art performance with improvements of up to 7.0 mAP and runs at 34 FPS, three times faster than prior methods.
Aerial object detection plays a crucial role in numerous applications. However, most existing methods focus on detecting predefined object categories, limiting their applicability in real-world open scenarios. In this paper, we extend aerial object detection to open scenarios through image-text collaboration and propose RT-OVAD, the first real-time open-vocabulary detector for aerial scenes. Specifically, we first introduce an image-to-text alignment loss to replace the conventional category regression loss, thereby eliminating category constraints. Next, we propose a lightweight image-text collaboration strategy comprising an image-text collaboration encoder and a text-guided decoder. The encoder simultaneously enhances visual features and refines textual embeddings, while the decoder guides object queries to focus on class-relevant image features. This design further improves detection accuracy without incurring significant computational overhead. Extensive experiments demonstrate that RT-OVAD consistently outperforms existing state-of-the-art methods across open-vocabulary, zero-shot, and traditional closed-set detection tasks. For instance, on the open-vocabulary aerial detection benchmarks DIOR, DOTA-v2.0, and LAE-80C, RT-OVAD achieves 87.7 AP$_{50}$, 53.8 mAP, and 23.7 mAP, respectively, surpassing the previous state-of-the-art (LAE-DINO) by 2.2, 7.0, and 3.5 points. In addition, RT-OVAD achieves an inference speed of 34 FPS on an RTX 4090 GPU, approximately three times faster than LAE-DINO (10 FPS), meeting the real-time detection requirements of diverse applications. The code will be released at https://github.com/GT-Wei/RT-OVAD.