CV | Oct 4, 2025

Cross-View Open-Vocabulary Object Detection in Aerial Imagery

arXiv:2510.03858v13.61 citations | h-index: 5

Originality: Incremental advance

AI Analysis

The approach enables more flexible and scalable object detection for aerial imagery applications, though it is an incremental adaptation of existing methods to a new domain.

The paper adapts open-vocabulary object detection from ground-view to aerial imagery using a framework that combines contrastive image-to-image alignment with multi-instance vocabulary associations. In the zero-shot setting it reports gains of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD over fine-tuned, closed-vocabulary, dataset-specific models.

Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling models to identify unseen classes without explicit training. Leveraging pretrained models contrastively trained on abundantly available ground-view image-text classification pairs provides a strong foundation for open-vocabulary object detection in aerial imagery. Domain shifts, viewpoint variations, and extreme scale differences make direct knowledge transfer across domains ineffective, requiring specialized adaptation strategies. In this paper, we propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery through structured domain alignment. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings and employs multi-instance vocabulary associations to align aerial images with text embeddings. Extensive experiments on the xView, DOTAv2, VisDrone, DIOR, and HRRSD datasets are used to validate our approach. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting when compared to finetuned closed-vocabulary dataset-specific model performance, thus paving the way for more flexible and scalable object detection systems in aerial applications.
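The abstract does not spell out the exact loss used for contrastive image-to-image alignment. As a rough illustration only, the sketch below assumes a symmetric InfoNCE objective (the kind used in CLIP-style training) over paired aerial and ground-view embeddings. The function names `l2_normalize` and `info_nce`, the temperature of 0.07, and the pairing scheme are assumptions for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(aerial, ground, temperature=0.07):
    """Symmetric InfoNCE loss (assumed objective, not the paper's stated one).

    aerial[i] and ground[i] are treated as a matched pair; every other
    combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in aerial]
    g = [l2_normalize(v) for v in ground]
    n = len(a)

    # Temperature-scaled cosine similarity between every aerial/ground pair.
    logits = [
        [sum(x * y for x, y in zip(a[i], g[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the aerial-to-ground and ground-to-aerial directions.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))
```

Minimizing a loss of this shape pulls each aerial embedding toward its paired ground-view embedding and away from the other ground-view embeddings in the batch, which is one common way to "enhance the similarity" between two views. Correctly paired embeddings yield a loss near zero, while mismatched pairs yield a large loss.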
