CVOct 4, 2025

Cross-View Open-Vocabulary Object Detection in Aerial Imagery

arXiv:2510.03858v11 citationsh-index: 5
Originality Incremental advance
AI Analysis

This enables more flexible and scalable object detection for aerial imagery applications, though it is an incremental adaptation of existing methods to a new domain.

The paper tackles the problem of adapting open-vocabulary object detection from ground-view to aerial imagery by proposing a framework with contrastive image-to-image alignment and multi-instance vocabulary associations, achieving improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone, and +3.46 mAP on HRRSD in zero-shot settings.

Traditional object detection models are typically trained on a fixed set of classes, limiting their flexibility and making it costly to incorporate new categories. Open-vocabulary object detection addresses this limitation by enabling models to identify unseen classes without explicit training. Leveraging pretrained models contrastively trained on abundantly available ground-view image-text classification pairs provides a strong foundation for open-vocabulary object detection in aerial imagery. Domain shifts, viewpoint variations, and extreme scale differences make direct knowledge transfer across domains ineffective, requiring specialized adaptation strategies. In this paper, we propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery through structured domain alignment. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings and employs multi-instance vocabulary associations to align aerial images with text embeddings. Extensive experiments on the xView, DOTAv2, VisDrone, DIOR, and HRRSD datasets are used to validate our approach. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting when compared to finetuned closed-vocabulary dataset-specific model performance, thus paving the way for more flexible and scalable object detection systems in aerial applications.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes