Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images
This work addresses unstable category specification in open-vocabulary object detection for remote sensing, which matters for applications such as environmental monitoring. The contribution is incremental: it builds on existing open-vocabulary methods by adding multimodal prompting.
The paper tackles unreliable category specification in open-vocabulary object detection for remote sensing images. It proposes RS-MPOD, a framework that specifies categories through visual prompts, textual prompts, or both. Experiments on standard, cross-dataset, and fine-grained benchmarks show that visual prompting gives more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting remains competitive when textual semantics are well aligned.
Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
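The three prompting modes described above (visual-only, text-only, and multimodal) can be illustrated with a minimal sketch. This is not the RS-MPOD implementation: the abstract does not specify the encoder or fusion architecture, so the sketch assumes precomputed embeddings, replaces the visual prompt encoder with a simple mean over exemplar embeddings, and replaces the fusion module with a weighted sum. The function names, the `alpha` weight, and the toy vectors are all hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def visual_prompt(exemplar_embeddings):
    """Assumed stand-in for a visual prompt encoder: average the
    normalized exemplar embeddings into one category prototype."""
    dim = len(exemplar_embeddings[0])
    normed = [l2_normalize(e) for e in exemplar_embeddings]
    mean = [sum(e[i] for e in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)

def fuse_prompts(visual=None, text=None, alpha=0.5):
    """Assumed stand-in for a fusion module: use whichever modality is
    available, or a weighted sum (alpha on the visual side) when both are."""
    if visual is None and text is None:
        raise ValueError("at least one prompt is required")
    if text is None:
        return l2_normalize(visual)      # text-free category specification
    if visual is None:
        return l2_normalize(text)        # conventional text-only prompting
    v, t = l2_normalize(visual), l2_normalize(text)
    return l2_normalize([alpha * a + (1 - alpha) * b for a, b in zip(v, t)])

def score_regions(region_features, query):
    """Cosine similarity between each candidate region and the category query."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(l2_normalize(r), q)) for r in region_features]

# Toy 3-d embeddings (hypothetical values, for illustration only).
exemplars = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]   # two exemplar instances
text_embedding = [0.0, 1.0, 0.0]                 # a poorly aligned text query
regions = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]     # two candidate regions

proto = visual_prompt(exemplars)
visual_scores = score_regions(regions, fuse_prompts(visual=proto))
text_scores = score_regions(regions, fuse_prompts(text=text_embedding))
fused_scores = score_regions(regions, fuse_prompts(proto, text_embedding))
```

In this toy setup the text embedding is deliberately misaligned with the exemplars, so the first region scores higher under the visual prompt than under the text prompt, which mirrors the kind of situation the abstract describes. A learned encoder and fusion module would replace the averaging and weighted sum used here.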