MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing
This work addresses the problem of accurately localizing objects in remote sensing images from textual descriptions. The contribution is incremental: it builds on existing methods and adds a novel architecture.
The paper tackles visual grounding in remote sensing imagery by integrating object detection and visual grounding into a unified framework, achieving significant improvements over state-of-the-art methods on the OPT-RSVG and DIOR-RSVG datasets.
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final localization of the referred object. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: \url{https://github.com/rd20karim/MB-ORES}.
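
The abstract does not spell out how the soft selection step is computed. As an illustration only, the sketch below assumes one plausible reading: the reasoning network emits one score per proposal, the scores are normalized with a softmax, and the final box is the probability-weighted average of the proposal boxes. The names `softmax` and `soft_select`, the `(x1, y1, x2, y2)` box convention, and the use of a softmax are all assumptions made for this example, not details taken from the paper or its repository.

```python
import math
from typing import List, Sequence

# Assumed box convention for this sketch: (x1, y1, x2, y2).
Box = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw per-proposal scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(boxes: Sequence[Box], scores: Sequence[float]) -> List[float]:
    """Hypothetical soft selection: probability-weighted average of boxes.

    Unlike a hard argmax over proposals, this keeps the output a smooth
    function of the scores, which is what makes a selection step usable
    inside gradient-based training.
    """
    if not boxes or len(boxes) != len(scores):
        raise ValueError("boxes and scores must be non-empty and equal length")
    probs = softmax(scores)
    return [sum(p * box[i] for p, box in zip(probs, boxes)) for i in range(4)]


if __name__ == "__main__":
    proposals = [[0.0, 0.0, 10.0, 10.0], [10.0, 10.0, 20.0, 20.0]]
    # Equal scores give each proposal the same weight.
    print(soft_select(proposals, [0.0, 0.0]))
```

When one score dominates, the weighted average approaches that proposal's box, so the soft selection behaves like an argmax in the confident case while remaining differentiable elsewhere.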