Composed Image Retrieval for Remote Sensing
This work addresses a gap in remote sensing image retrieval by making queries more flexible for users in fields such as environmental monitoring and urban planning, though the contribution is incremental in that it adapts an existing paradigm to a new domain.
The paper introduces composed image retrieval to remote sensing, enabling queries using image examples combined with textual descriptions to modify attributes like shape or color, and demonstrates that a vision-language model achieves state-of-the-art results without additional training.
This work introduces composed image retrieval to remote sensing. It allows a large image archive to be queried with image examples accompanied by a textual description, enriching the descriptive power over unimodal queries, whether visual or textual. The textual part can modify various attributes, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power, so that no further learning step or training data is necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir
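
To make the idea of fusing image-to-image and text-to-image similarity concrete, the following is a minimal sketch of training-free score fusion. It is not the paper's method: the abstract does not specify the fusion rule, so the min-max normalization, the convex combination, and the weight `lam` are assumptions chosen for illustration. The function names (`fused_scores`, `retrieve`) are hypothetical, and the embeddings are assumed to come from some vision-language model that maps images and text into a shared space.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def minmax(scores):
    """Rescale scores to [0, 1] so the two modalities are comparable."""
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def fused_scores(query_img, query_txt, archive, lam=0.5):
    """Combine image-to-image and text-to-image similarity (assumed rule).

    query_img: (d,) embedding of the query image
    query_txt: (d,) embedding of the modifier text
    archive:   (n, d) embeddings of the archive images
    lam:       weight of the text term; 0 = image only, 1 = text only
    """
    archive = l2_normalize(np.asarray(archive, dtype=float))
    s_img = archive @ l2_normalize(np.asarray(query_img, dtype=float))
    s_txt = archive @ l2_normalize(np.asarray(query_txt, dtype=float))
    # Convex combination of normalized scores: an illustrative assumption.
    return (1.0 - lam) * minmax(s_img) + lam * minmax(s_txt)


def retrieve(query_img, query_txt, archive, lam=0.5, k=5):
    """Return indices of the k archive images with the highest fused score."""
    scores = fused_scores(query_img, query_txt, archive, lam)
    return np.argsort(-scores)[:k]


# Toy 2-D embeddings standing in for real model outputs.
archive = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_img = np.array([1.0, 0.0])
query_txt = np.array([0.0, 1.0])
print(retrieve(query_img, query_txt, archive, lam=0.5, k=3))
```

In this toy setup, the archive item that resembles both the query image and the modifier text ranks first when the two terms are weighted equally, while setting `lam` to 0 or 1 falls back to purely visual or purely textual retrieval.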