A Multimodal Approach for Cross-Domain Image Retrieval
This work addresses the problem of matching images across different visual domains, such as sketches and photographs. It offers computer vision researchers a novel unsupervised approach that reduces reliance on labeled data.
The paper tackles cross-domain image retrieval by introducing an unsupervised method that uses generated image captions as a domain-agnostic representation, achieving state-of-the-art performance with improvements of 24.0% on Office-Home and 132.2% on DomainNet over previous methods.
Cross-Domain Image Retrieval (CDIR) is a challenging task in computer vision, aiming to match images across different visual domains such as sketches, paintings, and photographs. Traditional approaches focus on visual image features and rely heavily on supervised learning with labeled data and cross-domain correspondences, and they often struggle with the significant domain gap. This paper introduces a novel unsupervised approach to CDIR that incorporates textual context by leveraging pre-trained vision-language models. Our method, dubbed Caption-Matching (CM), uses generated image captions as a domain-agnostic intermediate representation, enabling effective cross-domain similarity computation without the need for labeled data or fine-tuning. We evaluate our method on standard CDIR benchmark datasets, demonstrating state-of-the-art performance in unsupervised settings with improvements of 24.0% on Office-Home and 132.2% on DomainNet over previous methods. We also demonstrate our method's effectiveness on a dataset of AI-generated images from Midjourney, showcasing its ability to handle complex, multi-domain queries.
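The core idea of matching images through their captions can be illustrated with a minimal sketch. This is not the paper's implementation: the captions below are hypothetical strings standing in for the output of a pre-trained captioning model, and `embed_caption` is a toy bag-of-words encoder standing in for whatever text representation the method actually uses. The function names (`embed_caption`, `cosine_similarity`, `retrieve`) are illustrative assumptions only. The sketch shows that, once every image is reduced to text, a sketch query and a photo gallery can be compared in the same space without labels or fine-tuning.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed_caption(caption: str) -> Counter:
    """Toy stand-in for a text encoder: lowercase bag-of-words counts.

    ASSUMPTION: a real system would use a pre-trained text encoder here.
    """
    return Counter(caption.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty caption matches nothing
    return dot / (norm_a * norm_b)


def retrieve(
    query_caption: str,
    gallery_captions: Dict[str, str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank gallery images by caption similarity to the query caption."""
    query_vec = embed_caption(query_caption)
    scored = [
        (image_id, cosine_similarity(query_vec, embed_caption(caption)))
        for image_id, caption in gallery_captions.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical captions, as if generated for photos in the gallery domain.
gallery = {
    "photo_001": "a red bicycle leaning against a brick wall",
    "photo_002": "a black cat sleeping on a sofa",
    "photo_003": "a wooden chair next to a desk",
}

# Hypothetical caption, as if generated for a query image in the sketch domain.
sketch_query_caption = "a pencil sketch of a bicycle against a wall"

results = retrieve(sketch_query_caption, gallery, top_k=2)
for image_id, score in results:
    print(image_id, round(score, 3))
```

In this toy setup the bicycle photo ranks first because its caption shares the most content words with the sketch's caption, even though the two images would look very different at the pixel level. Word counts are a crude similarity measure, so a practical version would likely swap in learned text embeddings.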