CV | Aug 17, 2025

CLAIR: CLIP-Aided Weakly Supervised Zero-Shot Cross-Domain Image Retrieval

arXiv:2508.12290v1 | h-index: 4
Originality: Incremental advance
AI Analysis

This work addresses image retrieval across domains with limited supervision, offering incremental improvements in a specific application area.

The paper tackles the problem of weakly supervised zero-shot cross-domain image retrieval using noisy pseudo-labels from CLIP, proposing CLAIR to refine labels and align features across domains, achieving superior performance on datasets like TUBerlin and Sketchy.

The recent growth of large foundation models that can easily generate pseudo-labels for huge quantities of unlabeled data makes unsupervised Zero-Shot Cross-Domain Image Retrieval (UZS-CDIR) less relevant. In this paper, we therefore turn our attention to weakly supervised ZS-CDIR (WSZS-CDIR) with noisy pseudo-labels generated by large foundation models such as CLIP. To this end, we propose CLAIR to refine the noisy pseudo-labels with a confidence score from the similarity between the CLIP text and image features. Furthermore, we design inter-instance and inter-cluster contrastive losses to encode images into a class-aware latent space, and an inter-domain contrastive loss to alleviate domain discrepancies. We also learn a novel cross-domain mapping function in closed form, using only CLIP text embeddings to project image features from one domain to another, thereby further aligning the image features for retrieval. Finally, we enhance the zero-shot generalization ability of CLAIR to handle novel categories by introducing an extra set of learnable prompts. Extensive experiments are carried out using the TUBerlin, Sketchy, Quickdraw, and DomainNet zero-shot datasets, where CLAIR consistently shows superior performance compared to existing state-of-the-art methods.
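To make the pseudo-labeling step concrete, here is a minimal sketch of how a confidence score could be derived from the similarity between text and image features. It is not the paper's formulation, which the abstract does not spell out: the softmax over cosine similarities, the `temperature` value, and the function names are all assumptions for illustration, and the inputs are expected to be precomputed CLIP embeddings (one row per image, one row per class prompt).

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def pseudo_labels_with_confidence(image_feats, text_feats, temperature=0.01):
    """Assign each image the best-matching class and a confidence in [0, 1].

    image_feats: (num_images, dim) array of image embeddings.
    text_feats:  (num_classes, dim) array of class-prompt embeddings.
    temperature: softmax temperature (assumed value, not from the paper).
    """
    img = l2_normalize(np.asarray(image_feats, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_feats, dtype=np.float64))

    # Cosine similarity between every image and every class prompt.
    sims = img @ txt.T

    # Numerically stable softmax over classes.
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)

    labels = probs.argmax(axis=1)
    confidence = probs[np.arange(len(labels)), labels]
    return labels, confidence
```

Under these assumptions, an image whose embedding sits close to a single class prompt receives a confidence near 1, while an image equidistant from all prompts receives roughly 1 / num_classes. A refinement step could then down-weight or relabel the low-confidence samples.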

