You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
This work aims to let users pose ever-finer queries in image retrieval, although the contribution may appear incremental because it builds on existing pre-trained CLIP models.
The paper tackles fine-grained image retrieval by combining the sketch and text modalities, achieving precise retrievals that incorporate attributes such as colour and contextual cues, which sketches alone could not express.
Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks, sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper, we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text, orchestrating a duet between the two. The end result enables precise retrievals previously unattainable, allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose, we introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models, while eliminating the need for extensive fine-grained textual descriptions. Last but not least, our system extends to novel applications in composed image retrieval, domain attribute transfer, and fine-grained generation, providing solutions for various real-world scenarios.
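To make the idea of a sketch and text "duet" concrete, the following is a minimal sketch of one simple way to compose two query embeddings and rank a gallery by cosine similarity. It is not the paper's compositionality framework, whose details the abstract does not give; it is a plain weighted late-fusion baseline. The function names (`compose_query`, `rank_gallery`), the mixing weight `alpha`, and the toy 3-dimensional vectors (standing in for embeddings that a pre-trained CLIP encoder would produce) are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def compose_query(sketch_emb, text_emb, alpha=0.5):
    """Hypothetical late fusion: weighted sum of the two unit-length embeddings.

    alpha weights the sketch; (1 - alpha) weights the text. This is an
    illustrative baseline, not the method proposed in the paper.
    """
    s = l2_normalize(sketch_emb)
    t = l2_normalize(text_emb)
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(s, t)]
    return l2_normalize(fused)


def rank_gallery(query, gallery):
    """Return gallery keys sorted by descending cosine similarity to the query."""
    scores = {
        name: sum(q * g for q, g in zip(query, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed): axis 0 ~ "shoe shape", axis 1 ~ "red", axis 2 ~ "other".
sketch_emb = [1.0, 0.0, 0.0]  # the sketch conveys shape only
text_emb = [0.0, 1.0, 0.0]    # the text adds the colour attribute
gallery = {
    "red_shoe": [1.0, 1.0, 0.0],
    "blue_shoe": [1.0, 0.0, 1.0],
    "red_bag": [0.0, 1.0, 1.0],
}

query = compose_query(sketch_emb, text_emb)
ranking = rank_gallery(query, gallery)
print(ranking[0])
```

In this toy setup the sketch alone cannot separate the two shoes and the text alone cannot separate the two red items; only the composed query matches a single gallery entry on both cues, which is the intuition behind pairing the modalities.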