CV · LG · Oct 19, 2022

Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval

arXiv:2210.10486v1 · 7 citations · h-index: 50 · Has Code
Originality: Highly original
AI Analysis

This paper addresses the problem of improving retrieval accuracy in cross-modal computer vision tasks. It represents an incremental advance built on a novel method.

The paper tackles fine-grained sketch-based image retrieval by proposing a cross-attention framework that fuses modality-specific information from photos and sketches, achieving state-of-the-art results on benchmarks like Shoe-V2, Chair-V2, and Sketchy.
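To make the fusion idea concrete, here is a minimal sketch of cross-attention between photo tokens and sketch tokens. It is not the XModalViT implementation: it assumes plain scaled dot-product attention with no learned projections, a residual sum, and mean pooling, and the names `cross_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Let each query token (one modality) attend over the other modality's tokens.

    Assumption: no learned query/key/value projections, unlike a real ViT block.
    """
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def fuse(photo_tokens, sketch_tokens):
    """Hypothetical fusion: cross-attend in both directions, add residuals, mean-pool."""
    photo_ctx = cross_attention(photo_tokens, sketch_tokens, sketch_tokens)
    sketch_ctx = cross_attention(sketch_tokens, photo_tokens, photo_tokens)
    enriched = [
        [t + c for t, c in zip(tok, ctx)]
        for tok, ctx in zip(photo_tokens + sketch_tokens, photo_ctx + sketch_ctx)
    ]
    dim = len(enriched[0])
    return [sum(tok[j] for tok in enriched) / len(enriched) for j in range(dim)]


# Toy usage: two photo tokens and one sketch token, each of dimension 2.
fused = fuse([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
```

The point of the sketch is only that the fused vector depends on both inputs at once, which is why such a network cannot be used directly for retrieval, where photo and sketch must be encoded separately.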

Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.
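The distillation step can be illustrated with two generic losses. This is a sketch under stated assumptions, not the paper's exact objectives: the contrastive term is a standard InfoNCE loss pulling each single-modality (student) embedding toward the fused (teacher) embedding of the same pair, and the relational term matches pairwise distances within a batch. The function names and the `temperature` default are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(student, teacher, temperature=0.1):
    """InfoNCE: student[i] should be closest to teacher[i] among all teacher rows."""
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, si in enumerate(s):
        logits = [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(s)


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embeddings in the batch."""
    return [[math.dist(a, b) for b in embeddings] for a in embeddings]


def relational_loss(student, teacher):
    """Mean squared difference between student and teacher distance matrices."""
    ds = pairwise_distances(student)
    dt = pairwise_distances(teacher)
    n = len(student)
    return sum((ds[i][j] - dt[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Toy usage: fused teacher embeddings and a photo-only student for a batch of two.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.9, 0.1], [0.2, 0.8]]
loss = contrastive_loss(student, teacher) + relational_loss(student, teacher)
```

The contrastive term aligns individual instances, while the relational term preserves the geometry of the fused space. Once trained, the photo and sketch students can embed their inputs independently, which is what retrieval requires.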

Code Implementations: 1 repo
