CV · AI · IR · Sep 14, 2022

Transformers and CNNs both Beat Humans on SBIR

arXiv:2209.06629v1 · 5 citations · h-index: 42
Originality: Incremental advance
AI Analysis

This work improves SBIR, a task with broad applications in image retrieval, by introducing the first models to outperform humans on a large-scale benchmark, though the advance is incremental: it refines existing methods.

The paper tackles sketch-based image retrieval (SBIR) by addressing a persistent invariance to horizontal flip that harms performance. It also shows that vision transformers outperform CNNs by a large margin, achieving a recall of 62.25% (at k = 1) on the Sketchy benchmark, compared with 46.2% for previous state-of-the-art methods.

Sketch-based image retrieval (SBIR) is the task of retrieving natural images (photos) that match the semantics and the spatial configuration of hand-drawn sketch queries. The universality of sketches extends the scope of possible applications and increases the demand for efficient SBIR solutions. In this paper, we study classic triplet-based SBIR solutions and show that a persistent invariance to horizontal flip (even after model finetuning) harms performance. To overcome this limitation, we propose several approaches and evaluate each of them in depth to check its effectiveness. Our main contributions are twofold: (1) we propose and evaluate several intuitive modifications to build SBIR solutions with better flip equivariance; (2) we show that vision transformers are better suited to the SBIR task, and that they outperform CNNs by a large margin. We carried out numerous experiments and introduce the first models to outperform human performance on a large-scale SBIR benchmark (Sketchy). Our best model achieves a recall of 62.25% (at k = 1) on the Sketchy benchmark, compared with 46.2% for previous state-of-the-art methods.
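The abstract uses three ideas without spelling them out: a triplet objective, recall at k, and invariance to horizontal flip. The NumPy sketch below illustrates each one under stated assumptions and is not the authors' implementation. The function names, the Euclidean distance, the margin of 0.2, and the cosine-similarity measure of flip invariance are all hypothetical choices made for illustration.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on Euclidean distances, averaged over the batch.

    anchor: sketch embeddings, positive: matching photo embeddings,
    negative: non-matching photo embeddings, each of shape (N, D).
    The margin value is an assumption, not taken from the paper.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))


def recall_at_k(sketch_emb, photo_emb, target_idx, k=1):
    """Fraction of sketches whose matching photo is among the k nearest photos."""
    # Pairwise Euclidean distances, shape (num_sketches, num_photos).
    diffs = sketch_emb[:, None, :] - photo_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Indices of the k closest photos for every sketch.
    top_k = np.argsort(dists, axis=1)[:, :k]
    hits = (top_k == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())


def flip_invariance(embed, images):
    """Mean cosine similarity between embeddings of images and their mirrors.

    `embed` is any callable mapping an (N, H, W) array to (N, D) embeddings.
    A value near 1 means the embedding ignores horizontal flips.
    """
    a = embed(images)
    b = embed(images[:, :, ::-1])  # reverse the width axis
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(np.sum(a * b, axis=1) / norms))


if __name__ == "__main__":
    photos = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    sketches = np.array([[0.1, 0.0], [0.9, 0.1], [0.9, 0.2]])
    # Sketch i is assumed to match photo i; the third sketch lands near photo 1.
    print("recall@1:", recall_at_k(sketches, photos, [0, 1, 2], k=1))

    images = np.arange(24, dtype=float).reshape(2, 3, 4)
    # Global pooling discards spatial layout, so mirroring changes nothing.
    pooled = lambda x: np.stack([x.mean(axis=(1, 2)), x.max(axis=(1, 2))], axis=1)
    print("flip invariance (pooled):", flip_invariance(pooled, images))
```

A flip-invariant embedding is a problem for SBIR because a sketch and its mirror image should retrieve different photos when spatial configuration matters. This is why the abstract argues for flip equivariance rather than invariance.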

