CV, AI | Oct 20, 2022

General Image Descriptors for Open World Image Retrieval using ViT CLIP

arXiv:2210.11141v1 | 3 citations | h-index: 21
Originality: Synthesis-oriented
AI Analysis

This work addresses a fundamental computer vision problem, open-world image retrieval, with applications such as search engines and e-commerce. The contribution is incremental: it builds on existing CLIP models and adds a set of fine-tuning tricks.

The authors tackled the problem of multi-domain image retrieval in the wild by fine-tuning zero-shot Vision Transformers pre-trained with CLIP, achieving 4th place in the Google Universal Image Embedding Challenge.

The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc. This is a fundamental computer vision problem with notable applications in image retrieval, search engines and e-commerce. In this work, we explain our 4th place solution to the GUIE Challenge, and our "bag of tricks" to fine-tune zero-shot Vision Transformers (ViT) pre-trained using CLIP.
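To make the retrieval setting concrete, here is a minimal sketch of how embeddings are typically used for retrieval: each image is mapped to a vector, and a query is answered by ranking indexed vectors by cosine similarity. This is an illustration under stated assumptions, not the authors' pipeline. The function names `l2_normalize` and `retrieve`, the 3-dimensional toy vectors, and the labels are all hypothetical; in practice the vectors would come from an image encoder such as a fine-tuned CLIP ViT.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def retrieve(query, index, k=2):
    """Return the k (label, score) pairs most similar to the query embedding."""
    q = l2_normalize(query)
    scored = []
    for label, emb in index.items():
        e = l2_normalize(emb)
        score = sum(a * b for a, b in zip(q, e))  # cosine similarity
        scored.append((score, label))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(label, score) for score, label in scored[:k]]

# Toy index: made-up embeddings standing in for encoder outputs (assumption).
index = {
    "landmark": [0.9, 0.1, 0.0],
    "artwork": [0.1, 0.9, 0.1],
    "food": [0.0, 0.2, 0.9],
}

results = retrieve([0.8, 0.2, 0.1], index, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

A universal embedding in the sense of the challenge would use one encoder for all domains, so a single index like the one above can hold landmarks, artwork, and food together.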

Code Implementations: 1 repo
