CV | AI | Feb 17, 2021

I Want This Product but Different: Multimodal Retrieval with Synthetic Query Expansion

arXiv:2102.08871v2 | 6 citations
AI Analysis

This paper addresses the need for customizable, explainable retrieval systems that accept visual and textual feedback from users. It proposes a new method for a known bottleneck.

The paper tackles the problem of media retrieval using multimodal queries by proposing a SynthTriplet GAN framework that expands queries with synthetic images, achieving state-of-the-art results in multimodal retrieval tasks.

This paper addresses the problem of media retrieval using a multimodal query (a query that combines visual input with additional semantic information given as natural language feedback). We propose a SynthTriplet GAN framework that resolves this task by expanding the multimodal query with a synthetically generated image that captures semantic information from both the image and the text input. We introduce a novel triplet mining method that uses a synthetic image as an anchor to directly optimize the embedding distances between generated and target images. We demonstrate that, apart from the added value of illustrating retrieval with a synthetic image, with a focus on customization and user feedback, the proposed method greatly surpasses other multimodal generation methods and achieves state-of-the-art results in the multimodal retrieval task. We also show that, in contrast to other retrieval methods, our method provides explainable embeddings.
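
The idea of using a synthetic image as a triplet anchor can be illustrated with a minimal sketch. The abstract does not give the exact loss or the mining procedure, so the code below assumes a standard triplet margin loss with Euclidean distance, a hypothetical margin of 0.2, and made-up 2-D embeddings. The function names (`l2_distance`, `triplet_loss`, `retrieve`) are illustrative and are not taken from the paper.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption: the anchor is the embedding of the synthetic image, the
    positive is the target image, and the negative is some other image.
    """
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)

def retrieve(query_embedding, gallery, k=1):
    """Return the k gallery keys whose embeddings are nearest to the query."""
    ranked = sorted(
        gallery,
        key=lambda name: l2_distance(query_embedding, gallery[name]),
    )
    return ranked[:k]

# Toy 2-D embeddings, invented for illustration only.
synthetic_anchor = [0.9, 0.1]  # embedding of the generated image
target_image = [1.0, 0.0]      # embedding of the image the user wants
other_image = [0.0, 1.0]       # embedding of an unrelated image

loss = triplet_loss(synthetic_anchor, target_image, other_image)
gallery = {"target": target_image, "other": other_image}
top = retrieve(synthetic_anchor, gallery)
```

In this toy setup the synthetic anchor already lies much closer to the target than to the unrelated image, so the loss is zero and nearest-neighbour retrieval with the synthetic embedding returns the target. During training, a positive loss would pull the generated image's embedding toward the target and push it away from the negative.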

