CV · Apr 28, 2018

Learning Cross-Modal Deep Embeddings for Multi-Object Image Retrieval using Text and Sketch

arXiv:1804.10819v1 · 25 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for flexible image retrieval systems that can handle multiple objects and input modalities, though it is incremental in improving existing cross-modal approaches.

The paper tackles the problem of multi-object image retrieval using both text and sketch queries by introducing a cross-modal deep network with an attention mechanism to focus on different objects. The proposed method achieves state-of-the-art performance in both single and multiple object retrieval on standard datasets.

In this work we introduce a cross-modal image retrieval system that allows both text and sketch as input modalities for the query. A cross-modal deep network architecture is formulated to jointly model the sketch and text input modalities as well as the image output modality, learning a common embedding between text and images and between sketches and images. In addition, an attention model is used to selectively focus on the different objects of the image, allowing for retrieval with multiple objects in the query. Experiments show that the proposed method performs best in both single- and multiple-object image retrieval on standard datasets.
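To make the idea of a common embedding with attention concrete, here is a minimal sketch in plain Python. It is not the paper's network: the vectors are hand-written stand-ins for learned embeddings, and the function names (`attend`, `rank_images`), the dot-product attention scoring, and the cosine-similarity ranking are all assumptions chosen for illustration. It only shows how a query embedding (from text or a sketch) could weight image regions and rank images in a shared space.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    if na == 0 or nb == 0:
        return 0.0
    return dot(a, b) / (na * nb)

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, regions):
    # Hypothetical attention step: weight each image region by its
    # dot-product affinity to the query, then pool the regions.
    weights = softmax([dot(query, r) for r in regions])
    pooled = [sum(w * r[i] for w, r in zip(weights, regions))
              for i in range(len(query))]
    return pooled, weights

def rank_images(query, gallery):
    # Score every image by cosine similarity between the query and its
    # attention-pooled embedding; highest score first.
    scored = []
    for name, regions in gallery.items():
        pooled, _ = attend(query, regions)
        scored.append((cosine(query, pooled), name))
    return [name for _, name in sorted(scored, reverse=True)]

# Toy gallery: each image is a list of region embeddings (made-up values).
gallery = {
    "dog_and_ball": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "car_only": [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
}
query = [2.0, 0.0, 0.0]  # stand-in for an embedded text or sketch query
ranking = rank_images(query, gallery)
_, weights = attend(query, gallery["dog_and_ball"])
```

In this toy setup the query aligns with the first region of `dog_and_ball`, so that region receives most of the attention weight and the image ranks first. In a real system the embeddings would come from trained text, sketch, and image encoders rather than hand-written vectors.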

