CVJun 1, 2014

Seeing the Big Picture: Deep Embedding with Contextual Evidences

arXiv:1406.0132v121 citations
Originality Incremental advance
AI Analysis

This work addresses a specific bottleneck in image retrieval for computer vision applications, offering an incremental improvement over existing methods.

The paper tackles the problem of false matches in Bag-of-Words image retrieval by defining true matches based on local, regional, and global keypoint similarities and proposes a Deep Embedding framework using CNN features, which significantly improves retrieval accuracy on three benchmark datasets with efficient memory and time usage.

In the Bag-of-Words (BoW) model based image retrieval task, the precision of visual matching plays a critical role in improving retrieval performance. Conventionally, local cues of a keypoint are employed. However, such strategy does not consider the contextual evidences of a keypoint, a problem which would lead to the prevalence of false matches. To address this problem, this paper defines "true match" as a pair of keypoints which are similar on three levels, i.e., local, regional, and global. Then, a principled probabilistic framework is established, which is capable of implicitly integrating discriminative cues from all these feature levels. Specifically, the Convolutional Neural Network (CNN) is employed to extract features from regional and global patches, leading to the so-called "Deep Embedding" framework. CNN has been shown to produce excellent performance on a dozen computer vision tasks such as image classification and detection, but few works have been done on BoW based image retrieval. In this paper, firstly we show that proper pre-processing techniques are necessary for effective usage of CNN feature. Then, in the attempt to fit it into our model, a novel indexing structure called "Deep Indexing" is introduced, which dramatically reduces memory usage. Extensive experiments on three benchmark datasets demonstrate that, the proposed Deep Embedding method greatly promotes the retrieval accuracy when CNN feature is integrated. We show that our method is efficient in terms of both memory and time cost, and compares favorably with the state-of-the-art methods.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes