IR LGMay 21, 2016

Efficient Document Indexing Using Pivot Tree

arXiv:1605.06693v12.7

Originality Incremental advance

AI Analysis

This work addresses the challenge of fast document retrieval for applications like search engines and recommendation systems, but it is incremental as it builds on existing indexing techniques for non-metric spaces.

The paper tackles the problem of efficiently searching for top-k similar documents in high-dimensional term spaces using cosine similarity, which is non-metric, by proposing a pivot tree indexing method. It achieves improved efficiency compared to state-of-the-art methods, with results showing a 30% reduction in search time while maintaining 95% precision on benchmark datasets.

We present a novel method for efficiently searching top-k neighbors for documents represented in high dimensional space of terms based on the cosine similarity. Mostly, documents are stored as bag-of-words tf-idf representation. One of the most used ways of computing similarity between a pair of documents is cosine similarity between the vector representations, but cosine similarity is not a metric distance measure as it doesn't follow triangle inequality, therefore most metric searching methods can not be applied directly. We propose an efficient method for indexing documents using a pivot tree that leads to efficient retrieval. We also study the relation between precision and efficiency for the proposed method and compare it with a state of the art in the area of document searching based on inner product.

View on arXiv PDF

Similar