DSIRJul 25, 2013

Optimal Top-k Document Retrieval

arXiv:1307.6789v21 citations
Originality Incremental advance
AI Analysis

This work provides optimal solutions for document retrieval in information retrieval systems, with applications in search engines and databases, though it is incremental in building on geometric search techniques.

The paper tackles the problem of efficiently retrieving the top-k most relevant documents containing a query pattern from a collection, achieving optimal query time of O(p/log_σn + k) in the static case with linear space, and also addresses dynamic and extended static scenarios with specified time and space bounds.

Let $\mathcal{D}$ be a collection of $D$ documents, which are strings over an alphabet of size $σ$, of total length $n$. We describe a data structure that uses linear space and and reports $k$ most relevant documents that contain a query pattern $P$, which is a string of length $p$, in time $O(p/\log_σn+k)$, which is optimal in the RAM model in the general case where $\lg D = Θ(\log n)$, and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures... [clip] When $\lg D = o(\log n)$, we show how to reduce the space of the data structure from $O(n\log n)$ to $O(n(\logσ+\log D+\log\log n))$ bits... [clip] We also consider the dynamic scenario, where documents can be inserted and deleted from the collection. We obtain linear space and query time $O(p(\log\log n)^2/\log_σn+\log n + k\log\log k)$, whereas insertions and deletions require $O(\log^{1+ε} n)$ time per symbol, for any constant $ε>0$. Finally, we consider an extended static scenario where an extra parameter $par(P,d)$ is defined, and the query must retrieve only documents $d$ such that $par(P,d)\in [τ_1,τ_2]$, where this range is specified at query time. We solve these queries using linear space and $O(p/\log_σn + \log^{1+ε} n + k\log^εn)$ time, for any constant $ε>0$. Our technique is to translate these top-$k$ problems into multidimensional geometric search problems. As an additional bonus, we describe some improvements to those problems.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes