Random mappings designed for commercial search engines
This addresses the problem of scalable document retrieval for search engine applications, offering a practical method that leverages existing commercial infrastructure, but it is incremental as it builds on prior mapping techniques for sparse indexing.
The paper tackles the problem of efficiently indexing high-dimensional document vectors for large-scale retrieval by mapping them to a sparse Hamming cube representation that preserves inner product ordering, enabling use of commercial text-based search engines. The result includes a theoretical analysis with asymptotic and non-asymptotic bounds, verified through simulations, and practical speed-ups demonstrated on real datasets like image color histograms and stock market indices.
We give a practical random mapping that takes any set of documents represented as vectors in Euclidean space and then maps them to a sparse subset of the Hamming cube while retaining ordering of inter-vector inner products. Once represented in the sparse space, it is natural to index documents using commercial text-based search engines which are specialized to take advantage of this sparse and discrete structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, characterizing exact asymptotic behavior and also giving non-asymptotic bounds which we verify through numerical simulations. We balance the theoretical treatment with several practical considerations; these allow substantial speed up of the method. We further illustrate the use of this method on search over two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock market index values.