IR · Jan 9, 2021

Selection of Optimal Parameters in the Fast K-Word Proximity Search Based on Multi-component Key Indexes

arXiv:2101.03327v1

Originality: Synthesis-oriented

AI Analysis

This work aims to improve the efficiency and quality of proximity full-text search for users by optimizing system parameters, particularly for queries with frequently occurring words. It is an incremental improvement to an existing system.

This paper investigates how search performance and search quality depend on the MaxDistance parameter and other parameters of a fast k-word proximity search system. In previous work, the authors showed that additional indexes can reduce the average query execution time by a factor of up to 130 for queries consisting of high-frequency words. Based on an analysis of new experimental results, they propose a new index schema.

Proximity full-text search is commonly implemented in contemporary full-text search systems. Let us assume that the search query is a list of words. It is natural to consider a document relevant if the queried words occur near each other in the document. The proximity factor is even more significant when the query consists of frequently occurring words. Proximity full-text search requires storing information about every occurrence in the documents of every word that the user can search for. For every occurrence of every word in a document, we employ additional indexes to store information about nearby words, that is, the words that occur in the document at a distance from the given word of less than or equal to the MaxDistance parameter. We showed in previous works that these indexes can reduce the average query execution time by a factor of up to 130 for queries that consist of high-frequency words. In this paper, we consider how both the search performance and the search quality depend on the value of MaxDistance and other parameters. The well-known GOV2 text collection is used in the experiments so that the results are reproducible. After analyzing the results of the experiments, we propose a new index schema.

This is a pre-print of a contribution published in Supplementary Proceedings of the XXII International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL 2020), Voronezh, Russia, October 13-16, 2020, P. 336-350, published by CEUR Workshop Proceedings. The final authenticated version is available online at: http://ceur-ws.org/Vol-2790/
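The role of MaxDistance in the additional indexes described in the abstract can be illustrated with a small sketch. This is not the authors' index layout: the two-component key `(word, nearby_word)`, the in-memory dictionary, the whitespace tokenizer, and the names `build_nearby_index` and `find_pairs` are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_nearby_index(docs, max_distance):
    """Hypothetical nearby-word index (illustration only).

    Maps a key (word, nearby_word) to a list of postings
    (doc_id, position_of_word, offset_to_nearby_word), keeping only
    pairs whose distance is at most max_distance.
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().split()  # assumed tokenizer: split on whitespace
        for pos, word in enumerate(words):
            lo = max(0, pos - max_distance)
            hi = min(len(words), pos + max_distance + 1)
            for other_pos in range(lo, hi):
                if other_pos != pos:
                    posting = (doc_id, pos, other_pos - pos)
                    index[(word, words[other_pos])].append(posting)
    return index


def find_pairs(index, first, second):
    """Return all postings where `second` occurs near `first`."""
    return index.get((first, second), [])


docs = {1: "to be or not to be"}
small = build_nearby_index(docs, max_distance=2)
large = build_nearby_index(docs, max_distance=3)

# Number of stored postings for each MaxDistance value.
small_size = sum(len(postings) for postings in small.values())
large_size = sum(len(postings) for postings in large.values())
```

In this sketch, a two-word proximity query is answered by a single key lookup instead of intersecting two long posting lists, which is the intuition behind the speed-up for frequent words. A larger `max_distance` lets the index answer queries with wider gaps but stores more postings (`large_size` exceeds `small_size` here), which is the kind of trade-off the paper studies.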
