The Statistical Dictionary-based String Matching Problem
This work addresses a theoretical problem in string matching and information retrieval, but it appears incremental as it builds on the classic DSM problem with a new formulation.
The paper tackles the Dictionary-based String Matching (DSM) problem by proposing a statistical and information-theoretic formulation called Statistical DSM, which models source and query with statistical descriptions and strings as codes, and analyzes retrieval efficiency as average query cost for long sequences, specifically studying cases for k-grams and prefix-free codes.
In the Dictionary-based String Matching (DSM) problem, a retrieval system has access to a source sequence and stores the position of a certain number of strings in a posting table. When a user inquires the position of a string, the retrieval system, instead of searching in the source sequence directly, relies on the the posting table to answer the query more efficiently. In this paper, the Statistical DSM problem is a proposed as a statistical and information-theoretic formulation of the classic DSM problem in which both the source and the query have a statistical description while the strings stored in the posting sequence are described as a code. Through this formulation, we are able to define the efficiency of the retrieval system as the average cost in answering a users' query in the limit of sufficiently long source sequence. This formulation is used to study the retrieval performance for the case in which (i) all the strings of a given length, referred to as k-grams , and (ii) prefix-free codes.