DS DM IRJan 15, 2013

Various improvements to text fingerprinting

Djamal Belazzougui, Roman Kolpakov, Mathieu Raffinot

arXiv:1301.3488v14 citations

Originality Synthesis-oriented

AI Analysis

This work addresses incremental improvements in text fingerprinting algorithms for computational biology or data compression applications.

The paper tackles the problem of computing all distinct character sets (fingerprints) in substrings of a text and efficiently answering queries about them, presenting new exact and approximate algorithms for three related problems.

Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet Σof size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set {\cal F} of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s_i .. s_j is a maximal location for a fingerprint f in F (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations ins is {\cal L} (it is easy to see that |{\cal L}| \leq n σ). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are named {\em copies}, and the quotient set of {\cal L} according to the copy relation is denoted by {\cal L}_C. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute {\cal F}; (2) given f as a set of distinct characters in Σ, to answer if f represents a fingerprint in {\cal F}; (3) given f, to find all maximal locations of f in s.

View on arXiv PDF

Similar