DSDMIRJan 15, 2013

Various improvements to text fingerprinting

arXiv:1301.3488v14 citations
Originality Synthesis-oriented
AI Analysis

This work addresses incremental improvements in text fingerprinting algorithms for computational biology or data compression applications.

The paper tackles the problem of computing all distinct character sets (fingerprints) in substrings of a text and efficiently answering queries about them, presenting new exact and approximate algorithms for three related problems.

Let s = s_1 .. s_n be a text (or sequence) on a finite alphabet Σof size σ. A fingerprint in s is the set of distinct characters appearing in one of its substrings. The problem considered here is to compute the set {\cal F} of all fingerprints of all substrings of s in order to answer efficiently certain questions on this set. A substring s_i .. s_j is a maximal location for a fingerprint f in F (denoted by <i,j>) if the alphabet of s_i .. s_j is f and s_{i-1}, s_{j+1}, if defined, are not in f. The set of maximal locations ins is {\cal L} (it is easy to see that |{\cal L}| \leq n σ). Two maximal locations <i,j> and <k,l> such that s_i .. s_j = s_k .. s_l are named {\em copies}, and the quotient set of {\cal L} according to the copy relation is denoted by {\cal L}_C. We present new exact and approximate efficient algorithms and data structures for the following three problems: (1) to compute {\cal F}; (2) given f as a set of distinct characters in Σ, to answer if f represents a fingerprint in {\cal F}; (3) given f, to find all maximal locations of f in s.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes