Compressed Index with Construction in Compressed Space
For stringologists, this provides an index with near-optimal space and construction in compressed space, improving over previous methods that required more space during construction.
The paper presents an index for a string s with space O(δ log(n/δ)) that supports pattern search in O(m + (occ+1) log^ε n) time, and can be built in O(n log n) expected time using O(δ log(n/δ)) construction space in a streaming pass. This is the first index achieving such construction space and time guarantees.
Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$ and $δ$ is the string complexity of $s$, a known compression measure. We describe an index on $s$ with $O(δ\log\frac{n}δ)$ space, measured in $O(\log n)$-bit machine words, which can search in $s$ any string of length $m$ in $O(m + (\mathrm{occ} + 1)\log^εn)$ time, where $\mathrm{occ}$ is the number of occurrences and $ε> 0$ is any fixed constant (the big-O in the space bound hides factor $\frac{1}ε$). Crucially, the index can be built in $O(n\log n)$ expected time by one left-to-right pass on the string $s$ in a streaming fashion with $O(δ\log\frac{n}δ)$ construction space. The index does not use the Karp--Rabin fingerprints, and the randomization in the construction time can be eliminated by using deterministic dictionaries instead of hash tables (with a slowdown). The search time matches currently best results and the space is almost optimal (the known optimum is $O(δ\log\frac{n}{δα})$, where $α= \log_σn$ and $σ$ is the alphabet size, and it coincides with $O(δ\log\frac{n}δ)$ when $δ= O(n / α^2)$). This is the first index that can be constructed within such space and with such time guarantees. To avoid uninteresting marginal cases, all above bounds are stated for $δ\ge Ω(\log\log n)$.