Dynamic Grammar-Compressed Self-Index in $delta$-Optimal Space
For practitioners managing highly repetitive dynamic string collections (e.g., versioned documents, genomic databases), this index offers the first solution with optimal space and practical speed.
The authors present the first dynamic grammar-compressed self-index achieving δ-optimal space, supporting efficient locate queries and updates without LCP dependence. On large repetitive datasets, it outperforms existing dynamic indexes by up to 77× on updates and 11× on locate.
A compressed self-index stores a string in compressed form while supporting locate queries without decompression. For highly repetitive strings (arising in web crawls, versioned documents, and genomic collections), static self-indexes can match the $δ$-optimal lower bound of $Ω(δ\log(n \log σ/ (δ\log n)) \log n)$ bits up to constant factors, where $n$ is the string length, $σ$ is the alphabet size, and $δ$ is the substring complexity. Their dynamic counterparts, however, remain scarce: every existing dynamic self-index either fails to attain $δ$-optimal space, pays at least $Θ(\log n)$ time per reported occurrence during locate, or exposes the longest common prefix (LCP) of the text inside its update time. We present the dynamic RR-index, a dynamic grammar-compressed self-index built on the restricted recompression run-length straight-line program (RLSLP). To our knowledge, it is the first dynamic self-index to attain $δ$-optimal space. The index occupies expected $O(δ\log(n \log σ/ (δ\log n)) \log n)$ bits, answers locate queries in expected $O(m + \log m \log^{2} n + \mathit{occ} (\log n / \log \log n))$ time (where $m$ is the pattern length and $\mathit{occ}$ is the number of occurrences), and supports insertions and deletions of a length-$m'$ substring in expected amortized $O(m' \log^{2} n + \log^{3} n)$ time, with no dependence on the LCP. On eleven highly repetitive corpora, including a $37$ GB Wikipedia dump and a $59$ GB human-chromosome collection, the dynamic RR-index is up to $77\times$ faster than the dynamic r-index on updates and up to $11\times$ faster than other dynamic indexes on locate.