Shunsuke Inenaga

54.9DSJun 4

Counting Distinct (Non-)Crossing Substrings in Optimal Time

Haruki Umezaki, Hiroki Shibata, Dominik Köppl et al.

Let $w$ be a string of length $n$. The problem of counting factors crossing a position -- Problem 64 from the textbook ``125 Problems in Text Algorithms'' [Crochemore, Lecroq, and Rytter, 2021] -- asks to count the number $\mathcal{C}(w,k)$ (resp. $\mathcal{N}(w,k)$) of distinct substrings in $w$ that have occurrences containing (resp. not containing) a position $k$ in $w$. The solutions provided in their textbook compute $\mathcal{C}(w,k)$ and $\mathcal{N}(w,k)$ in $O(n)$ time for a single position $k$ in $w$, and thus a direct application would require $O(n^2)$ time for all positions $k = 1, \ldots, n$ in $w$. Their solution is designed for constant-size alphabets. In this paper, we present new algorithms which compute $\mathcal{C}(w,k)$ in $O(n)$ total time for general ordered alphabets, and $\mathcal{N}(w,k)$ in $O(n)$ total time for linearly sortable alphabets,for all positions $k = 1, \ldots, n$ in $w$. We further derive model-dependent optimal bounds by separating the algorithms into preprocessing and linear-time postprocessing: for $\mathcal{C}$ the preprocessing is run reporting, and for $\mathcal{N}$ it is preprocessing based on longest previous non-overlapping factors (LPnF) and longest next factors (LNF). In particular, all values $\mathcal{C}(w,k)$ can be computed in $O(n\log n)$ time over general unordered alphabets in which direct accesses to alphabet characters are restricted to equality tests, and in $O(n\logσ)$ time in the word RAM model, where $σ$ denotes the number of distinct characters occurring in $w$. For $\mathcal{N}(w,k)$, the equality-testing complexity over general unordered alphabets is $Θ(n^2)$. We also show that our upper bounds are optimal for all of the aforementioned alphabet assumptions and computation models.

52.7DSMay 28

On the sensitivity of CDAWG-grammars

Hiroto Fujimaru, Shunsuke Inenaga

The compact directed acyclic word graph (CDAWG) [Blumer et al. 1987] of a string is the minimal compact automaton that recognizes all the suffixes of the string. CDAWGs can be used for various string tasks including text pattern searching, data compression, and pattern discovery. The CDAWG-grammar [Belazzougui & Cunial 2017] is a grammar-based text compression based on the CDAWG, which allows for representing the CDAWG in $O(e)$ space without storing the string, where $e$ denotes the number of CDAWG edges. Let $g$ be the size of the CDAWG-grammar for the input string $T$. We show that the worst-case additive sensitivity of the CDAWG-grammar is lower bounded by $3g-21$ and is upper bounded by $8 g + 4$.

DSApr 16, 2019

c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

Kazuya Tsuruta, Dominik Köppl, Shunsuke Kanda et al.

Given a dynamic set $K$ of $k$ strings of total length $n$ whose characters are drawn from an alphabet of size $σ$, a keyword dictionary is a data structure built on $K$ that provides locate, prefix search, and update operations on $K$. Under the assumption that $α= w / \lg σ$ characters fit into a single machine word $w$, we propose a keyword dictionary that represents $K$ in $n \lg σ+ Θ(k \lg n)$ bits of space, supporting all operations in $O(m / α+ \lg α)$ expected time on an input string of length $m$ in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations.

Shunsuke Inenaga

3 Papers