Shunsuke Inenaga

h-index: 20

6 papers

2,161 citations

6 Papers

10.4 DS Jul 15

Laurentius Leonard, Shunsuke Inenaga, Hideo Bannai et al.

The sliding suffix tree (Fiala & Greene, 1989) is a suffix tree that is maintained for a sliding window $W_i = T[i..i+d-1]$ of size $d$ that shifts over an input text $T$ of length $n$ from left to right, for increasing $i = 1, \ldots, n-d+1$. It is known that the sliding suffix tree can be maintained in $O(n \log σ)$ time with $O(d)$ space, where $σ$ is the alphabet size. Updating the sliding suffix tree from $W_i = T[i..i+d-1]$ to $W_{i+1} = T[i+1..i+d]$ requires the following three major tasks: (1) delete the leaf that represents the longest suffix of $W_i$, namely $W_i$ itself; (2) insert new leaves that represent the suffixes of $W_{i+1}$ that appear exactly once in $W_{i+1}$; and (3) after the leaf deletion due to Task (1) and each leaf insertion due to Task (2), maintain the label $\langle \ell, r \rangle$ of every edge as a valid pair in the new window $W_{i+1}$, such that $i+1 \leq \ell \leq r \leq i+d$. In this paper, we present the first algorithm that performs Task (3) in $O(1)$ worst-case time per node deletion/insertion, which yields an alternative approach to efficient sliding suffix tree construction. This is an improvement over the existing algorithms by Larsson (1996, 1999) and by Senft (2005), both of which can only perform Task (3) in $O(1)$ amortized time. Our key data structure is a non-trivial extension of leaf pointers, which were originally proposed by Brodnik and Jekovec (2018) for pattern matching with sliding suffix trees.
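
To make the window notation concrete, the following is a brute-force sketch in Python, not the algorithm of the paper. It lists which suffixes of a window $W_i$ occur exactly once in it (the suffixes that Task (2) represents by leaves) and checks whether a label $\langle \ell, r \rangle$ lies inside the window. The function names `count_occurrences`, `leaf_suffixes`, and `is_valid_label` are hypothetical, and positions are 1-indexed as in the abstract.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text (naive scan)."""
    return sum(
        1
        for s in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, s)
    )


def leaf_suffixes(T, i, d):
    """Return the 1-indexed start positions (in T) of the suffixes of
    W_i = T[i..i+d-1] that occur exactly once inside W_i.

    Illustrative only: this takes polynomial time per window, whereas a
    sliding suffix tree maintains the same information incrementally.
    """
    W = T[i - 1 : i - 1 + d]
    return [i + k for k in range(d) if count_occurrences(W, W[k:]) == 1]


def is_valid_label(l, r, i, d):
    """Check that the pair <l, r> refers to positions inside W_i = T[i..i+d-1]."""
    return i <= l <= r <= i + d - 1


T = "abababb"
print(leaf_suffixes(T, 1, 4))      # unique suffixes of W_1 = "abab"
print(is_valid_label(1, 2, 2, 4))  # a label starting at position 1 is stale in W_2
```

In the window `"abab"`, the suffixes `"ab"` and `"b"` occur twice, so only the two longest suffixes would be leaves. A label that still points at position 1 after the window has moved to $W_2$ is exactly the kind of pair Task (3) has to refresh.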

4.1 DS Jun 16

Grammar Index By Induced Suffix Sorting

Tooru Akagi, Dominik Köppl, Yuto Nakashima et al.

Pattern matching is the most central task for text indices. Most recent indices leverage compression techniques to make pattern matching feasible for massive but highly compressible datasets. Among such indices, we propose a new compressed text index built upon a grammar compression based on induced suffix sorting [Nunes et al., DCC'18]. We show that this grammar exhibits a locality-sensitive parsing property, which allows us to specify, given a pattern $P$, certain substrings of $P$, called cores, that are similarly parsed in the text grammar whenever these occurrences are extensible to occurrences of $P$. Supported by the cores, given a pattern of length $m$, we can locate all its $occ$ occurrences in a text $T$ of length $n$ within $O(m \lg |\mathcal{S}| + occ_C \lg|\mathcal{S}| \lg n + occ)$ time, where $\mathcal{S}$ is the set of all characters and non-terminals, and $occ_C$ is the number of occurrences of a chosen core $C$ of $P$ in the right-hand sides of all production rules of the grammar of $T$. Our grammar index requires $O(g)$ words of space and can be built in $O(n)$ time using $O(g)$ working space, where $g$ is the sum of the lengths of the right-hand sides of all production rules. We underline the strength of our grammar index with an exhaustive practical evaluation that gives evidence that our proposed solution excels at locating long patterns in highly repetitive texts.

8.3 DS Jun 4

Counting Distinct (Non-)Crossing Substrings in Optimal Time

Haruki Umezaki, Hiroki Shibata, Dominik Köppl et al.

Let $w$ be a string of length $n$. The problem of counting factors crossing a position (Problem 64 from the textbook "125 Problems in Text Algorithms" [Crochemore, Lecroq, and Rytter, 2021]) asks to count the number $\mathcal{C}(w,k)$ (resp. $\mathcal{N}(w,k)$) of distinct substrings in $w$ that have occurrences containing (resp. not containing) a position $k$ in $w$. The solutions provided in the textbook compute $\mathcal{C}(w,k)$ and $\mathcal{N}(w,k)$ in $O(n)$ time for a single position $k$ in $w$, and thus a direct application would require $O(n^2)$ time for all positions $k = 1, \ldots, n$ in $w$. These solutions are designed for constant-size alphabets. In this paper, we present new algorithms which compute $\mathcal{C}(w,k)$ in $O(n)$ total time for general ordered alphabets, and $\mathcal{N}(w,k)$ in $O(n)$ total time for linearly sortable alphabets, for all positions $k = 1, \ldots, n$ in $w$. We further derive model-dependent optimal bounds by separating the algorithms into preprocessing and linear-time postprocessing: for $\mathcal{C}$ the preprocessing is run reporting, and for $\mathcal{N}$ it is preprocessing based on longest previous non-overlapping factors (LPnF) and longest next factors (LNF). In particular, all values $\mathcal{C}(w,k)$ can be computed in $O(n \log n)$ time over general unordered alphabets in which direct accesses to alphabet characters are restricted to equality tests, and in $O(n \log σ)$ time in the word RAM model, where $σ$ denotes the number of distinct characters occurring in $w$. For $\mathcal{N}(w,k)$, the equality-testing complexity over general unordered alphabets is $Θ(n^2)$. We also show that our upper bounds are optimal for all of the aforementioned alphabet assumptions and computation models.
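
The two quantities can be pinned down with a brute-force reference in Python. This is only a sketch of the definitions, not the linear-time algorithms of the paper: it enumerates every occurrence, so it takes polynomial time per position. The function name `crossing_counts` is hypothetical, and the sketch assumes that only non-empty substrings are counted.

```python
def crossing_counts(w, k):
    """Return (C(w, k), N(w, k)) for a 1-indexed position k.

    C counts the distinct substrings with at least one occurrence
    w[i..j] such that i <= k <= j; N counts the distinct substrings
    with at least one occurrence that does not contain k.
    A substring may contribute to both counts.
    Assumption: the empty substring is not counted.
    """
    n = len(w)
    crossing, non_crossing = set(), set()
    for i in range(1, n + 1):          # 1-indexed start of the occurrence
        for j in range(i, n + 1):      # 1-indexed end of the occurrence
            s = w[i - 1 : j]
            if i <= k <= j:
                crossing.add(s)
            else:
                non_crossing.add(s)
    return len(crossing), len(non_crossing)


# All positions of a short string, computed independently per position.
w = "aba"
print([crossing_counts(w, k) for k in range(1, len(w) + 1)])
```

For `w = "aba"` and `k = 2`, the substrings with an occurrence containing position 2 are `ab`, `aba`, `b`, and `ba`, while only `a` has an occurrence avoiding it.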

8.2 DS May 28

On the sensitivity of CDAWG-grammars

Hiroto Fujimaru, Shunsuke Inenaga

The compact directed acyclic word graph (CDAWG) [Blumer et al. 1987] of a string is the minimal compact automaton that recognizes all the suffixes of the string. CDAWGs can be used for various string tasks including text pattern searching, data compression, and pattern discovery. The CDAWG-grammar [Belazzougui & Cunial 2017] is a grammar-based text compression scheme built on the CDAWG, which allows for representing the CDAWG in $O(e)$ space without storing the string, where $e$ denotes the number of CDAWG edges. Let $g$ be the size of the CDAWG-grammar for the input string $T$. We show that the worst-case additive sensitivity of the CDAWG-grammar is lower bounded by $3g - 21$ and upper bounded by $8g + 4$.

8.2 DS Jul 1

Online computation of maximal closed substrings

Hiroki Shibata, Haruki Umezaki, Takuya Mieno et al.

A non-empty string is closed if its length is one or its longest border appears exactly twice in the string. An occurrence of a closed substring is a maximal closed substring (MCS) if it cannot be extended to the left or to the right while preserving closedness. MCSs can be regarded as a general class of maximal repetitive structures including runs. In this paper, we study the computation of MCSs of a string given in an online manner, where one character is appended to the string at a time. Our algorithm detects newly formed MCSs after each append operation by using the rightmost previous occurrences of suffixes. To support this efficiently, we introduce the link-cut suffix tree (LCST), a novel data structure combining an online suffix tree with a link-cut tree. The LCST maintains rightmost occurrence information for substrings represented in the suffix tree in $O(n \log n)$ total time and $O(n)$ space, where $n$ is the length of the input string. Using the LCST, we obtain an $O(n \log n)$-time online algorithm for computing all MCSs, which is worst-case optimal. As further direct applications of the LCST, we obtain online algorithms for rightmost LZ77 factorizations and most recent match queries.
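
The definitions of closed strings and MCSs can be checked with a small brute-force sketch in Python. It is not the LCST-based online algorithm described above. It assumes that "extended" means extended by one character on either side, and that a string of length at least two with no non-empty border is not closed. All function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count possibly overlapping occurrences of pattern in text (naive scan)."""
    return sum(
        1
        for s in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, s)
    )


def longest_border(s):
    """Longest proper prefix of s that is also a suffix of s ('' if none)."""
    for length in range(len(s) - 1, 0, -1):
        if s[:length] == s[-length:]:
            return s[:length]
    return ""


def is_closed(s):
    """Closed: length one, or the longest border occurs exactly twice in s."""
    if len(s) == 1:
        return True
    border = longest_border(s)
    # Assumption: an empty longest border means the string is not closed.
    return bool(border) and count_occurrences(s, border) == 2


def maximal_closed_substrings(w):
    """Return 0-indexed inclusive intervals (i, j) of closed occurrences w[i..j]
    that stay closed under neither a one-character left nor right extension."""
    n = len(w)
    result = []
    for i in range(n):
        for j in range(i, n):
            if not is_closed(w[i : j + 1]):
                continue
            left = i > 0 and is_closed(w[i - 1 : j + 1])
            right = j + 1 < n and is_closed(w[i : j + 2])
            if not left and not right:
                result.append((i, j))
    return result


print(maximal_closed_substrings("aab"))
```

In `"aab"`, the occurrence `"aa"` is closed and neither extension is available or closed, so it is maximal. Each single `"a"` extends to the closed string `"aa"` and is therefore not maximal.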

8.5 DS Jul 1

Efficient LCE Queries and Lexicographic Minimizers on Sliding Suffix Trees

Toshiharu Minematsu, Shunsuke Inenaga

We study longest-common-extension (LCE) queries and lexicographic minimizer maintenance on the suffix tree of a sliding window. The main difficulty is that a sliding suffix tree is maintained in an implicit Ukkonen-style form: some suffixes of the current window are not represented by leaves. We show that the longest implicit (i.e., non-leaf) suffix induces a periodic representative map that folds every implicit suffix to an explicit suffix leaf in constant time. Combined with leaf pointers [Leonard et al., PSC 2026] and a dynamic LCA data structure [Cole & Hariharan, SICOMP 2005], this yields a linear-space data structure with amortized constant-time window shifts and worst-case constant-time LCE queries over a constant-size alphabet. For minimizers, the LCE structure gives a direct exact solution, but it uses more machinery than fixed-depth comparisons require. We therefore give an alternative LCE-free algorithm that reports minimizers in constant time per window shift, which is built on BP-linked suffix trees [Sumiyoshi et al., SPIRE 2024] and a standard order maintenance data structure (e.g., [Bender et al., ESA 2002]).
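
As a point of reference for what the queries return, here is a naive Python sketch. It is not the data structure of the paper: it scans characters directly and recomputes everything for each window. The names `lce`, `minimizer`, and `sliding_minimizers`, the k-mer length parameter `k`, and the leftmost tie-breaking rule are all assumptions made for illustration.

```python
def lce(window, a, b):
    """Length of the longest common prefix of window[a:] and window[b:]
    (0-indexed positions), computed by direct character comparison."""
    n = len(window)
    length = 0
    while a + length < n and b + length < n and window[a + length] == window[b + length]:
        length += 1
    return length


def minimizer(window, k):
    """0-indexed start of the lexicographically smallest length-k substring.
    Assumption: ties are broken in favour of the leftmost occurrence."""
    return min(range(len(window) - k + 1), key=lambda s: (window[s : s + k], s))


def sliding_minimizers(T, d, k):
    """Minimizer position (0-indexed in T) for every window of size d."""
    return [i + minimizer(T[i : i + d], k) for i in range(len(T) - d + 1)]


print(lce("abcabd", 0, 3))
print(sliding_minimizers("banana", 4, 2))
```

Each `minimizer` call compares only the first `k` characters of every candidate, which is the "fixed-depth comparison" that makes a full LCE structure more than the minimizer problem strictly needs.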