Tomasz Kociumaka

22.5DSApr 17

The Communication Complexity of Pattern Matching with Edits Revisited

Tomasz Kociumaka, Jakob Nogler, Philip Wellnitz

In the decades-old Pattern Matching with Edits problem, given a length-$n$ string $T$ (the text), a length-$m$ string $P$ (the pattern), and a positive integer $k$ (the threshold), the task is to list the $k$-error occurrences of $P$ in $T$, that is, all fragments of $T$ whose edit distance to $P$ is at most $k$. The one-way communication complexity of Pattern Matching with Edits is the minimum number of bits that Alice, given an instance $(P, T, k)$ of the problem, must send to Bob so that Bob can reconstruct the answer solely from that message. For the natural parameter regime of $0 < k < m < n/2$, our recent work [STOC'24] yields that $Ω(n/m \cdot k \log(m/k))$ bits are necessary and $O(n/m \cdot k \log^2 m)$ bits are sufficient for Pattern Matching with Edits. More generally, for strings over an alphabet $Σ$, our recent work [STOC'24] gives an $O(n/m \cdot k \log m \log(m|Σ|))$-bit encoding that allows one to recover a shortest sequence of edits for every $k$-error occurrence of $P$ in $T$. In this work, we revisit the original proof and improve the encoding size to $O(n/m \cdot k \log(m|Σ|/k))$, which matches the lower bound for constant-sized alphabets. We further establish a new tight lower bound of $Ω(n/m \cdot k \log(m|Σ|/k))$ for the edit sequence reporting variant that we solve. Our encoding size also matches the communication complexity established for the simpler Pattern Matching with Mismatches problem in the context of streaming algorithms [Clifford, Kociumaka, Porat; SODA'19].

1.1DSMay 12

Random Access in Grammar-Compressed Strings: Optimal Trade-Offs in Almost All Parameter Regimes

Anouk Duyster, Tomasz Kociumaka

A Random Access query to a string $T\in [0..σ)^n$ asks for the character $T[i]$ at a given position $i\in [0..n)$. In $O(n\logσ)$ bits of space, this fundamental task admits constant-time queries. While this is optimal in the worst case, much research has focused on compressible strings, hoping for smaller data structures that still admit efficient queries. We investigate the grammar-compressed setting, where $T$ is represented by a straight-line grammar. Our main result is a general trade-off that optimizes Random Access time as a function of string length $n$, grammar size (the total length of productions) $g$, alphabet size $σ$, data structure size $M$, and word size $w=Ω(\log n)$ of the word RAM model. For any $M$ with $g\log n<Mw<n\logσ$, we show an $O(M)$-size data structure with query time $O(\frac{\log(n\logσ\,/\,Mw)}{\log(Mw\,/\,g\log n)})$. Remarkably, we also prove a matching unconditional lower bound that holds for all parameter regimes except very small grammars and relatively small data structures. Previous work focused on query time as a function of $n$ only, achieving $O(\log n)$ time using $O(g)$ space [Bille et al.; SIAM J. Comput. 2015] and $O(\frac{\log n}{\log \log n})$ time using $O(g\log^ε n)$ space for any constant $ε> 0$ [Belazzougui et al.; ESA'15], [Ganardi, Jeż, Lohrey; J. ACM 2021]. The only tight lower bound [Verbin and Yu; CPM'13] was $Ω(\frac{\log n}{\log\log n})$ for $w=Θ(\log n)$, $n^{Ω(1)}\le g\le n^{1-Ω(1)}$, and $M=g\log^{Θ(1)}n$. In contrast, our result yields tight bounds in all relevant parameters and almost all regimes. Our data structure admits efficient deterministic construction. It relies on novel grammar transformations that generalize contracting grammars [Ganardi; ESA'21]. Beyond Random Access, its variants support substring extraction, rank, and select.

Tomasz Kociumaka

2 Papers