A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression
This provides a foundational theory for generalized automata with applications in pattern matching and compression, though it is incremental on prior work.
The paper tackles the lack of a unique minimal generalized deterministic automaton by introducing a set to derive a full Myhill-Nerode theorem for generalized automata, and applies this to achieve pattern matching in O(m log log σ) time with storage of e log σ(1 + o(1)) + O(e) bits.
The model of generalized automata, introduced by Eilenberg in 1974, allows representing a regular language more concisely than conventional automata by allowing edges to be labeled not only with characters, but also strings. Giammarresi and Montalbano introduced a notion of determinism for generalized automata [STACS 1995]. While generalized deterministic automata retain many properties of conventional deterministic automata, the uniqueness of a minimal generalized deterministic automaton is lost. In the first part of the paper, we show that the lack of uniqueness can be explained by introducing a set $ \mathcal{W(A)} $ associated with a generalized automaton $ \mathcal{A} $. In this way, we derive for the first time a full Myhill-Nerode theorem for generalized automata, which contains the textbook Myhill-Nerode theorem for conventional automata as a degenerate case. In the second part of the paper, we show that the set $ \mathcal{W(A)} $ leads to applications for pattern matching and data compression. We show that a Wheeler generalized automata can be stored using $ \mathfrak{e} \log σ(1 + o(1)) + O(e) $ bits so that pattern matching queries can be solved in $ O(m \log \log σ) $ time, where $ \mathfrak{e} $ is the total length of all edge labels, $ e $ is the number of edges, $ σ$ is the size of the alphabet and $ m $ is the length of the pattern.