Nicola Cotumaccio

1.2FLApr 21

A Myhill-Nerode Theorem for Generalized Automata, with Applications to Pattern Matching and Compression

Nicola Cotumaccio

The model of generalized automata, introduced by Eilenberg in 1974, allows representing a regular language more concisely than conventional automata by allowing edges to be labeled not only with characters, but also strings. Giammarresi and Montalbano introduced a notion of determinism for generalized automata [STACS 1995]. While generalized deterministic automata retain many properties of conventional deterministic automata, the uniqueness of a minimal generalized deterministic automaton is lost. In the first part of the paper, we show that the lack of uniqueness can be explained by introducing a set $ \mathcal{W(A)} $ associated with a generalized automaton $ \mathcal{A} $. In this way, we derive for the first time a full Myhill-Nerode theorem for generalized automata, which contains the textbook Myhill-Nerode theorem for conventional automata as a degenerate case. In the second part of the paper, we show that the set $ \mathcal{W(A)} $ leads to applications for pattern matching and data compression. We show that a Wheeler generalized automata can be stored using $ \mathfrak{e} \log σ(1 + o(1)) + O(e) $ bits so that pattern matching queries can be solved in $ O(m \log \log σ) $ time, where $ \mathfrak{e} $ is the total length of all edge labels, $ e $ is the number of edges, $ σ$ is the size of the alphabet and $ m $ is the length of the pattern.

FLJun 4, 2021

On (co-lex) Ordering Automata

Giovanna D'Agostino, Nicola Cotumaccio, Alberto Policriti et al.

The states of a deterministic finite automaton A can be identified with collections of words in Pf(L(A)) -- the set of prefixes of words belonging to the regular language accepted by A. But words can be ordered and among the many possible orders a very natural one is the co-lexicographic one. Such naturalness stems from the fact that it suggests a transfer of the order from words to the automaton's states. In a number of papers automata admitting a total ordering of states coherent with the ordering of the set of words reaching them have been proposed. Such class of ordered automata -- the Wheeler automata -- turned out to be efficiently stored/searched using an index. Unfortunately not all automata can be totally ordered as previously outlined. However, automata can always be partially ordered and an intrinsic measure of their complexity can be defined and effectively determined, as the minimum width of one of their admissible partial orders. As shown in previous works, this new concept of width of an automaton has useful consequences in the fields of graph compression, indexing data structures, and automata theory. In this paper we prove that a canonical, minimum-width, partially-ordered automaton accepting a language L -- dubbed the Hasse automaton H of L -- can be exhibited. H provides, in a precise sense, the best possible way to (partially) order the states of any automaton accepting L, as long as we want to maintain an operational link with the (co-lexicographic) order of Pf(L(A)). Using H we prove that the width of the language can be effectively computed from the minimum automaton recognizing the language. Finally, we explore the relationship between two (often conflicting) objectives: minimizing the width and minimizing the number of states of an automaton.

Nicola Cotumaccio

2 Papers