21.1FLMar 16
MSO-Enumeration Over SLP-Compressed Unranked ForestsMarkus Lohrey, Markus L. Schmid
We study the problem of enumerating the answers to a query formulated in monadic second order logic (MSO) over an unranked forest F that is compressed by a straight-line program (SLP) D. Our main result states that this can be done after O(|D|) preprocessing and with output-linear delay (in data complexity). This is a substantial improvement over the previously known algorithms for MSO-evaluation over trees, since the compressed size |D| might be much smaller than (or even logarithmic in) the actual data size |F|, and there are linear time SLP-compressors that yield very good compressions on practical inputs. In particular, this also constitutes a meta-theorem in the field of algorithmics on SLP-compressed inputs: all enumeration problems on trees or strings that can be formulated in MSO-logic can be solved with linear preprocessing and output-linear delay, even if the inputs are compressed by SLPs. We also show that our approach can support vertex relabelling updates in time that is logarithmic in the uncompressed data. Our result extends previous work on the enumeration of MSO-queries over uncompressed trees and on the enumeration of document spanners over compressed text documents.
17.5FLMar 17
A General Information Extraction Framework Based on Formal LanguagesMarkus L. Schmid
For a terminal alphabet $Σ$ and an attribute alphabet $Î$, a $(Σ, Î)$-extractor is a function that maps every string over $Σ$ to a table with a column per attribute and with sets of positions of $w$ as cell entries. This rather general information extraction framework extends the well-known document spanner framework, which has intensively been investigated in the database theory community over the last decade. Moreover, our framework is based on formal language theory in a particularly clean and simple way. In addition to this conceptual contribution, we investigate closure properties, different representation formalisms and the complexity of natural decision problems for extractors.
DBOct 26, 2020
Refl-Spanners: A Purely Regular Approach to Non-Regular Core SpannersMarkus L. Schmid, Nicole Schweikardt
The regular spanners (characterised by vset-automata) are closed under the algebraic operations of union, join and projection, and have desirable algorithmic properties. The core spanners (introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015) as a formalisation of the core functionality of the query language AQL used in IBM's SystemT) additionally need string-equality selections and it has been shown by Freydenberger and Holldack (ICDT 2016, Theory of Computing Systems 2018) that this leads to high complexity and even undecidability of the typical problems in static analysis and query evaluation. We propose an alternative approach to core spanners: by incorporating the string-equality selections directly into the regular language that represents the underlying regular spanner (instead of treating it as an algebraic operation on the table extracted by the regular spanner), we obtain a fragment of core spanners that, while having slightly weaker expressive power than the full class of core spanners, arguably still covers the intuitive applications of string-equality selections for information extraction and has much better upper complexity bounds of the typical problems in static analysis and query evaluation.
FLMar 14, 2019
Regular Expressions with Backreferences: Polynomial-Time Matching TechniquesMarkus L. Schmid
Regular expressions with backreferences (regex, for short), as supported by most modern libraries for regular expression matching, have an NP-complete matching problem. We define a complexity parameter of regex, called active variable degree, such that regex with this parameter bounded by a constant can be matched in polynomial-time. Moreover, we formulate a novel type of determinism for regex (on an automaton-theoretic level), which yields the class of memory-deterministic regex that can be matched in time O(|w|p(|r|)) for a polynomial p (where r is the regex and w the word). Natural extensions of these concepts lead to properties of regex that are intractable to check.