A General Information Extraction Framework Based on Formal Languages
This work provides a theoretical foundation for information extraction in database theory, though its contribution appears incremental because it builds on the existing document spanner framework.
The authors introduce a general information extraction framework based on formal languages that extends the document spanner framework, and they investigate its closure properties, representation formalisms, and computational complexity.
For a terminal alphabet $Σ$ and an attribute alphabet $Γ$, a $(Σ, Γ)$-extractor is a function that maps every string $w$ over $Σ$ to a table with a column per attribute and with sets of positions of $w$ as cell entries. This rather general information extraction framework extends the well-known document spanner framework, which has been intensively investigated in the database theory community over the last decade. Moreover, our framework is based on formal language theory in a particularly clean and simple way. In addition to this conceptual contribution, we investigate closure properties, different representation formalisms and the complexity of natural decision problems for extractors.
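
To make the definition concrete, the following is a minimal Python sketch of what an extractor's input and output could look like. It is an illustration under stated assumptions, not the paper's formalism: the names `Row`, `Table`, `Extractor` and `digit_run_extractor`, the attribute names `num` and `first`, and the choice of 1-based positions are all hypothetical. The sketch only shows the shape of the mapping, namely a string going in and a table coming out, with one column per attribute and a set of positions in each cell.

```python
from typing import Callable, Dict, FrozenSet, List

# Assumed encoding: a row maps each attribute to a set of positions of w,
# and a table is a list of such rows (one column per attribute).
Row = Dict[str, FrozenSet[int]]
Table = List[Row]
Extractor = Callable[[str], Table]

DIGITS = "0123456789"


def digit_run_extractor(w: str) -> Table:
    """Hypothetical extractor over the attributes {"num", "first"}.

    For every maximal run of digits in w it emits one row:
      "num"   -> all positions of the run
      "first" -> the position of the run's first digit
    Positions are 1-based here; this is an assumption of the sketch.
    """
    table: Table = []
    i = 0
    while i < len(w):
        if w[i] in DIGITS:
            j = i
            # Advance j to the end of the maximal digit run starting at i.
            while j < len(w) and w[j] in DIGITS:
                j += 1
            table.append({
                "num": frozenset(range(i + 1, j + 1)),
                "first": frozenset({i + 1}),
            })
            i = j
        else:
            i += 1
    return table


table = digit_run_extractor("ab12c345")
for row in table:
    print({attr: sorted(cell) for attr, cell in row.items()})
```

On the input `ab12c345` the sketch yields two rows, one per digit run. Because each cell is simply a set of positions, nothing in this encoding forces a cell to be a contiguous interval of the string.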