DBAIJun 5, 2019

An Effective Algorithm for Learning Single Occurrence Regular Expressions with Interleaving

arXiv:1906.02074v13 citations
Originality Incremental advance
AI Analysis

This work addresses schema inference for XML documents, a practical issue in data management, but appears incremental as it builds on prior subclasses of regular expressions.

The paper tackles the problem of inferring XML schemas from documents without a schema by proposing a new subclass called Single Occurrence Regular Expressions with Interleaving (SOIRE) that supports unrestricted interleaving, and develops an algorithm iSOIRE for learning SOIREs, with experiments on real datasets showing high preciseness and conciseness compared to existing methods.

The advantages offered by the presence of a schema are numerous. However, many XML documents in practice are not accompanied by a (valid) schema, making schema inference an attractive research problem. The fundamental task in XML schema learning is inferring restricted subclasses of regular expressions. Most previous work either lacks support for interleaving or only has limited support for interleaving. In this paper, we first propose a new subclass Single Occurrence Regular Expressions with Interleaving (SOIRE), which has unrestricted support for interleaving. Then, based on single occurrence automaton and maximum independent set, we propose an algorithm iSOIRE to infer SOIREs. Finally, we further conduct a series of experiments on real datasets to evaluate the effectiveness of our work, comparing with both ongoing learning algorithms in academia and industrial tools in real-world. The results reveal the practicability of SOIRE and the effectiveness of iSOIRE, showing the high preciseness and conciseness of our work.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes