CL · Mar 9, 2022

PET: An Annotated Dataset for Process Extraction from Natural Language Text

Patrizio Bellan, Han van der Aa, Mauro Dragoni, Chiara Ghidini, Simone Paolo Ponzetto

arXiv:2203.04860v23.033 citations · h-index: 31 · Has Code

Originality: Synthesis-oriented

AI Analysis

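This work addresses two obstacles faced by researchers and practitioners in process discovery and natural language processing: extraction approaches cannot be compared objectively, and data-driven methods cannot be applied without annotated text. The contribution is incremental in that it fills a gap in existing resources rather than proposing a new extraction technique.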

The authors tackled the lack of gold-standard annotated datasets for process extraction from natural language text by creating the PET dataset, which includes business process descriptions annotated with activities, gateways, actors, and flow information, and they provided baselines to benchmark extraction challenges.

Process extraction from text is an important task of process discovery, for which various approaches have been developed in recent years. However, in contrast to other information extraction tasks, there is a lack of gold-standard corpora of business process descriptions that are carefully annotated with all the entities and relationships of interest. Due to this, it is currently hard to compare the results obtained by extraction approaches in an objective manner, whereas the lack of annotated texts also prevents the application of data-driven information extraction methodologies, typical of the natural language processing field. Therefore, to bridge this gap, we present the PET dataset, a first corpus of business process descriptions annotated with activities, gateways, actors, and flow information. We present our new resource, including a variety of baselines to benchmark the difficulty and challenges of business process extraction from text. PET can be accessed via huggingface.co/datasets/patriziobellan/PET
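To make the annotation layers named in the abstract (activities, gateways, actors, flow) more concrete, the following is a minimal sketch of how such annotations could be represented and decoded. It is not the PET schema: the sentence, the BIO tag names (`Actor`, `Activity`, `Gateway`), the `Span` class, and the `FLOW` pairs are all hypothetical and chosen only for illustration. The actual label set and file layout are defined by the dataset at the address given above.

```python
from dataclasses import dataclass

# Hypothetical example sentence and BIO tags; the real PET labels may differ.
TOKENS = ["The", "clerk", "checks", "the", "invoice", ".",
          "If", "it", "is", "valid", ",",
          "the", "manager", "approves", "it", "."]
TAGS = ["B-Actor", "I-Actor", "B-Activity", "O", "O", "O",
        "B-Gateway", "O", "O", "O", "O",
        "B-Actor", "I-Actor", "B-Activity", "O", "O"]


@dataclass(frozen=True)
class Span:
    label: str  # element type, e.g. "Actor"
    start: int  # index of the first token in the span
    end: int    # index one past the last token in the span
    text: str   # surface form, tokens joined by spaces


def decode_spans(tokens, tags):
    """Group BIO-tagged tokens into labeled spans.

    An I- tag that does not continue a span of the same label is
    treated like O, i.e. it closes any open span and starts nothing.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans = []
    start, label = None, None

    def close(end):
        nonlocal start, label
        if start is not None:
            spans.append(Span(label, start, end, " ".join(tokens[start:end])))
        start, label = None, None

    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            close(i)
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue  # span keeps growing
        else:
            close(i)
    close(len(tags))
    return spans


SPANS = decode_spans(TOKENS, TAGS)

# Hypothetical flow relations, as (source span index, target span index).
FLOW = [(1, 2), (2, 4)]


def flow_as_text(spans, flow):
    """Render each flow relation as 'source -> target'."""
    return [f"{spans[s].text} -> {spans[t].text}" for s, t in flow]


if __name__ == "__main__":
    for span in SPANS:
        print(span.label, repr(span.text))
    for line in flow_as_text(SPANS, FLOW):
        print(line)
```

Keeping the token-level tags separate from the relations between spans mirrors the distinction the abstract draws between process elements and flow information, and it lets each layer be evaluated on its own.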

