LSOIE: A Large-Scale Dataset for Supervised Open Information Extraction
This work addresses a data bottleneck in natural language processing by building a large-scale OIE dataset that supports tasks such as knowledge base creation and textual entailment. The contribution is incremental in that it builds on existing QA-SRL data.
The authors address the limited size and diversity of Open Information Extraction (OIE) datasets by introducing LSOIE, a new dataset converted from QA-SRL 2.0 that is 20 times larger than the next largest human-annotated OIE dataset. They also provide benchmark models and baselines for evaluation.
Open Information Extraction (OIE) systems seek to compress the factual propositions of a sentence into a series of n-ary tuples. These tuples are useful for downstream tasks in natural language processing such as knowledge base creation, textual entailment, and natural language understanding. However, current OIE datasets are limited in both size and diversity. We introduce a new dataset by converting the QA-SRL 2.0 dataset to a large-scale OIE dataset (LSOIE). Our LSOIE dataset is 20 times larger than the next largest human-annotated OIE dataset. We construct and evaluate several benchmark OIE models on LSOIE, providing baselines for future improvements on the task. Our LSOIE data, models, and code are made publicly available.
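To make the notion of an n-ary tuple concrete, the following is a minimal sketch of how a single extraction might be represented. The `Extraction` class, its field names, and the example sentence are assumptions made for illustration. They are not taken from the LSOIE data format or the released code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Extraction:
    """Hypothetical container for one n-ary OIE tuple.

    A predicate plus an ordered sequence of argument spans.
    This is an illustrative structure, not the LSOIE schema.
    """
    predicate: str
    arguments: Tuple[str, ...]

    @property
    def arity(self) -> int:
        # Number of arguments attached to the predicate.
        return len(self.arguments)

    def as_tuple(self) -> Tuple[str, ...]:
        # Flatten to the common (arg0, predicate, arg1, ..., argN) layout.
        if not self.arguments:
            return (self.predicate,)
        return (self.arguments[0], self.predicate, *self.arguments[1:])


# Illustrative sentence (not drawn from the dataset).
sentence = "Marie Curie won the Nobel Prize in Physics in 1903."

extraction = Extraction(
    predicate="won",
    arguments=("Marie Curie", "the Nobel Prize in Physics", "in 1903"),
)

print(extraction.as_tuple())
```

A sentence with several predicates would yield several such tuples, one per proposition, which is what "a series of n-ary tuples" refers to above.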