CL | Jan 25, 2021

TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics

arXiv:2101.10273v1807 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for specialized entity tagging in scientific papers to support summarization and knowledge discovery. The contribution is incremental: it builds on prior information extraction efforts.

The authors address the extraction of Tasks, Datasets, and Metrics from scientific literature by creating a new annotated corpus of 2,000 sentences from NLP papers. They report extraction experiments that use a simple data augmentation strategy and apply their tagger to around 30,000 NLP papers from the ACL Anthology.

Tasks, Datasets and Evaluation Metrics are important concepts for understanding experimental scientific papers. However, most previous work on information extraction for scientific literature focuses only on abstracts and does not treat datasets as a separate type of entity (Zadeh and Schumann, 2016; Luan et al., 2018). In this paper, we present a new corpus that contains domain expert annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences extracted from NLP papers. We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology. The corpus is made publicly available to the community to foster research on scientific publication summarization (Erera et al., 2019) and knowledge discovery.
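
To make the annotation target concrete, here is a minimal sketch of how Task, Dataset, and Metric spans on one sentence could be turned into token-level BIO tags for a sequence tagger. The label names (TASK, DATASET, METRIC), the span format, the helper `spans_to_bio`, and the example sentence are all assumptions made for illustration; the abstract does not specify the corpus file format or the tagging scheme the authors used.

```python
from typing import List, Tuple

# Hypothetical label set mirroring the three entity types in the abstract.
LABELS = {"TASK", "DATASET", "METRIC"}


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"span out of range: ({start}, {end})")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: ({start}, {end})")
        # The first token of an entity gets B-, the remaining tokens get I-.
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Made-up example sentence, not taken from the corpus.
tokens = "We evaluate machine translation on WMT14 using BLEU .".split()
spans = [(2, 4, "TASK"), (5, 6, "DATASET"), (7, 8, "METRIC")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Keeping the three types as separate labels reflects the abstract's point that datasets should be treated as their own entity type rather than folded into a broader category.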

Code Implementations: 1 repo