CL AI · May 19, 2023

DMDD: A Large-Scale Dataset for Dataset Mentions Detection

Huitong Pan, Qi Zhang, Eduard Dragut, Cornelia Caragea, Longin Jan Latecki

arXiv:2305.11779v1 21.7134 citations

Originality: Synthesis-oriented

AI Analysis

This paper addresses researchers' need for better automatic information extraction from scientific literature, though its contribution is incremental, as it primarily provides a new dataset.

The paper tackles the problem of limited size and naming diversity in corpora for dataset mention detection by introducing DMDD, the largest publicly available corpus with over 449,000 dataset mentions from 31,219 scientific articles, and establishes baseline performance for this task.

The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.
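
To make the idea of in-text span annotations concrete, the following is a minimal sketch of how predicted dataset mentions might be scored against gold spans. It assumes spans are half-open character offsets and that scoring is exact-match precision, recall, and F1; the `Span` type, the `span_prf` function, and the example sentence are hypothetical illustrations, not the DMDD file format or the paper's evaluation protocol.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical in-text span: character offsets [start, end) into the article text."""
    start: int
    end: int


def span_prf(gold: set[Span], predicted: set[Span]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative sentence (an assumption, not taken from the corpus).
text = "We train on SQuAD and evaluate on ImageNet."
gold = {Span(12, 17), Span(34, 42)}       # "SQuAD", "ImageNet"
predicted = {Span(12, 17), Span(22, 30)}  # "SQuAD", "evaluate" (a false positive)

precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice; a partial-overlap criterion would credit predictions that capture only part of a dataset name, which may matter for long or nested names.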
