CL · Nov 14, 2023

All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction

Yuhan Li, Jian Wu, Zhiwei Yu, Börje F. Karlsson, Wei Shen, Manabu Okumura, Chin-Yew Lin

Peking U

arXiv:2311.08189v31.34 citations · h-index: 19

Originality: Incremental advance

AI Analysis

The paper addresses a data-availability gap for researchers working on cross-modality information extraction from scientific papers. The advance is incremental: it builds on existing SciIE efforts by extending them to multi-modal data.

The paper tackles the lack of cross-modality datasets for scientific information extraction by proposing a semi-supervised pipeline that annotates entities in text and both entities and relations in tables. It releases a high-quality benchmark and a large-scale corpus, reports baseline performance for state-of-the-art IE models, and explores ChatGPT's capabilities on the task.

Extracting key information from scientific papers has the potential to help researchers work more efficiently and accelerate the pace of scientific progress. Over the last few years, research on Scientific Information Extraction (SciIE) has witnessed the release of several new systems and benchmarks. However, existing paper-focused datasets mostly cover only specific parts of a manuscript (e.g., abstracts) and are single-modality (i.e., text- or table-only), due to complex processing and expensive annotations. Moreover, core information can be present in either text or tables or across both. To close this gap in data availability and enable cross-modality IE, while alleviating labeling costs, we propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure. Based on this pipeline, we release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline. We further report the performance of state-of-the-art IE models on the proposed benchmark dataset, as a baseline. Lastly, we explore the potential capability of large language models such as ChatGPT for the current task. Our new dataset, results, and analysis validate the effectiveness and efficiency of our semi-supervised pipeline, and we discuss its remaining limitations.
