CLMay 11, 2020

Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data

Soumya Banerjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay, Plaban Kumar Bhowmick, Parthapratim Das

arXiv:2005.05414v216 citations

AI Analysis

This work addresses the need for discourse-level search and recommendation in non-bio domains like computer science, though it is incremental as it adapts existing methods to a new domain.

The paper tackled the problem of automatically segmenting scientific abstracts into discourse categories in fields with sparse labeled data, achieving 75% accuracy on a test corpus of computer science papers using transfer learning from biomedical abstracts.

The abstract of a scientific paper distills the contents of the paper into a short paragraph. In the biomedical literature, it is customary to structure an abstract into discourse categories like BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION, but this segmentation is uncommon in other fields like computer science. Explicit categories could be helpful for more granular, that is, discourse-level search and recommendation. The sparsity of labeled data makes it challenging to construct supervised machine learning solutions for automatic discourse-level segmentation of abstracts in non-bio domains. In this paper, we address this problem using transfer learning. In particular, we define three discourse categories BACKGROUND, TECHNIQUE, OBSERVATION-for an abstract because these three categories are the most common. We train a deep neural network on structured abstracts from PubMed, then fine-tune it on a small hand-labeled corpus of computer science papers. We observe an accuracy of 75% on the test corpus. We perform an ablation study to highlight the roles of the different parts of the model. Our method appears to be a promising solution to the automatic segmentation of abstracts, where the labeled data is sparse.

View on arXiv PDF

Similar