Data Augmentation Techniques for Process Extraction from Scientific Publications
This work addresses the challenge of low-resource settings in scientific domains such as chemistry, where data scarcity leads to overfitting. The contribution is incremental, since it builds on existing sequence labeling methods.
The paper tackles process extraction from scientific publications by proposing data augmentation techniques that improve model performance, achieving an increase of up to 12.3 points in F-score on chemistry datasets.
We present data augmentation techniques for process extraction tasks in scientific publications. We cast process extraction as a sequence labeling task in which we identify all the entities in a sentence and label them according to their process-specific roles. The proposed methods attempt to create meaningful augmented sentences by utilizing (1) process-specific information from the original sentence, (2) role label similarity, and (3) sentence similarity. We demonstrate that the proposed methods substantially improve the performance of a process extraction model trained on chemistry domain datasets, with an improvement of up to 12.3 points in F-score. The proposed methods could also reduce overfitting, especially when training on small datasets or in low-resource settings such as chemistry and other scientific domains.
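To make the sequence labeling framing concrete, the following is a minimal sketch of one way a role-label-based augmentation could work. It is not the authors' implementation. It assumes BIO-tagged sentences and replaces one entity mention with a different mention that carries the same role label elsewhere in the corpus. The role names (Reactant, Solvent), the toy sentences, and the function names are all hypothetical and chosen only for illustration. Exact label matching stands in here for the role label similarity the abstract mentions, and sentence similarity is not modeled.

```python
import random
from collections import defaultdict


def extract_spans(sentence):
    """Return (start, end, role) for each BIO-tagged entity; end is exclusive."""
    spans = []
    start, role = None, None
    for i, (_, tag) in enumerate(sentence):
        if tag.startswith("B-") or (tag.startswith("I-") and role != tag[2:]):
            # A new entity begins; close the previous one if it is still open.
            if start is not None:
                spans.append((start, i, role))
            start, role = i, tag[2:]
        elif tag == "O":
            if start is not None:
                spans.append((start, i, role))
            start, role = None, None
    if start is not None:
        spans.append((start, len(sentence), role))
    return spans


def build_mention_pool(corpus):
    """Group entity mentions (as token tuples) by their role label."""
    pool = defaultdict(set)
    for sentence in corpus:
        for start, end, role in extract_spans(sentence):
            pool[role].add(tuple(tok for tok, _ in sentence[start:end]))
    return pool


def augment(sentence, pool, rng):
    """Swap one entity for another mention with the same role label.

    Returns a new tagged sentence, or None if no replacement is available.
    """
    spans = extract_spans(sentence)
    if not spans:
        return None
    start, end, role = rng.choice(spans)
    original = tuple(tok for tok, _ in sentence[start:end])
    # Sort so that results are reproducible for a fixed seed.
    candidates = sorted(pool.get(role, set()) - {original})
    if not candidates:
        return None
    new_tokens = rng.choice(candidates)
    new_tags = ["B-" + role] + ["I-" + role] * (len(new_tokens) - 1)
    return sentence[:start] + list(zip(new_tokens, new_tags)) + sentence[end:]


# Hypothetical toy corpus; the role labels are illustrative only.
corpus = [
    [("Dissolve", "O"), ("sodium", "B-Reactant"), ("chloride", "I-Reactant"),
     ("in", "O"), ("water", "B-Solvent")],
    [("Mix", "O"), ("ethanol", "B-Solvent"), ("with", "O"),
     ("benzoic", "B-Reactant"), ("acid", "I-Reactant")],
]
pool = build_mention_pool(corpus)
augmented = augment(corpus[0], pool, random.Random(0))
print(augmented)
```

Because the replacement mention is re-tagged with the role of the span it replaces, the augmented sentence keeps the same sequence of role labels as the original, so it can be added to the training set without further annotation.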