CLAIIR | May 18, 2023

BioAug: Conditional Generation based Data Augmentation for Low-Resource Biomedical NER

arXiv:2305.10647v1 | 17 citations | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the lack of high-quality labeled data for low-resource BioNER, a problem central to biomedical text analysis, though it is an incremental advance in data augmentation methods.

The paper tackles the problem of data scarcity in biomedical named entity recognition (BioNER) by introducing BioAug, a data augmentation framework that generates factual and diverse augmentations, achieving absolute improvements of 1.5% to 21.5% over baselines on five benchmark datasets.

Biomedical Named Entity Recognition (BioNER) is the fundamental task of identifying named entities in biomedical text. However, BioNER suffers from severe data scarcity and lacks high-quality labeled data because annotation requires highly specialized expert knowledge. Though data augmentation has been shown to be highly effective for low-resource NER in general, existing data augmentation techniques fail to produce factual and diverse augmentations for BioNER. In this paper, we present BioAug, a novel data augmentation framework for low-resource BioNER. BioAug, built on BART, is trained to solve a novel text reconstruction task based on selective masking and knowledge augmentation. After training, we perform conditional generation and generate diverse augmentations by conditioning BioAug on selectively corrupted text, as in the training stage. We demonstrate the effectiveness of BioAug on 5 benchmark BioNER datasets and show that BioAug outperforms all our baselines by a significant margin (1.5%-21.5% absolute improvement) and is able to generate augmentations that are both more factual and more diverse. Code: https://github.com/Sreyan88/BioAug.
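
The abstract does not spell out how the selective masking works, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes BIO-style labels, keeps every entity token, keeps each remaining token with a fixed probability, and collapses runs of dropped tokens into a single `<mask>` (BART's mask token). The function name `selectively_mask` and the parameters `keep_prob` and `seed` are hypothetical, and the knowledge augmentation step is omitted entirely.

```python
import random

MASK = "<mask>"  # mask token string used by BART tokenizers


def selectively_mask(tokens, labels, keep_prob=0.3, seed=None):
    """Corrupt a sentence while preserving its entity tokens.

    Illustrative assumption, not the BioAug algorithm: tokens whose
    label is not "O" are always kept, other tokens are kept with
    probability `keep_prob`, and consecutive dropped tokens are merged
    into one mask so the model must reconstruct a span of unknown length.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    rng = random.Random(seed)
    corrupted = []
    for token, label in zip(tokens, labels):
        keep = label != "O" or rng.random() < keep_prob
        if keep:
            corrupted.append(token)
        elif not corrupted or corrupted[-1] != MASK:
            # Start a new masked span; later dropped tokens join it.
            corrupted.append(MASK)
    return corrupted


tokens = "Aspirin reduces the risk of stroke".split()
labels = ["B-Chemical", "O", "O", "O", "O", "B-Disease"]
print(" ".join(selectively_mask(tokens, labels, keep_prob=0.0)))
# Aspirin <mask> stroke
```

Under these assumptions, a BART-style model trained to reconstruct the original sentence from such corrupted input could later be fed freshly corrupted sentences, with different random seeds, to generate varied augmentations that still contain the original entities.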

Code Implementations: 1 repo