CL AI Apr 10, 2022

Data Augmentation for Biomedical Factoid Question Answering

Dimitris Pappas, Prodromos Malakasiotis, Ion Androutsopoulos

arXiv:2204.04711v132.0640 citations h-index: 47 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of limited training data in biomedical QA and offers practical guidance for researchers and practitioners, though it is incremental in that it applies existing augmentation techniques to a specific domain.

The study investigated the impact of seven data augmentation methods on biomedical factoid question answering, finding that word2vec-based word substitution performed best and led to significant performance gains, even with large pre-trained Transformers.

We study the effect of seven data augmentation (DA) methods in factoid question answering, focusing on the biomedical domain, where obtaining training instances is particularly difficult. We experiment with data from the BioASQ challenge, which we augment with training instances obtained from an artificial biomedical machine reading comprehension dataset, or via back-translation, information retrieval, word substitution based on word2vec embeddings or masked language modeling, question generation, or extending the given passage with additional context. We show that DA can lead to very significant performance gains, even when using large pre-trained Transformers, contributing to a broader discussion of whether and when DA benefits large pre-trained models. One of the simplest DA methods, word2vec-based word substitution, performed best and is recommended. We release our artificial training instances and code.
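To make the recommended method concrete, the following is a minimal sketch of embedding-based word substitution, not the authors' released implementation. It assumes a tiny hand-written embedding table (`TOY_EMBEDDINGS`, with made-up vectors) in place of real word2vec vectors, and the helper names `cosine`, `nearest_neighbor`, and `augment`, the `protected` set, and the `prob` parameter are all hypothetical. The idea illustrated is only the general one: replace some tokens with their nearest neighbor in embedding space, while leaving answer tokens untouched so the training label stays valid.

```python
import math
import random

# Toy 3-d vectors standing in for real word2vec embeddings (values are made up).
TOY_EMBEDDINGS = {
    "drug": [0.9, 0.1, 0.0],
    "medication": [0.85, 0.15, 0.05],
    "treats": [0.1, 0.9, 0.0],
    "cures": [0.15, 0.85, 0.1],
    "disease": [0.0, 0.2, 0.9],
    "illness": [0.05, 0.25, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, embeddings):
    """Return the most similar other word, or None if the word has no vector."""
    if word not in embeddings:
        return None
    scored = [
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    return max(scored)[1] if scored else None


def augment(tokens, embeddings, protected=frozenset(), prob=0.5, rng=None):
    """Replace each unprotected, in-vocabulary token with its nearest
    neighbor with probability `prob`; all other tokens are copied as-is."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        neighbor = None if tok in protected else nearest_neighbor(tok, embeddings)
        if neighbor is not None and rng.random() < prob:
            out.append(neighbor)
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    passage = ["the", "drug", "treats", "disease"]
    # Protect the (hypothetical) answer token so the label remains correct.
    print(augment(passage, TOY_EMBEDDINGS, protected={"disease"}, prob=1.0))
```

In a realistic setting the toy table would be replaced by pretrained biomedical word2vec vectors, loaded for example with a library such as gensim, and the substitution probability and the choice of which tokens to protect would need tuning against the task.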
