CL, LG | Jul 1, 2019

Is artificial data useful for biomedical Natural Language Processing algorithms?

Zixu Wang, Julia Ive, Sumithra Velupillai, Lucia Specia

arXiv:1907.01055v231.01090 citations

Originality: Incremental advance

AI Analysis

This work addresses data accessibility issues for researchers and practitioners in biomedical NLP and offers a practical way to supplement model training data. The advance is incremental because it builds on prior work in data generation.

The paper tackles data scarcity in biomedical NLP by generating artificial clinical text guided by key phrases. It shows that using this data alongside real training data boosts performance for neural networks, and that artificial data can fully replace real training data in some setups.

A major obstacle to the development of Natural Language Processing (NLP) methods in the biomedical domain is data accessibility. This problem can be addressed by generating medical data artificially. Most previous studies have focused on the generation of short clinical text, and evaluation of the data utility has been limited. We propose a generic methodology to guide the generation of clinical text with key phrases. We use the artificial data as additional training data in two key biomedical NLP tasks: text classification and temporal relation extraction. We show that artificially generated training data used in conjunction with real training data can lead to performance boosts for data-greedy neural network algorithms. We also demonstrate the usefulness of the generated data for NLP setups where it fully replaces real training data.
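The abstract describes training setups in which artificial data either supplements or fully replaces real training data. A minimal sketch of how such training sets could be assembled follows. It also includes a real-only baseline for comparison, which is an assumption of this sketch. The function name `build_training_setups`, the setup names, and the placeholder examples are hypothetical and do not come from the paper, and the sketch does not reproduce the authors' generation or evaluation pipeline.

```python
from typing import Dict, List, Tuple

# An example is a (text, label) pair; this representation is an assumption.
Example = Tuple[str, str]


def build_training_setups(
    real: List[Example], artificial: List[Example]
) -> Dict[str, List[Example]]:
    """Assemble three training sets from real and artificial examples."""
    return {
        # Baseline for comparison: real data only.
        "real_only": list(real),
        # Artificial data used in conjunction with real data.
        "real_plus_artificial": list(real) + list(artificial),
        # Artificial data fully replaces real data.
        "artificial_only": list(artificial),
    }


# Hypothetical placeholder examples, not data from the paper.
real_train = [
    ("placeholder real note A", "label_1"),
    ("placeholder real note B", "label_2"),
]
artificial_train = [("placeholder generated note C", "label_1")]

setups = build_training_setups(real_train, artificial_train)
for name, examples in setups.items():
    print(name, len(examples))
```

Each training set would then be passed to the same model and scored on the same held-out real test set, so that any difference in performance can be attributed to the training data alone.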
