CL

Apr 2, 2023

A Data-centric Framework for Improving Domain-specific Machine Reading Comprehension Datasets

Iva Bojic, Josef Halim, Verena Suharman, Sreeja Tar, Qi Chwen Ong, Duy Phung, Mathieu Ravaut, Shafiq Joty, Josip Car

arXiv:2304.00483v227.9263 citations
h-index: 89

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for high-quality training data in domain-specific applications such as biomedical machine reading comprehension (MRC), where expert annotation is costly. Its contribution is incremental: it applies existing methods, such as back translation, to improve datasets.

The paper tackles the problem of low-quality data in domain-specific machine reading comprehension by proposing a data-centric framework for enhancing dataset quality. On the BioASQ dataset, the framework yields relative improvements of up to 33% for fine-tuned retrieval models and up to 40% for fine-tuned reader models.
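The gains above are relative, not absolute, so they depend on the baseline score. A minimal sketch of the arithmetic follows; the function name and the example scores are illustrative assumptions and are not taken from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative improvement of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical scores chosen only to illustrate the calculation.
    print(f"{relative_improvement(0.30, 0.40):.1f}%")
```

With these made-up scores, an absolute gain of 0.10 on a baseline of 0.30 corresponds to a relative gain of roughly one third.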

Low-quality data can cause downstream problems in high-stakes applications. A data-centric approach emphasizes improving dataset quality to enhance model performance. High-quality datasets are needed for training general-purpose Large Language Models (LLMs) as well as domain-specific models; domain-specific datasets are usually small because it is costly to engage a large number of domain experts to create them. Thus, it is vital to ensure high-quality domain-specific training data. In this paper, we propose a framework for enhancing the data quality of original datasets. We applied the proposed framework to four biomedical datasets and showed relative improvements of up to 33% and 40% for fine-tuning of retrieval and reader models, respectively, on the BioASQ dataset when using back translation to enhance the original dataset quality.
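Back translation, the enhancement technique named in the abstract, paraphrases a text by translating it into a pivot language and back. The sketch below shows the general idea only and is not the authors' pipeline: the translators are passed in as plain callables (hypothetical stand-ins for whatever translation model is available), and keeping each original question alongside its paraphrase is an assumption made for illustration.

```python
from typing import Callable, Iterable, List

# A translator is any function mapping a string to a string (assumed interface).
Translator = Callable[[str], str]


def back_translate(text: str, to_pivot: Translator, from_pivot: Translator) -> str:
    """Translate `text` into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(text))


def augment_questions(
    questions: Iterable[str],
    to_pivot: Translator,
    from_pivot: Translator,
) -> List[str]:
    """Return each original question followed by its back-translated paraphrase.

    Paraphrases that are empty or identical to the original (ignoring case and
    surrounding whitespace) are skipped, since they add no new wording.
    """
    augmented: List[str] = []
    for question in questions:
        augmented.append(question)
        paraphrase = back_translate(question, to_pivot, from_pivot)
        if paraphrase.strip() and paraphrase.strip().lower() != question.strip().lower():
            augmented.append(paraphrase)
    return augmented
```

In practice `to_pivot` and `from_pivot` would wrap a real machine translation system; injecting them as arguments keeps the sketch independent of any particular library.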
