CL, IR, LG · Oct 23, 2019

Healthcare NER Models Using Language Model Pretraining

Amogh Kamat Tarcar, Aashis Tiwari, Vineet Naique Dhaimodker, Penjo Rebelo, Rahul Desai, Dattaraj Rao

arXiv:1910.11241v21.732 citations

Originality: Synthesis-oriented

AI Analysis

This work addresses the challenge of domain shifts in healthcare NLP for tasks like adverse drug reaction studies, though it is incremental in applying existing techniques to medical data.

The paper tackles the extraction of structured information from unstructured Electronic Health Records by developing a custom Named Entity Recognition model that uses language model pretraining and transfer learning. Trained on 50% of the available training data, the model reaches an F1 score of 0.734, compared with 0.704 for a baseline spaCy model without a language model component trained on the full data.

In this paper, we present our approach to extracting structured information from unstructured Electronic Health Records (EHR) [2], which can be used, for example, to study adverse drug reactions in patients due to chemicals in their products. Our solution combines Natural Language Processing (NLP) techniques with a web-based annotation tool to optimize the performance of a custom Named Entity Recognition (NER) [1] model trained on a limited amount of EHR training data. This work was presented at the first Health Search and Data Mining Workshop (HSDM 2020) [26]. We showcase a combination of tools and techniques that leverage recent advancements in NLP to address domain shifts by applying transfer learning and language model pre-training [3]. We compare our technique with current popular approaches and show both the increase in performance of the NER model and the reduction in time needed to annotate data. A key observation is that the model trained with our approach on just 50% of the available training data achieves an F1 score of 0.734, outperforming the blank spaCy model without a language model component, which achieves 0.704 when trained on 100% of the available training data. We also demonstrate an annotation tool that minimizes the domain expert time and manual effort required to generate such a training dataset. Further, we plan to release the annotated dataset as well as the pre-trained model to the community to further research in medical health records.
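To make the reported comparison concrete, the following is a minimal sketch of how an entity-level F1 score can be computed and how the gap between the two reported scores (0.734 and 0.704) can be expressed. It assumes exact-match scoring on (start, end, label) spans, which is a common convention but is not confirmed by the abstract as the authors' evaluation protocol. The function name `entity_f1`, the labels, and the toy spans are hypothetical and are not taken from the paper's dataset.

```python
from typing import Set, Tuple

# A predicted or gold entity: (start_char, end_char, label).
Span = Tuple[int, int, str]


def entity_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching.

    Assumption: a prediction counts as correct only if its boundaries and
    its label both match a gold entity exactly.
    """
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example with hypothetical labels (not from the paper's data).
gold = {(0, 9, "DRUG"), (17, 25, "SYMPTOM"), (40, 48, "DOSAGE")}
pred = {(0, 9, "DRUG"), (17, 25, "DRUG"), (40, 48, "DOSAGE")}
precision, recall, f1 = entity_f1(gold, pred)

# Scores reported in the abstract.
f1_pretrained_half_data = 0.734  # proposed approach, 50% of training data
f1_blank_full_data = 0.704       # blank spaCy model, 100% of training data

absolute_gain = f1_pretrained_half_data - f1_blank_full_data
relative_gain = absolute_gain / f1_blank_full_data

print(f"toy precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"absolute gain={absolute_gain:.3f} relative gain={relative_gain:.1%}")
```

In the toy example, the second prediction has the right boundaries but the wrong label, so it counts as both a false positive and a false negative under exact matching. The final lines only restate the abstract's two scores as an absolute and a relative difference; they do not reproduce the paper's experiments.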

