CL LG · Dec 2, 2020

Extracting COVID-19 Diagnoses and Symptoms From Clinical Text: A New Annotated Corpus and Neural Event Extraction Framework

Kevin Lybarger, Mari Ostendorf, Matthew Thompson, Meliha Yetisgen

arXiv:2012.00974v2 · 1.33 citations

Originality: Incremental advance

AI Analysis

This work provides a new corpus and method for automatically extracting COVID-19-related information from clinical text. Such extraction supports large-scale studies that track the pandemic, characterize symptomology, and predict infection severity, all of which matter to public health researchers and clinicians.

This paper introduces the COVID-19 Annotated Clinical Text (CACT) Corpus, a new dataset of 1,472 clinical notes with detailed annotations for COVID-19 diagnoses, testing, and symptoms. The authors developed a span-based event extraction model that achieved F1 scores of 0.83-0.97 for event identification and 0.73-0.79 for assertion values. The automatically extracted symptoms improved the prediction of COVID-19 test results when combined with structured patient data.

Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, there are many open questions related to tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83-0.97 F1 for events and 0.73-0.79 F1 for assertions). In a secondary use application, we explored the prediction of COVID-19 test results using structured patient data (e.g. vital signs and laboratory results) and automatically extracted symptom information. The automatically extracted symptoms improve prediction performance, beyond structured data alone.
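The F1 figures quoted above combine precision and recall over extracted events. As a rough illustration only, the sketch below computes F1 under exact-match span scoring. The `Span` tuple layout, the `span_f1` helper, and the toy gold and predicted sets are all hypothetical, and exact matching is an assumption; the paper's actual scoring criteria are not stated in this abstract.

```python
from typing import Set, Tuple

# Hypothetical span representation: (note_id, event_type, start, end).
Span = Tuple[str, str, int, int]


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting a prediction as correct
    only if it matches a gold span exactly (an assumed criterion)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data, invented for illustration; not drawn from the CACT corpus.
gold = {
    ("note1", "Symptom", 10, 15),
    ("note1", "COVID", 30, 38),
    ("note2", "Symptom", 4, 9),
}
predicted = {
    ("note1", "Symptom", 10, 15),
    ("note1", "COVID", 30, 38),
    ("note2", "Symptom", 5, 9),    # boundary error: counted as a miss
    ("note2", "Symptom", 20, 25),  # spurious prediction
}

precision, recall, f1 = span_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest common choice; a partial-overlap criterion would credit the boundary error above and yield a higher score on the same predictions.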
