CL · Sep 3, 2018

emrQA: A Large Corpus for Question Answering on Electronic Medical Records

arXiv:1809.00732v11146 citations
Originality: Incremental advance
AI Analysis

emrQA provides a domain-specific resource for QA in healthcare, addressing a bottleneck for researchers working with clinical notes. The advance is incremental because the corpus re-purposes existing annotations rather than collecting new ones.

The authors tackled the lack of large-scale question answering datasets for electronic medical records by generating emrQA, a corpus with 1 million question-logical form and 400,000+ question-answer evidence pairs, using existing expert annotations from i2b2 datasets.

We propose a novel methodology to generate domain-specific large-scale question answering (QA) datasets by re-purposing existing annotations for other NLP tasks. We demonstrate an instance of this methodology in generating a large-scale QA dataset for electronic medical records by leveraging existing expert annotations on clinical notes for various NLP tasks from the community shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form and 400,000+ question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question to logical form and question to answer mapping.
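The idea of re-purposing existing annotations can be illustrated with a minimal sketch. This is not the authors' code: the `Annotation` and `Template` classes, their field names, the template strings, and the sample note text are all hypothetical, chosen only to show how one annotated fact could yield a question, a logical form, and an answer with its evidence.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One hypothetical expert annotation on a clinical note."""
    entity_type: str  # e.g. "medication"
    entity: str       # e.g. "aspirin"
    attribute: str    # e.g. "dosage"
    value: str        # the annotated answer
    evidence: str     # note text that supports the answer


@dataclass
class Template:
    """A hypothetical question template paired with a logical form."""
    entity_type: str
    attribute: str
    question: str      # contains an {entity} placeholder
    logical_form: str  # contains an {entity} placeholder


def generate_pairs(annotations, templates):
    """Fill every matching template with every annotation."""
    pairs = []
    for ann in annotations:
        for tpl in templates:
            # A template applies only to annotations of the same kind.
            if (tpl.entity_type, tpl.attribute) != (ann.entity_type, ann.attribute):
                continue
            pairs.append({
                "question": tpl.question.format(entity=ann.entity),
                "logical_form": tpl.logical_form.format(entity=ann.entity),
                "answer": ann.value,
                "evidence": ann.evidence,
            })
    return pairs


# Illustrative data only; not taken from the i2b2 datasets.
annotations = [
    Annotation("medication", "aspirin", "dosage", "81 mg",
               "Patient takes aspirin 81 mg daily."),
]
templates = [
    Template("medication", "dosage",
             "What is the dosage of {entity}?",
             "MedicationEvent({entity}) [dosage=x]"),
    Template("problem", "onset",
             "When did {entity} start?",
             "ProblemEvent({entity}) [onset=x]"),
]

pairs = generate_pairs(annotations, templates)
for pair in pairs:
    print(pair["question"], "|", pair["logical_form"], "|", pair["answer"])
```

Because each template is filled once per matching annotation, a small set of templates applied to a large annotated corpus can produce many question-logical form and question-answer pairs, which is the scaling effect the abstract describes.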

Code Implementations: 3 repos
