CL May 25, 2021

Estimating Redundancy in Clinical Text

Thomas Searle, Zina Ibrahim, James Teo, Richard JB Dobson

arXiv:2105.11832v21.624 citations Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses data duplication in healthcare documentation, which can propagate errors and lead to misreporting of care. The contribution is incremental, since it applies existing methods to a new domain.

The study quantified information redundancy in clinical text from Electronic Health Records, finding that language models trained on clinical text were roughly 1.5x to 3x less efficient than open-domain models, with manual evaluation showing average redundancy of roughly 43% to 65%.

The current mode of use of Electronic Health Records (EHRs) elicits text redundancy. Clinicians often populate new documents by duplicating existing notes, then updating them accordingly. Data duplication can lead to the propagation of errors, inconsistencies and misreporting of care. Therefore, quantifying information redundancy can play an essential role in evaluating innovations that operate on clinical narratives. This work is a quantitative examination of information redundancy in EHR notes. We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model. We evaluate the measures by training large Transformer-based language models using clinical text from a large openly available US-based ICU dataset and a large multi-site UK-based Trust. Comparing the information-theoretic content of the trained models with that of open-domain language models shows that language models trained on clinical text are ~1.5x to ~3x less efficient than those trained on open-domain corpora. Manual evaluation shows a high correlation with lexicosyntactic and semantic redundancy, with averages of ~43% to ~65%.
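To give a feel for the two kinds of measure the abstract names, here is a minimal sketch in Python. It is not the authors' method: the paper trains Transformer language models, whereas this sketch substitutes a simple empirical unigram entropy for the information-theoretic side and trigram overlap between consecutive notes for the lexical side. The function names `unigram_entropy` and `ngram_overlap` and the two example notes are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def unigram_entropy(tokens):
    """Empirical Shannon entropy, in bits per token, of a token sequence."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_overlap(previous, current, n=3):
    """Fraction of n-grams in `current` that already appear in `previous`."""
    def ngrams(tokens):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    seen = set(ngrams(previous))
    grams = ngrams(current)
    if not grams:
        return 0.0
    return sum(g in seen for g in grams) / len(grams)


# Hypothetical consecutive notes: the second is a copy with one word changed.
note_1 = "patient stable overnight no acute events plan continue antibiotics".split()
note_2 = "patient stable overnight no acute events plan stop antibiotics".split()

# Share of the second note's trigrams that were copied from the first.
copied_fraction = ngram_overlap(note_1, note_2)
```

In this toy case five of the seven trigrams in the second note also occur in the first, so `copied_fraction` is 5/7. A lower entropy per token, or a higher overlap with earlier notes, would both point towards more redundant text under these simplified assumptions.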
