Estimating Redundancy in Clinical Text
This work addresses data duplication in healthcare documentation, which can propagate errors and distort the reporting of care. The contribution is incremental, however, because it applies existing methods to a new domain.
The study quantified information redundancy in clinical text from Electronic Health Records. It found that language models trained on clinical text were roughly 1.5x to 3x less efficient than open-domain models, and that lexicosyntactic and semantic redundancy, which correlated highly with manual evaluation, averaged roughly 43% to 65%.
The way Electronic Health Records (EHRs) are currently used encourages text redundancy. Clinicians often populate new documents by duplicating existing notes and then updating them as needed. Data duplication can lead to the propagation of errors, inconsistencies and misreporting of care. Quantifying information redundancy can therefore play an essential role in evaluating innovations that operate on clinical narratives. This work is a quantitative examination of information redundancy in EHR notes. We present and evaluate two strategies to measure redundancy: an information-theoretic approach and a lexicosyntactic and semantic model. We evaluate the measures by training large Transformer-based language models on clinical text from a large, openly available US-based ICU dataset and a large multi-site UK-based Trust. Comparing the information-theoretic content of the trained models with that of open-domain language models shows that language models trained on clinical text are ~1.5x to ~3x less efficient than those trained on open-domain corpora. Manual evaluation shows a high correlation with lexicosyntactic and semantic redundancy, with averages of ~43% to ~65%.
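To make the two families of measures more concrete, the following is a minimal Python sketch under stated assumptions, not the study's implementation. The abstract does not specify the exact metrics, so both functions are hypothetical stand-ins: `unigram_entropy_bits` replaces the Transformer language models with a toy unigram model to illustrate an information-theoretic estimate (bits per token), and `ngram_redundancy` uses plain n-gram overlap between a note and its predecessor as a rough proxy for lexical redundancy. Whitespace tokenization and the choice of trigrams are also assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_entropy_bits(tokens):
    """Shannon entropy, in bits per token, of the empirical unigram distribution.

    Assumption: a toy stand-in for a trained language model. Lower entropy
    means the text is more predictable, i.e. more redundant.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngram_set(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_redundancy(previous_note, new_note, n=3):
    """Fraction of the new note's distinct n-grams already present in the previous note.

    Assumption: whitespace tokenization and exact n-gram matching, a crude
    proxy for copy-forward behaviour rather than a semantic model.
    """
    new = ngram_set(new_note.split(), n)
    if not new:
        return 0.0
    old = ngram_set(previous_note.split(), n)
    return len(new & old) / len(new)


if __name__ == "__main__":
    previous = "patient admitted with chest pain and shortness of breath"
    updated = "patient admitted with chest pain and shortness of breath now resolved"
    print(f"trigram redundancy: {ngram_redundancy(previous, updated):.2f}")
    print(f"unigram entropy: {unigram_entropy_bits(updated.split()):.2f} bits/token")
```

In this sketch a note that was copied forward and lightly edited scores high on `ngram_redundancy`, while a corpus dominated by repeated phrasing scores low on `unigram_entropy_bits`. A fuller treatment would replace the unigram model with a trained language model and compare the resulting per-token estimates across clinical and open-domain corpora.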