Kahyun Lee

CL
3 papers
13 citations
Novelty 33%
AI Score 18

3 Papers

CL · Feb 17, 2021
Performance of Automatic De-identification Across Different Note Types

Nicholas Dobbins, David Wayne, Kahyun Lee et al.

Free-text clinical notes detail all aspects of patient care and have great potential to facilitate quality improvement and assurance initiatives as well as advance clinical research. However, concerns about patient privacy and confidentiality limit the use of clinical notes for research. As a result, the information documented in these notes remains unavailable to most researchers. De-identification (de-id), i.e., locating and removing personally identifying protected health information (PHI), is one way of improving access to clinical narratives. However, few off-the-shelf de-identification systems are able to consistently detect PHI across different data sources and medical specialties. In this abstract, we present the performance of a state-of-the-art de-id system called NeuroNER on a diverse set of notes from the University of Washington (UW) when the models are trained on data from an external institution (Partners Healthcare) vs. from the same institution (UW). We present results at the level of PHI and note types.

CL · Feb 17, 2021
Transferability of Neural Network Clinical De-identification Systems

Kahyun Lee, Nicholas J. Dobbins, Bridget McInnes et al.

Objective: Neural network de-identification studies have focused on individual datasets. These studies assume the availability of a sufficient amount of human-annotated data to train models that can generalize to corresponding test data. In real-world situations, however, researchers often have limited or no in-house training data. Existing systems and external data can help jump-start de-identification on in-house data; however, the most efficient way of utilizing existing systems and external data is unclear. This article investigates the transferability of a state-of-the-art neural clinical de-identification system, NeuroNER, across a variety of datasets, when it is modified architecturally for domain generalization and when it is trained strategically for domain transfer. Methods and Materials: We conducted a comparative study of the transferability of NeuroNER using four clinical note corpora with multiple note types from two institutions. We modified NeuroNER architecturally to integrate two types of domain generalization approaches. We evaluated each architecture using three training strategies. We measured: transferability from external sources; transferability across note types; the contribution of external source data when in-domain training data are available; and transferability across institutions. Results and Conclusions: Transferability from a single external source gave inconsistent results. Using additional external sources consistently yielded an F1-score of approximately 80%. Fine-tuning emerged as a dominant transfer strategy, with or without domain generalization. We also found that external sources were useful even in cases where in-domain training data were available. Transferability across institutions differed by note type and annotation label but resulted in improved performance.

CL · Feb 17, 2021
A Context-Enhanced De-identification System

Kahyun Lee, Mehmet Kayaalp, Sam Henry et al.

Many modern entity recognition systems, including the current state-of-the-art de-identification systems, are based on bidirectional long short-term memory (biLSTM) units augmented by a conditional random field (CRF) sequence optimizer. These systems process the input sentence by sentence. This approach prevents the systems from capturing dependencies across sentence boundaries and makes accurate sentence boundary detection a prerequisite. Since sentence boundary detection can be problematic, especially in clinical reports, where dependencies and co-references across sentence boundaries are abundant, these systems have clear limitations. In this study, we built a new system on the framework of one of the current state-of-the-art de-identification systems, NeuroNER, to overcome these limitations. This new system incorporates context embeddings through forward and backward n-grams without using sentence boundaries. Our context-enhanced de-identification (CEDI) system captures dependencies across sentence boundaries and bypasses the sentence boundary detection problem altogether. We enhanced this system with deep affix features and an attention mechanism to capture the pertinent parts of the input. The CEDI system outperforms NeuroNER on the 2006 i2b2 de-identification challenge dataset, the 2014 i2b2 shared task de-identification dataset, and the 2016 CEGS N-GRID de-identification dataset (p<0.01). All datasets comprise narrative clinical reports in English but contain different note types varying from discharge summaries to psychiatric notes. Enhancing CEDI with deep affix features and the attention mechanism further increased performance.
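
The idea of forward and backward n-gram context that ignores sentence boundaries can be illustrated with a minimal sketch. This is not the CEDI implementation: the abstract does not say how the n-grams are built or embedded, so the function name `context_windows`, the padding token, and the window size are all assumptions made for illustration. The sketch only shows that windows taken over the whole token stream cross sentence boundaries without any sentence splitter.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token for positions outside the document


def context_windows(
    tokens: List[str], n: int = 3
) -> List[Tuple[List[str], str, List[str]]]:
    """For each token, return (backward n-gram, token, forward n-gram).

    Windows are sliced from the full token stream, so they cross
    sentence boundaries freely; no sentence splitting is performed.
    """
    padded = [PAD] * n + tokens + [PAD] * n
    windows = []
    for i, tok in enumerate(tokens):
        j = i + n  # position of tok inside the padded list
        backward = padded[j - n:j]          # the n tokens before tok
        forward = padded[j + 1:j + 1 + n]   # the n tokens after tok
        windows.append((backward, tok, forward))
    return windows


# Toy note: the second "Smith" starts a new sentence, yet its backward
# window still reaches into the previous sentence.
tokens = "Seen by Dr . Smith . Smith recommends rest .".split()
windows = context_windows(tokens, n=2)
```

In a full system each window would presumably be mapped to an embedding and fed to the tagger alongside the token itself; that step is omitted here because the abstract gives no details about it.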