CL CY

Mar 7, 2018

Towards the Creation of a Large Corpus of Synthetically-Identified Clinical Notes

Willie Boag, Tristan Naumann, Peter Szolovits

arXiv:1803.02728v1

0.73 citations

Originality Synthesis-oriented

AI Analysis

This work addresses the need for accessible data to train NLP models for medical research, but it is incremental because it builds on existing de-identification methods.

The paper tackles the problem of developing de-identification tools for clinical notes by creating a large synthetically-identified corpus with PHI annotations, and it evaluates one such tool on this corpus to assess the advantages and shortcomings of the approach.

Clinical notes often describe the most important aspects of a patient's physiology and are therefore critical to medical research. However, these notes are typically inaccessible to researchers without prior removal of sensitive protected health information (PHI), a natural language processing (NLP) task referred to as de-identification. Tools to automatically de-identify clinical notes are needed but are difficult to create without access to those very same notes containing PHI. This work presents a first step toward creating a large synthetically-identified corpus of clinical notes and corresponding PHI annotations in order to facilitate the development of de-identification tools. Further, one such tool is evaluated against this corpus in order to understand the advantages and shortcomings of this approach.
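
To make the idea of a "synthetically-identified" note with PHI annotations concrete, here is a minimal Python sketch. It is not the authors' method: the placeholder syntax (`[**NAME**]`), the surrogate lists, and the function name `synthetically_identify` are all assumptions made for illustration. The sketch fills each placeholder in an already de-identified note with a made-up surrogate value and records the character span of that value, which gives labeled text that a de-identification tool could be trained or evaluated on.

```python
import re

# Hypothetical surrogate values; a real system would draw from much larger lists.
SURROGATES = {
    "NAME": ["Alice Morgan", "Rahul Desai"],
    "DATE": ["2014-03-09", "2015-11-21"],
    "HOSPITAL": ["Lakeside General"],
}

# Hypothetical placeholder syntax, e.g. [**NAME**] or [**DATE**].
PLACEHOLDER = re.compile(r"\[\*\*([A-Z]+)\*\*\]")


def synthetically_identify(text, surrogates=SURROGATES):
    """Fill placeholders with surrogates; return (new_text, annotations).

    Each annotation is (start, end, phi_type), with offsets into new_text.
    """
    pieces = []       # parts of the rewritten note, joined at the end
    annotations = []  # spans of the inserted surrogate values
    cursor = 0        # read position in the input text
    length = 0        # length of the output built so far
    seen = {}         # per-type counter, used to cycle through surrogates

    for match in PLACEHOLDER.finditer(text):
        phi_type = match.group(1)
        options = surrogates.get(phi_type)
        if not options:
            continue  # unknown type: the placeholder is left in the text as-is

        # Copy the untouched text that precedes this placeholder.
        prefix = text[cursor:match.start()]
        pieces.append(prefix)
        length += len(prefix)

        # Pick the next surrogate deterministically (cycling, not random).
        index = seen.get(phi_type, 0)
        value = options[index % len(options)]
        seen[phi_type] = index + 1

        pieces.append(value)
        annotations.append((length, length + len(value), phi_type))
        length += len(value)
        cursor = match.end()

    pieces.append(text[cursor:])
    return "".join(pieces), annotations


if __name__ == "__main__":
    note = "Pt [**NAME**] seen on [**DATE**] at [**HOSPITAL**]."
    new_text, spans = synthetically_identify(note)
    print(new_text)
    for start, end, phi_type in spans:
        print(phi_type, repr(new_text[start:end]))
```

Surrogates are chosen by cycling through each list rather than by random sampling so that the output is reproducible; a realistic pipeline would likely sample from large, varied lists so that a model cannot memorize the surrogate values.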
