CL · CY · Mar 7, 2018

Towards the Creation of a Large Corpus of Synthetically-Identified Clinical Notes

arXiv:1803.02728v1 · 3 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for accessible data to train NLP models for medical research, but the contribution is incremental because it builds on existing de-identification methods.

The paper tackles the problem of developing de-identification tools for clinical notes by creating a large synthetically-identified corpus with PHI annotations, and it evaluates one tool on this corpus to assess its effectiveness.

Clinical notes often describe the most important aspects of a patient's physiology and are therefore critical to medical research. However, these notes are typically inaccessible to researchers without prior removal of sensitive protected health information (PHI), a natural language processing (NLP) task referred to as de-identification. Tools to automatically de-identify clinical notes are needed but are difficult to create without access to those very same notes containing PHI. This work presents a first step toward creating a large synthetically-identified corpus of clinical notes and corresponding PHI annotations in order to facilitate the development of de-identification tools. Further, one such tool is evaluated against this corpus in order to understand the advantages and shortcomings of this approach.
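The abstract does not spell out how the corpus is built, so the following is only a minimal sketch of one way synthetic identification could work, not the authors' method. It assumes already de-identified notes that mark removed PHI with typed placeholders, fills each placeholder with a surrogate value, and records character offsets as gold annotations. The placeholder syntax, the surrogate lists, and the names `synthetically_identify` and `precision_recall` are all hypothetical.

```python
import random
import re

# Hypothetical placeholder syntax and surrogate values, for illustration only.
PLACEHOLDER = re.compile(r"\[\*\*(NAME|DATE|HOSPITAL)\*\*\]")
SURROGATES = {
    "NAME": ["Jane Doe", "John Smith"],
    "DATE": ["2015-03-14", "2016-11-02"],
    "HOSPITAL": ["General Hospital", "Riverside Clinic"],
}


def synthetically_identify(note, rng):
    """Fill each placeholder with a surrogate and record its character span."""
    pieces = []       # fragments of the synthetically-identified note
    annotations = []  # (start, end, phi_type) offsets into the new text
    cursor = 0        # read position in the original note
    length = 0        # number of characters emitted so far
    for match in PLACEHOLDER.finditer(note):
        before = note[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        phi_type = match.group(1)
        surrogate = rng.choice(SURROGATES[phi_type])
        pieces.append(surrogate)
        annotations.append((length, length + len(surrogate), phi_type))
        length += len(surrogate)
        cursor = match.end()
    pieces.append(note[cursor:])
    return "".join(pieces), annotations


def precision_recall(gold, predicted):
    """Exact-match span precision and recall for a de-identification tool."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


note = "Pt [**NAME**] seen at [**HOSPITAL**] on [**DATE**]."
text, gold = synthetically_identify(note, random.Random(0))
```

Because the gold spans are generated together with the text, a tool's predicted spans can be scored against them with `precision_recall` without anyone handling real PHI. Exact span matching is a simplifying assumption here; token-level or partial-overlap scoring would be equally plausible.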

