Deduplication in a massive clinical note dataset
This work addresses data quality issues in healthcare analytics by enabling efficient deduplication of large-scale clinical datasets. The contribution is incremental, since it builds on existing hashing and clustering techniques.
The authors tackle the problem of detecting and correcting duplicates in massive clinical note datasets and present a scalable solution, based on Minhashing with Locality Sensitive Hashing, that handles more than 10 million notes.
Duplication, whether exact or partial, is a common issue in many datasets. In clinical notes data, duplication (and near duplication) can arise for many reasons, such as the pervasive use of templates, copy-pasting, or notes being generated by automated procedures. A key challenge in removing such near duplicates is the size of these datasets; our own dataset consists of more than 10 million notes. Detecting and correcting such duplicates requires algorithms that are both accurate and highly scalable. We describe a solution based on Minhashing with Locality Sensitive Hashing. In this paper, we present the theory behind this method and describe a database-inspired approach that makes the method scalable. We also present a clustering technique using disjoint sets to produce dense clusters, which speeds up our algorithm.
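The pipeline named in the abstract (Minhash signatures, LSH banding, and disjoint-set clustering) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the names `shingles`, `minhash_signature`, `DisjointSet`, and `cluster_near_duplicates` are hypothetical, as are the parameters (5-character shingles, 64 hash functions, 16 bands, a 0.8 similarity threshold). A Python dictionary stands in for the database-inspired bucketing that the paper describes.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # signature length (assumed value)
BANDS = 16       # number of LSH bands (assumed value); 4 rows per band
SHINGLE = 5      # character shingle length (assumed value)


def shingles(text, k=SHINGLE):
    """Return the set of k-character shingles of case/whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, token):
    """64-bit hash of a token; the seed selects one of NUM_HASHES hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """Minhash signature: the minimum hash value per hash function over all shingles."""
    tokens = shingles(text)
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_HASHES))


class DisjointSet:
    """Union-find with path halving; the smaller index becomes the root."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[max(ra, rb)] = min(ra, rb)


def cluster_near_duplicates(notes, threshold=0.8):
    """Group note indices whose estimated Jaccard similarity is >= threshold."""
    sigs = [minhash_signature(n) for n in notes]
    rows = NUM_HASHES // BANDS

    # LSH: notes that agree on every row of at least one band share a bucket.
    buckets = defaultdict(list)
    for idx, sig in enumerate(sigs):
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(idx)

    ds = DisjointSet(len(notes))
    for members in buckets.values():
        for i, j in combinations(members, 2):
            if ds.find(i) == ds.find(j):
                continue  # already in the same cluster, skip the comparison
            # Fraction of matching signature positions estimates Jaccard similarity.
            est = sum(x == y for x, y in zip(sigs[i], sigs[j])) / NUM_HASHES
            if est >= threshold:
                ds.union(i, j)

    clusters = defaultdict(list)
    for idx in range(len(notes)):
        clusters[ds.find(idx)].append(idx)
    return sorted(clusters.values())
```

Calling `cluster_near_duplicates` on a list of note strings returns lists of indices, one list per cluster. Two notes that differ only in case or whitespace receive identical signatures and therefore always land in the same cluster. Skipping pairs that are already connected in the disjoint set is one plausible reading of how clustering can reduce the number of comparisons; the paper's exact mechanism may differ.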