Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs
The paper highlights a critical privacy flaw in healthcare data protection that puts patient-provider trust at risk. Its contribution is incremental, however, since it builds on existing critiques of de-identification methods.
The paper critiques HIPAA Safe Harbor de-identification for clinical notes, showing that modern LLMs can re-identify patients from scrubbed data by exploiting latent correlations, such as predicting neighborhoods from diagnoses alone.
Privacy is a human right that sustains patient-provider trust. Clinical notes, which are used for care coordination and research, capture a patient's private vulnerability and individuality. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data: it focuses on removing explicit identifiers while ignoring the latent information carried by correlations between identity and quasi-identifiers, which modern LLMs can capture. We first formalize these correlations using a causal graph, then validate the graph empirically by re-identifying individual patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood from the diagnosis alone. This position paper asks how we, as a community, can uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.
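The idea behind the diagnosis ablation can be illustrated with a minimal sketch. It is not the paper's method: the paper uses LLMs on scrubbed clinical notes, whereas this sketch swaps in a simple frequency table, and the records, diagnosis labels, neighborhood names, and function names are all hypothetical and invented for illustration. It only shows how a quasi-identifier that Safe Harbor leaves in place (the diagnosis) can predict location better than a diagnosis-blind guess when the two are correlated.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (diagnosis, neighborhood). Invented for illustration;
# these are not data or results from the paper.
TOY_RECORDS = [
    ("asthma", "north"), ("asthma", "north"), ("asthma", "south"),
    ("lead_poisoning", "east"), ("lead_poisoning", "east"),
    ("melanoma", "south"), ("melanoma", "south"), ("melanoma", "north"),
]

def fit_diagnosis_only(records):
    """Map each diagnosis to the neighborhood it co-occurs with most often."""
    counts = defaultdict(Counter)
    for diagnosis, neighborhood in records:
        counts[diagnosis][neighborhood] += 1
    return {d: c.most_common(1)[0][0] for d, c in counts.items()}

def majority_neighborhood(records):
    """Diagnosis-blind baseline: always guess the most frequent neighborhood."""
    return Counter(n for _, n in records).most_common(1)[0][0]

def accuracy(predict, records):
    """Fraction of records whose neighborhood is predicted correctly."""
    hits = sum(1 for d, n in records if predict(d) == n)
    return hits / len(records)

table = fit_diagnosis_only(TOY_RECORDS)
baseline = majority_neighborhood(TOY_RECORDS)

# Both scores are computed in-sample on the toy records, so they only
# illustrate the gap between the two predictors, not real-world performance.
diagnosis_acc = accuracy(lambda d: table[d], TOY_RECORDS)
baseline_acc = accuracy(lambda d: baseline, TOY_RECORDS)
print(f"diagnosis-only: {diagnosis_acc:.3f}, baseline: {baseline_acc:.3f}")
```

On these invented records the diagnosis-only predictor beats the diagnosis-blind baseline, which is the pattern the ablation looks for. A real evaluation would need held-out data and an actual clinical corpus.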