CL · CR · Feb 26, 2024

LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification

arXiv:2402.16515v1 · 18 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses privacy-preserving data augmentation for medical text classification, offering a novel approach to differential privacy in synthetic data generation, though it is incremental in combining existing techniques.

The paper tackles the problem of generating differentially private synthetic text data for medical classification by proposing a method that uses a large language model and a discriminator guided by knowledge distillation, achieving competitive classification performance with a privacy budget of ε=2.0.

As sufficient data are not always publicly accessible for model training, researchers either exploit limited data with advanced learning algorithms or expand the dataset via data augmentation (DA). Conducting DA in private domains requires privacy protection approaches (i.e., anonymization and perturbation), but those methods cannot provide protection guarantees. Differential privacy (DP) learning methods theoretically bound the protection but are not well suited to generating pseudo text samples with large models. In this paper, we recast the DP-based pseudo-sample generation task as a DP-based generated-sample discrimination task, and propose a DP-based DA method with an LLM and a DP-based discriminator for text classification on private domains. We construct a knowledge distillation model as the DP-based discriminator: teacher models, which access private data, teach students how to select private samples with calibrated noise to achieve DP. To constrain the distribution of the DA's generation, we propose a DP-based tutor that models the noised private distribution and controls sample generation at a low privacy cost. We theoretically analyze our model's privacy protection and empirically verify our model.
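The teacher-student selection step described in the abstract can be illustrated with a small sketch. The abstract does not specify the aggregation mechanism, so the code below swaps in a PATE-style noisy vote as an assumption: each teacher (a hypothetical callable standing in for a model trained on private data) votes on whether an LLM-generated candidate resembles the private data, Laplace noise is added to the vote counts, and only candidates whose noisy majority is "accept" are kept. The names `noisy_teacher_label`, `select_samples`, and `noise_scale` are illustrative and do not come from the paper; privacy accounting and the distribution tutor are omitted.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_teacher_label(teacher_votes, num_classes, noise_scale, rng):
    """Return the class with the highest noise-perturbed vote count."""
    counts = [0] * num_classes
    for vote in teacher_votes:
        counts[vote] += 1
    noisy = [c + laplace_noise(noise_scale, rng) for c in counts]
    return max(range(num_classes), key=lambda k: noisy[k])


def select_samples(candidates, teachers, num_classes, noise_scale, rng):
    """Keep generated candidates whose noisy majority vote is class 1 (accept).

    `teachers` are hypothetical callables mapping a text to a class index;
    in practice each would be a model trained on a disjoint private shard.
    """
    kept = []
    for text in candidates:
        votes = [teacher(text) for teacher in teachers]
        if noisy_teacher_label(votes, num_classes, noise_scale, rng) == 1:
            kept.append(text)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in teachers: accept texts that mention a symptom keyword.
    teachers = [lambda s: int("fever" in s)] * 5
    candidates = ["patient reports fever and cough", "the weather is nice"]
    print(select_samples(candidates, teachers, 2, noise_scale=1.0, rng=rng))
```

A larger `noise_scale` hides individual teachers' votes more strongly but makes the selection less reliable, which is the usual privacy-utility trade-off; a student trained only on the selected samples never touches the private data directly.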
