A Closer Look on Memorization in Tabular Diffusion Model: A Data-Centric Perspective
This work addresses privacy concerns for users of tabular data generation by taking a data-centric approach to reducing memorization. The contribution is incremental, since it builds on prior dataset-level augmentation methods.
The paper tackles the problem of privacy risks in tabular diffusion models by identifying which individual training samples contribute most to memorization, revealing a heavy-tailed distribution where a small subset causes disproportionate leakage. It proposes DynamicCut, a model-agnostic mitigation method that reduces memorization with minimal impact on data diversity and downstream performance, and shows cross-model transferability to other generative models like GANs and VAEs.
Diffusion models have shown strong performance in generating high-quality tabular data, but they carry privacy risks because they can reproduce exact training samples. While prior work focuses on dataset-level augmentation to reduce memorization, little is known about which individual samples contribute most. We present the first data-centric study of memorization dynamics in tabular diffusion models. We quantify memorization for each real sample by counting how many generated samples are flagged as its replicas, using a relative distance ratio. Our empirical analysis reveals a heavy-tailed distribution of memorization counts: a small subset of samples contributes disproportionately to leakage, which we confirm via sample-removal experiments. To understand this, we divide real samples into top-memorized and non-top-memorized groups and analyze their training-time behaviors. We track when each sample is first memorized and monitor per-epoch memorization intensity (AUC). Top-memorized samples are memorized slightly earlier and show stronger signals in early training. Based on these insights, we propose DynamicCut, a two-stage, model-agnostic mitigation method: first, (a) rank samples by epoch-wise intensity and (b) prune a tunable top fraction; then, (c) retrain on the filtered dataset. Across multiple tabular datasets and models, DynamicCut reduces memorization with minimal impact on data diversity and downstream performance. It also complements augmentation-based defenses. Furthermore, DynamicCut enables cross-model transferability: removing the high-ranked samples identified with one model (e.g., a diffusion model) also reduces memorization in others, such as GANs and VAEs.
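The counting and pruning steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the text only says that replicas are flagged "using a relative distance ratio", so the sketch assumes a generated row is a replica of its nearest real row when the distance to that row is less than one third of the distance to the second-nearest real row. The function names, the Euclidean metric, the `ratio_threshold` default, and the use of raw counts as the ranking score are all hypothetical choices made for the example.

```python
import numpy as np

def memorization_counts(real, generated, ratio_threshold=1 / 3):
    """Count, for each real row, how many generated rows are flagged as its replica.

    Assumption: a generated row is flagged when d(nearest) / d(second nearest)
    falls below ratio_threshold. The source does not give the exact rule.
    """
    # Pairwise Euclidean distances, shape (n_generated, n_real).
    diffs = generated[:, None, :] - real[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the nearest and second-nearest real rows for each generated row.
    order = np.argsort(dists, axis=1)
    nearest, second = order[:, 0], order[:, 1]
    rows = np.arange(len(generated))
    d1, d2 = dists[rows, nearest], dists[rows, second]

    # Compare d1 < threshold * d2 instead of dividing, to avoid division by zero.
    flagged = d1 < ratio_threshold * d2

    # Attribute each flagged generated row to its nearest real row.
    return np.bincount(nearest[flagged], minlength=len(real))

def prune_top_fraction(real, scores, fraction=0.1):
    """Drop the `fraction` of real rows with the highest scores (hypothetical helper)."""
    n_drop = int(np.floor(fraction * len(real)))
    ranked = np.argsort(-scores, kind="stable")  # highest score first
    dropped = ranked[:n_drop]
    keep = np.ones(len(real), dtype=bool)
    keep[dropped] = False
    return real[keep], dropped

# Toy data: three real rows and four generated rows.
real = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]])
generated = np.array([[0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [19.9, 20.0]])

counts = memorization_counts(real, generated)
print(counts)  # [2 0 1]

# Here the raw counts stand in for the ranking score; with per-epoch counts
# one could sum over epochs as a rough proxy for the intensity the text describes.
filtered, dropped = prune_top_fraction(real, counts, fraction=0.34)
print(dropped)  # [0]
```

In this toy run, the first real row attracts two near-copies and is the one pruned. The model would then be retrained on `filtered`, which corresponds to step (c).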