LG · Feb 3, 2023

Leveraging Contaminated Datasets to Learn Clean-Data Distribution with Purified Generative Adversarial Networks

arXiv:2302.01722v15.33 citations · h-index: 16 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses a practical issue for machine learning practitioners by enabling more robust generative modeling in real-world scenarios where training data may be noisy or contaminated, though it is incremental as it builds upon existing GAN frameworks.

The paper tackles the problem of learning the desired data distribution from contaminated datasets. It proposes Purified Generative Adversarial Networks (PuriGAN), which use an extra dataset of contamination instances to strengthen the discriminator. The authors report better image generation than baselines, as well as stronger results on downstream tasks such as anomaly detection and PU-learning.

Generative adversarial networks (GANs) are known for their strong ability to capture the underlying distribution of training instances. Since the seminal work on GANs, many variants have been proposed. However, almost all existing GANs assume that the training dataset is clean. In many real-world applications this assumption does not hold: the training dataset may be contaminated by a proportion of undesired instances. When trained on such datasets, existing GANs learn a mixture distribution of desired and contaminated instances rather than the distribution of the desired instances alone (the target distribution). To learn the target distribution from contaminated datasets, two purified generative adversarial networks (PuriGAN) are developed, in which the discriminators are augmented with the capability to distinguish between target and contaminated instances by leveraging an extra dataset composed solely of contamination instances. We prove that, under some mild conditions, the proposed PuriGANs are guaranteed to converge to the distribution of desired instances. Experimental results on several datasets demonstrate that the proposed PuriGANs generate much better images from the desired distribution than comparable baselines when trained on contaminated datasets. In addition, we demonstrate the usefulness of PuriGAN in downstream applications by applying it to semi-supervised anomaly detection on contaminated datasets and to PU-learning. Experimental results show that PuriGAN delivers the best performance among comparable baselines on both tasks.
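
The idea of augmenting the discriminator with a contamination-only dataset can be illustrated with a minimal sketch. This is not the paper's objective, which the abstract does not state. It assumes one simple way to use the extra dataset: label contamination instances as "reject" alongside generator samples. The function name `purified_discriminator_loss` and the `contam_weight` parameter are hypothetical.

```python
import math


def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def purified_discriminator_loss(d_train, d_contam, d_fake, contam_weight=1.0):
    """Hypothetical discriminator loss (an assumption, not the paper's formula).

    d_train:  discriminator outputs in (0, 1) on the contaminated training set
    d_contam: outputs on the extra contamination-only dataset
    d_fake:   outputs on generator samples
    """
    # Training instances are pushed toward label 1 ("accept").
    real_term = sum(bce(p, 1) for p in d_train) / len(d_train)
    # Known contamination instances are pushed toward label 0 ("reject").
    contam_term = sum(bce(p, 0) for p in d_contam) / len(d_contam)
    # Generator samples are pushed toward label 0, as in a standard GAN.
    fake_term = sum(bce(p, 0) for p in d_fake) / len(d_fake)
    return real_term + contam_weight * contam_term + fake_term


# A discriminator that rejects contamination scores a lower loss than one
# that accepts it, all else being equal.
rejects = purified_discriminator_loss([0.9, 0.8], [0.1, 0.2], [0.1, 0.1])
accepts = purified_discriminator_loss([0.9, 0.8], [0.9, 0.8], [0.1, 0.1])
```

With `contam_weight=0` the sketch reduces to the usual real-versus-generated discriminator loss, which is the setting in which a GAN would fit the mixture of desired and contaminated instances.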
