Analyzing Effects of Fake Training Data on the Performance of Deep Learning Systems
This work addresses data scarcity and robustness issues in computer vision, but its contribution is incremental, since it builds on existing GAN-based synthetic data methods.
The paper analyzes how mixing GAN-generated synthetic data with real data affects deep learning models. It finds that specific proportions can improve both robustness to out-of-distribution data and prediction quality, though concrete numbers were not provided.
Deep learning models frequently suffer from problems such as class imbalance and a lack of robustness to distribution shift. Suitable training data beyond the available benchmarks is often difficult to find, especially for computer vision models. With the advent of Generative Adversarial Networks (GANs), however, it is now possible to generate high-quality synthetic data, which can alleviate some of these challenges. In this work we present a detailed analysis of the effect of training computer vision models on different proportions of synthetic data mixed with real (organic) data. Specifically, we analyze how varying the quantity of synthetic data mixed with the original data affects a model's robustness to out-of-distribution data and the general quality of its predictions.
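To make the idea of "different proportions of synthetic data" concrete, the following is a minimal sketch of how a mixed training set could be assembled. It is not the authors' procedure: the function name `mix_real_and_synthetic`, the parameter `synthetic_fraction`, and the choice to keep every real example while adding just enough synthetic examples to reach the requested share are all assumptions made for illustration. The abstract does not say whether the total training set size was held fixed.

```python
import random


def mix_real_and_synthetic(real, synthetic, synthetic_fraction, seed=0):
    """Build a shuffled training list with a target share of synthetic items.

    Hypothetical helper: keeps all real items and adds synthetic ones so that
    n_syn / (n_real + n_syn) is approximately `synthetic_fraction`.
    Each returned element is a pair (item, is_synthetic).
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")

    # Solve n_syn / (n_real + n_syn) = f for n_syn, then round to an integer.
    n_syn = round(synthetic_fraction * len(real) / (1.0 - synthetic_fraction))
    if n_syn > len(synthetic):
        raise ValueError("not enough synthetic samples for the requested fraction")

    rng = random.Random(seed)  # local generator, so results are reproducible
    chosen = rng.sample(synthetic, n_syn)  # sample without replacement

    mixed = [(x, False) for x in real] + [(x, True) for x in chosen]
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Placeholder identifiers standing in for real and GAN-generated images.
    real = [f"real_{i}" for i in range(80)]
    synthetic = [f"gan_{i}" for i in range(200)]

    for fraction in (0.0, 0.2, 0.5):
        mixed = mix_real_and_synthetic(real, synthetic, fraction)
        n_syn = sum(is_syn for _, is_syn in mixed)
        print(f"fraction={fraction}: total={len(mixed)}, synthetic={n_syn}")
```

Under this convention, a sweep over `synthetic_fraction` would yield one training set per mixing level, and a model trained on each could then be compared on in-distribution and out-of-distribution test sets. Holding the total size fixed and replacing real items with synthetic ones is an equally plausible design and would isolate the effect of the mix from the effect of having more data.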