Toward Understanding Generative Data Augmentation
This work addresses a theoretical gap for researchers in machine learning, providing insights into when generative data augmentation is beneficial, though it is incremental as it builds on existing methods.
The paper tackles the lack of theoretical understanding of generative data augmentation by establishing a stability bound in a non-i.i.d. setting. The bound shows that augmentation can improve learning guarantees at a constant level when the training set is small, and simulation and empirical results support these conclusions.
Generative data augmentation, which scales datasets by obtaining fake labeled examples from a trained conditional generative model, boosts classification performance in various learning tasks, including (semi-)supervised learning, few-shot learning, and adversarially robust learning. However, little work has theoretically investigated the effect of generative data augmentation. To fill this gap, we establish a general stability bound in this non-independent and identically distributed (non-i.i.d.) setting, where the learned distribution depends on the original training set and is generally not the same as the true distribution. Our theoretical result includes the divergence between the learned distribution and the true distribution. It shows that generative data augmentation can enjoy a faster learning rate when the order of the divergence term is $o\left(\max\left(\log(m)\beta_m, 1/\sqrt{m}\right)\right)$, where $m$ is the training set size and $\beta_m$ is the corresponding stability constant. We further specialize the learning setup to the Gaussian mixture model and generative adversarial nets. We prove that in both cases, though generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the training set is small, which is significant when severe overfitting occurs. Simulation results on the Gaussian mixture model and empirical results on generative adversarial nets support our theoretical conclusions. Our code is available at https://github.com/ML-GSAI/Understanding-GDA.
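To make the augmentation pipeline described above concrete, the following is a minimal sketch of a Gaussian mixture simulation. It is not the code from the linked repository. The binary mixture with means $\pm\mu$, the known isotropic noise level `sigma`, the mean-based linear classifier, and all function names and default parameters are assumptions chosen for illustration only.

```python
import numpy as np


def sample_gmm(rng, n, mu, sigma):
    """Draw n labeled points: y uniform in {-1, +1}, x | y ~ N(y * mu, sigma^2 I)."""
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + sigma * rng.standard_normal((n, mu.shape[0]))
    return x, y


def fit_generator(x, y):
    """Hypothetical conditional generator: estimate the class mean direction.

    sigma is assumed known, so only the mean is learned from the training set.
    """
    return (y[:, None] * x).mean(axis=0)


def fit_linear_classifier(x, y):
    """Assumed learner: a linear classifier whose weights are the mean of y * x."""
    return (y[:, None] * x).mean(axis=0)


def accuracy(theta, x, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(x @ theta) == y))


def run_experiment(m=10, n_fake=100, d=50, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.ones(d) / np.sqrt(d)  # hypothetical true mean with unit norm

    # Real training set of size m and a large held-out test set.
    x_train, y_train = sample_gmm(rng, m, mu, sigma)
    x_test, y_test = sample_gmm(rng, 5000, mu, sigma)

    # The learned distribution depends on the training set (the non-i.i.d. part).
    mu_hat = fit_generator(x_train, y_train)
    x_fake, y_fake = sample_gmm(rng, n_fake, mu_hat, sigma)

    # Augmented set = real examples plus fake labeled examples.
    x_aug = np.concatenate([x_train, x_fake])
    y_aug = np.concatenate([y_train, y_fake])

    theta_base = fit_linear_classifier(x_train, y_train)
    theta_aug = fit_linear_classifier(x_aug, y_aug)
    return {
        "baseline": accuracy(theta_base, x_test, y_test),
        "augmented": accuracy(theta_aug, x_test, y_test),
    }


if __name__ == "__main__":
    print(run_experiment())
```

Varying `m` and `n_fake` gives a rough way to probe the small-training-set regime the abstract refers to. The sketch makes no claim about which configuration performs better, and its learner and generator may differ from those analyzed in the paper.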