Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance
This paper addresses data imbalance and spurious correlation in machine learning, offering a novel theoretical framework for synthetic data augmentation, though its application of LLMs to a known bottleneck is incremental.
The paper tackles data imbalance in classification by developing a theoretical foundation for synthetic oversampling using large language models (LLMs), quantifying its benefits and deriving scaling laws, with experiments validating improved model performance.
Imbalanced classification and spurious correlation are common challenges in data science and machine learning. Both issues stem from data imbalance, in which certain groups of data samples are significantly underrepresented, which in turn compromises the accuracy, robustness, and generalizability of the learned models. Recent work has proposed leveraging the flexibility and generative capabilities of large language models (LLMs), typically built on transformer architectures, to generate synthetic samples and augment the observed data. In the context of imbalanced data, LLMs are used to oversample underrepresented groups and have shown promising improvements. However, there is a clear lack of theoretical understanding of such synthetic data approaches. In this article, we develop novel theoretical foundations to systematically study the role of synthetic samples in addressing imbalanced classification and spurious correlation. Specifically, we first explicitly quantify the benefits of synthetic oversampling. Next, we analyze the scaling dynamics of synthetic data augmentation and derive the corresponding scaling law. Finally, we demonstrate the capacity of transformer models to generate high-quality synthetic samples. We further conduct extensive numerical experiments to validate the efficacy of LLM-based synthetic oversampling and augmentation.
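To make the idea of synthetic oversampling concrete, here is a minimal sketch, not the paper's method. It tops up every underrepresented group with generated samples until all groups match the largest one. The names `oversample_minority` and `jitter_generator` are hypothetical, and the Gaussian-noise generator is only a placeholder standing in for an LLM-based generator, which the abstract does not specify.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, generate, seed=0):
    """Add synthetic samples so every label is as frequent as the largest one.

    `generate(label, examples, rng)` is a hypothetical callback; in an
    LLM-based pipeline it would prompt a model with real examples of the group.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest group

    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Real examples of this group, used as conditioning for the generator.
        examples = [s for s, y in zip(samples, labels) if y == label]
        for _ in range(target - count):
            out_samples.append(generate(label, examples, rng))
            out_labels.append(label)
    return out_samples, out_labels


def jitter_generator(label, examples, rng):
    """Placeholder generator (assumption): a real example plus Gaussian noise."""
    base = rng.choice(examples)
    return [x + rng.gauss(0.0, 0.1) for x in base]
```

Calling `oversample_minority(samples, labels, jitter_generator)` returns the original data followed by the synthetic samples, so the original observations stay untouched and only the minority groups grow.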