CV LG Aug 26, 2024

Exploring the Potential of Synthetic Data to Replace Real Data

Hyungtae Lee, Yan Zhang, Heesung Kwon, Shuvra S. Bhattacharyya

arXiv:2408.14559v1 7.66 citations h-index: 16

Originality Synthesis-oriented

AI Analysis

This work addresses the data scarcity problem in AI by exploring whether synthetic data can stand in for real data. Its contribution is incremental: it builds on existing methods, adding new metrics and insights.

The paper investigates how far synthetic data can replace real data in AI training, particularly when combined with a small set of cross-domain real images. It finds that the effectiveness of synthetic data depends on the number of cross-domain real images and on the test set used for evaluation, and it introduces two new metrics to analyze these dynamics.

The potential of synthetic data to replace real data creates a huge demand for synthetic data in data-hungry AI. This potential is even greater when synthetic data is used for training along with a small number of real images from domains other than the test domain. We find that this potential varies depending on (i) the number of cross-domain real images and (ii) the test set on which the trained model is evaluated. We introduce two new metrics, the train2test distance and $\text{AP}_\text{t2t}$, to evaluate the ability of a cross-domain training set using synthetic data to represent the characteristics of test instances in relation to training performance. Using these metrics, we delve deeper into the factors that influence the potential of synthetic data and uncover some interesting dynamics about how synthetic data impacts training performance. We hope these discoveries will encourage more widespread use of synthetic data.
