Using Synthetic Data to Train Neural Networks is Model-Based Reasoning
This work addresses how to ensure that neural networks trained on synthetic data generalize well, a concern for applications such as security and AI robustness.
The paper formally connects training neural networks on synthetic data to approximate Bayesian model-based reasoning, and it reports state-of-the-art performance on a Captcha-breaking task, including successfully breaking real-world Captchas from Facebook and Wikipedia.
We draw a formal connection between using synthetic training data to optimize neural network parameters and approximate, Bayesian, model-based reasoning. In particular, training a neural network using synthetic data can be viewed as learning a proposal distribution generator for approximate inference in the synthetic-data generative model. We demonstrate this connection in a recognition task: we develop a novel Captcha-breaking architecture and train it using synthetic data, obtaining both state-of-the-art performance and a way of computing task-specific posterior uncertainty. Using a neural network trained this way, we also break real-world Captchas currently used by Facebook and Wikipedia. Reasoning from these empirical results and drawing connections with Bayesian modeling, we discuss the robustness of synthetic data results and suggest important considerations for ensuring good neural network generalization when training with synthetic data.
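The connection stated above can be illustrated with a minimal sketch. It is not the paper's Captcha model or architecture: the toy generative model (a uniform prior over three latent classes with a Gaussian observation), the softmax-regression "network", and every name below (`MEANS`, `SIGMA`, `sample_synthetic`, `train_proposal`, `importance_posterior`) are assumptions made purely for illustration. The sketch trains the network only on samples drawn from the generative model by minimizing cross-entropy, then reuses the trained network as a proposal distribution for importance sampling of the posterior given a new observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy generative model (an assumption, not the paper's model):
#   latent class x ~ Uniform{0, 1, 2}
#   observation  y | x ~ Normal(MEANS[x], SIGMA)
MEANS = np.array([-2.0, 0.0, 2.0])
SIGMA = 1.0


def sample_synthetic(n):
    """Draw n (latent, observation) pairs from the generative model."""
    x = rng.integers(0, 3, size=n)
    y = MEANS[x] + SIGMA * rng.normal(size=n)
    return x, y


def features(y):
    """Bias term plus the raw observation, shape (n, 2)."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    return np.stack([np.ones_like(y), y], axis=1)


def proposal(W, y):
    """Softmax-regression proposal q(x | y), shape (n, 3)."""
    logits = features(y) @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def train_proposal(steps=2000, batch=256, lr=0.1):
    """Minimize cross-entropy on fresh synthetic batches with plain SGD."""
    W = np.zeros((2, 3))
    for _ in range(steps):
        x, y = sample_synthetic(batch)
        q = proposal(W, y)
        # Gradient of the mean negative log-likelihood -log q(x | y).
        grad = features(y).T @ (q - np.eye(3)[x]) / batch
        W -= lr * grad
    return W


def exact_posterior(y):
    """Closed-form p(x | y) for the toy model (uniform prior cancels)."""
    lik = np.exp(-0.5 * ((y - MEANS) / SIGMA) ** 2)
    return lik / lik.sum()


def importance_posterior(W, y, n=20000):
    """Self-normalized importance sampling using the trained proposal."""
    q = proposal(W, y)[0]
    x = rng.choice(3, size=n, p=q)
    lik = np.exp(-0.5 * ((y - MEANS[x]) / SIGMA) ** 2)
    weights = (1.0 / 3.0) * lik / q[x]  # prior * likelihood / proposal
    est = np.bincount(x, weights=weights, minlength=3)
    return est / est.sum()
```

Calling `W = train_proposal()` followed by `importance_posterior(W, 0.7)` gives an estimate that can be compared against `exact_posterior(0.7)`. In this toy setting the posterior is available in closed form, so the comparison is only a sanity check; the point of the sketch is that the training loop never sees a real observation, yet the trained network serves as a proposal for inference in the generative model that produced its training data.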