CV · GR · LG · ML · Aug 16, 2020

AutoSimulate: (Quickly) Learning Synthetic Data Generation

arXiv:2008.08424v1 · 25 citations
Originality Highly original
AI Analysis

This paper addresses the efficiency bottleneck in synthetic data generation for machine learning, offering a practical solution for researchers and practitioners who train on simulation-based datasets.

The paper tackles the high computational cost of optimizing simulator parameters for synthetic data generation by introducing a differentiable approximation of the objective, enabling faster optimization with fewer evaluations. It demonstrates up to 50x speedup, 30x reduction in training data generation, and an 8.7% accuracy improvement on real-world datasets compared to previous methods.

Simulation is increasingly being used for generating large labelled datasets in many machine learning problems. Recent methods have focused on adjusting simulator parameters with the goal of maximising accuracy on a validation task, usually relying on REINFORCE-like gradient estimators. However, these approaches are very expensive, as they treat the entire data generation, model training, and validation pipeline as a black box and require multiple costly objective evaluations at each iteration. We propose an efficient alternative for optimal synthetic data generation, based on a novel differentiable approximation of the objective. This allows us to optimize the simulator, which may be non-differentiable, requiring only one objective evaluation at each iteration with little overhead. We demonstrate on a state-of-the-art photorealistic renderer that the proposed method finds the optimal data distribution faster (up to $50\times$), with significantly reduced training data generation (up to $30\times$) and better accuracy ($+8.7\%$) on real-world test datasets than previous methods.
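To make the cost of the baseline concrete, the following is a minimal sketch of a REINFORCE-like (score-function) gradient estimator on a toy problem. It is not the paper's method and not the authors' code. The function `black_box_objective`, the Gaussian sampling distribution, and all hyperparameters are hypothetical stand-ins: in the setting the abstract describes, each call to the objective would mean generating data, training a model, and validating it.

```python
import random


def black_box_objective(x):
    """Hypothetical stand-in for 'generate data, train, validate'.

    Higher is better; the optimum of this toy function is at x = 3.
    """
    return -(x - 3.0) ** 2


def reinforce_gradient(mu, sigma, num_samples, rng):
    """Score-function estimate of d/dmu E_{x ~ N(mu, sigma^2)}[objective(x)].

    Returns the estimate and the number of objective evaluations it used.
    """
    samples = [rng.gauss(mu, sigma) for _ in range(num_samples)]
    rewards = [black_box_objective(x) for x in samples]
    # Mean-reward baseline to reduce the variance of the estimate.
    baseline = sum(rewards) / num_samples
    # d/dmu log N(x; mu, sigma^2) = (x - mu) / sigma^2
    grad = sum(
        (r - baseline) * (x - mu) / sigma ** 2
        for x, r in zip(samples, rewards)
    )
    return grad / num_samples, num_samples


def optimise(mu=0.0, sigma=0.5, lr=0.05, iterations=200, num_samples=16, seed=0):
    """Gradient ascent on mu, counting every objective evaluation."""
    rng = random.Random(seed)
    evaluations = 0
    for _ in range(iterations):
        grad, used = reinforce_gradient(mu, sigma, num_samples, rng)
        mu += lr * grad
        evaluations += used
    return mu, evaluations


if __name__ == "__main__":
    final_mu, total_evaluations = optimise()
    print(f"mu = {final_mu:.3f} after {total_evaluations} objective evaluations")
```

In this sketch every iteration spends `num_samples` evaluations of the objective, which is the per-iteration cost the abstract contrasts with a method that needs only one evaluation per iteration.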

