Scaling-Aware Data Selection for End-to-End Autonomous Driving Systems
This work addresses data efficiency for autonomous driving systems and offers a scalable way to reduce training data requirements, though the contribution is incremental because it builds on existing data selection and scaling law concepts.
The paper tackles data selection for training large-scale models in physical AI applications, where it is ambiguous how individual data points affect different evaluation metrics. It proposes MOSAIC, a framework that optimizes data mixtures using scaling laws, and applies it to autonomous driving, where it outperforms baselines on driving metrics with up to 80% less data.
Large-scale deep learning models for physical AI applications depend on diverse training data collection efforts. These models, and correspondingly the training data, must address different evaluation criteria necessary for the models to be deployable in real-world environments. Data selection policies can guide the development of the training set, but current frameworks do not account for the ambiguity in how data points affect different metrics. In this work, we propose Mixture Optimization via Scaling-Aware Iterative Collection (MOSAIC), a general data selection framework that operates by: (i) partitioning the dataset into domains; (ii) fitting neural scaling laws from each data domain to the evaluation metrics; and (iii) optimizing a data mixture by iteratively adding data from domains that maximize the change in metrics. We apply MOSAIC to autonomous driving (AD), where an End-to-End (E2E) planner model is evaluated on the Extended Predictive Driver Model Score (EPDMS), an aggregate of driving rule compliance metrics. Here, MOSAIC outperforms a diverse set of baselines on EPDMS with up to 80% less data.
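The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each domain's effect on a single error-like metric follows a power law `error = b * n**(-c)`, that per-domain errors add up, and that data is collected in fixed-size chunks. The function names (`fit_power_law`, `greedy_mixture`), the domain names, and all numbers are hypothetical.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit errors ~ b * sizes**(-c) by least squares in log-log space.

    Assumption: a plain power law stands in for the paper's scaling laws.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)  # (b, c)


def predicted_error(n, b, c):
    """Predicted metric error after training on n samples from one domain."""
    return b * n ** (-c)


def greedy_mixture(laws, start, pool, budget, step):
    """Greedily spend `budget` samples in chunks of `step`.

    laws:  {domain: (b, c)} fitted scaling-law parameters
    start: {domain: samples already in the training set (must be > 0)}
    pool:  {domain: maximum samples available in that domain}
    """
    counts = dict(start)
    remaining = budget
    while remaining >= step:
        best, best_gain = None, 0.0
        for domain, (b, c) in laws.items():
            if counts[domain] + step > pool[domain]:
                continue  # this domain has no more data to offer
            gain = (predicted_error(counts[domain], b, c)
                    - predicted_error(counts[domain] + step, b, c))
            if gain > best_gain:
                best, best_gain = domain, gain
        if best is None:
            break  # no domain is predicted to improve the metric
        counts[best] += step
        remaining -= step
    return counts


if __name__ == "__main__":
    # Hypothetical pilot measurements: metric error at several subset sizes.
    sizes = np.array([100.0, 200.0, 400.0, 800.0])
    pilot = {
        "urban": 2.0 * sizes ** -0.5,
        "highway": 1.5 * sizes ** -0.2,
    }
    laws = {d: fit_power_law(sizes, errs) for d, errs in pilot.items()}
    mixture = greedy_mixture(
        laws,
        start={"urban": 100, "highway": 100},
        pool={"urban": 1000, "highway": 1000},
        budget=400,
        step=100,
    )
    print(mixture)
```

Because a power law has diminishing returns, the greedy loop naturally shifts the budget between domains as each one saturates, which is the intuition behind a scaling-aware mixture. Handling several metrics at once, as EPDMS requires, would need an aggregate gain in place of the single-metric `gain` used here.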