RO · AI · LG · May 12, 2025

Guiding Data Collection via Factored Scaling Curves

arXiv:2505.07728v1 · 10 citations · h-index: 11 · Has Code
Originality: Highly original
AI Analysis

This paper addresses the high cost of data collection for robotic manipulation tasks, offering a principled method to improve generalization under a limited data budget.

The paper tackles the problem of efficiently collecting data for training generalist imitation learning policies by introducing factored scaling curves to quantify performance variations across environmental factors, enabling targeted data acquisition that boosts success rates by up to 26% over existing strategies.

Generalist imitation learning policies trained on large datasets show great promise for solving diverse manipulation tasks. However, to ensure generalization to different conditions, policies need to be trained with data collected across a large set of environmental factor variations (e.g., camera pose, table height, distractors), a prohibitively expensive undertaking if done exhaustively. We introduce a principled method for deciding what data to collect and how much to collect for each factor by constructing factored scaling curves (FSC), which quantify how policy performance varies as data scales along individual or paired factors. These curves enable targeted data acquisition for the most influential factor combinations within a given budget. We evaluate the proposed method through extensive simulated and real-world experiments, across both training-from-scratch and fine-tuning settings, and show that it boosts success rates in real-world tasks in new environments by up to 26% over existing data-collection strategies. We further demonstrate how factored scaling curves can effectively guide data collection using an offline metric, without requiring real-world evaluation at scale.
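
To make the idea of curve-guided allocation concrete, here is a minimal sketch in Python. It is not the paper's procedure: the abstract does not specify the curve family or the allocation rule, so the power-law form `err ≈ a * n**(-b)`, the greedy marginal-gain rule, the function names, and all pilot numbers below are assumptions chosen only for illustration.

```python
import numpy as np

def fit_power_law(n, err):
    """Fit err ≈ a * n**(-b) by least squares in log-log space.

    ASSUMPTION: a power law is used as the curve family for illustration;
    the source does not state which functional form the authors fit.
    """
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return float(np.exp(intercept)), float(-slope)

def predicted_error(a, b, n):
    """Failure rate predicted by the fitted curve at n demonstrations."""
    return a * n ** (-b)

def allocate_budget(curves, current, budget, step=10):
    """Greedily hand out `budget` demonstrations in chunks of `step`.

    Each chunk goes to the factor whose fitted curve predicts the largest
    drop in failure rate. ASSUMPTION: this greedy rule is a hypothetical
    stand-in for whatever allocation strategy the paper actually uses.
    """
    alloc = {k: 0 for k in curves}
    remaining = budget
    while remaining >= step:
        def gain(k):
            a, b = curves[k]
            n = current[k] + alloc[k]
            return predicted_error(a, b, n) - predicted_error(a, b, n + step)
        best = max(curves, key=gain)
        alloc[best] += step
        remaining -= step
    return alloc

# Hypothetical pilot data: (demonstrations per factor, measured failure rate).
pilot = {
    "camera_pose": ([20, 40, 80], [0.60, 0.45, 0.34]),
    "table_height": ([20, 40, 80], [0.30, 0.27, 0.25]),
    "distractors": ([20, 40, 80], [0.50, 0.42, 0.36]),
}
curves = {
    k: fit_power_law(np.array(n, dtype=float), np.array(e, dtype=float))
    for k, (n, e) in pilot.items()
}
current = {k: 80 for k in pilot}  # demonstrations already collected per factor
allocation = allocate_budget(curves, current, budget=100, step=10)
print(allocation)
```

With these made-up numbers, the steeper curves attract most of the budget, which mirrors the abstract's point that data should go to the most influential factors rather than being spread evenly.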

Code Implementations: 1 repo
