How Physics and Background Attributes Impact Video Transformers in Robotic Manipulation: A Case Study on Planar Pushing
This work addresses the problem of dataset composition for cost-effective robot learning, but it is incremental as it focuses on a specific case study.
The study investigated how physics attributes and background scene characteristics affect Video Transformers in predicting planar pushing trajectories, finding that changes in friction coefficient and background dynamics most harm generalization, and fine-tuning with 10-20% of data can adapt models to novel scenarios.
As model and dataset sizes continue to scale in robot learning, the need to understand how the composition and properties of a dataset affect model performance becomes increasingly urgent to ensure cost-effective data collection and model performance. In this work, we empirically investigate how physics attributes (color, friction coefficient, shape) and scene background characteristics, such as the complexity and dynamics of interactions with background objects, influence the performance of Video Transformers in predicting planar pushing trajectories. We investigate three primary questions: How do physics attributes and background scene characteristics influence model performance? What kind of changes in attributes are most detrimental to model generalization? What proportion of fine-tuning data is required to adapt models to novel scenarios? To facilitate this research, we present CloudGripper-Push-1K, a large real-world vision-based robot pushing dataset comprising 1278 hours and 460,000 videos of planar pushing interactions with objects with different physics and background attributes. We also propose Video Occlusion Transformer (VOT), a generic modular video-transformer-based trajectory prediction framework which features 3 choices of 2D-spatial encoders as the subject of our case study. The dataset and source code are available at https://cloudgripper.org.