On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline
This work evaluates pre-training methods for visuo-motor control, shows that a domain gap between pre-training data and control benchmarks hinders current approaches, and provides a strong baseline for benchmarking. The contribution is incremental in that it revisits and refines existing ideas.
The paper examines the effectiveness of pre-training for visuo-motor control tasks and finds that a simple Learning-from-Scratch baseline, which combines data augmentation with a shallow ConvNet, is competitive with recent methods that use frozen pre-trained visual representations. This holds across various algorithms, tasks, and metrics, both in simulation and on a real robot.
In this paper, we examine the effectiveness of pre-training for visuo-motor control tasks. We revisit a simple Learning-from-Scratch (LfS) baseline that incorporates data augmentation and a shallow ConvNet, and find that this baseline is surprisingly competitive with recent approaches (PVR, MVP, R3M) that leverage frozen visual representations trained on large-scale vision datasets -- across a variety of algorithms, task domains, and metrics in simulation and on a real robot. Our results demonstrate that these methods are hindered by a significant domain gap between the pre-training datasets and current benchmarks for visuo-motor control, which is alleviated by finetuning. Based on our findings, we provide recommendations for future research in pre-training for control and hope that our simple yet strong baseline will aid in accurately benchmarking progress in this area.
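To make the two ingredients of the LfS baseline concrete, the following is a minimal sketch of an augmentation step followed by a shallow convolutional encoder. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the augmentation, so a random shift (pad, then random crop) is assumed here, and the function names (`random_shift`, `conv2d_relu`, `shallow_encoder`), the layer count, kernel size, stride, and image size are all hypothetical choices. Only the forward pass is shown; training the encoder jointly with the policy is omitted.

```python
import numpy as np

def random_shift(images, pad=4, rng=None):
    """Assumed augmentation: replicate-pad each image by `pad` pixels,
    then crop back to the original size at a random offset.
    images has shape (N, C, H, W)."""
    rng = np.random.default_rng() if rng is None else rng
    n, c, h, w = images.shape
    padded = np.pad(images, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.empty_like(images)
    for i in range(n):
        # Independent vertical and horizontal offsets in [0, 2 * pad].
        dy, dx = rng.integers(0, 2 * pad + 1, size=2)
        out[i] = padded[i, :, dy:dy + h, dx:dx + w]
    return out

def conv2d_relu(x, weight, stride=2):
    """Unpadded ("valid") strided convolution followed by ReLU.
    x: (N, C, H, W); weight: (O, C, k, k)."""
    k = weight.shape[-1]
    # All k x k windows: shape (N, C, H - k + 1, W - k + 1, k, k).
    windows = np.lib.stride_tricks.sliding_window_view(x, (k, k), axis=(2, 3))
    windows = windows[:, :, ::stride, ::stride]
    out = np.einsum("nchwij,ocij->nohw", windows, weight)
    return np.maximum(out, 0.0)

def shallow_encoder(x, weights):
    """Apply a short stack of conv + ReLU layers and flatten the result
    into one feature vector per image."""
    for w in weights:
        x = conv2d_relu(x, w)
    return x.reshape(x.shape[0], -1)

# Hypothetical usage with random data and untrained (random) weights.
rng = np.random.default_rng(0)
obs = rng.random((8, 3, 32, 32))
weights = [
    rng.normal(0.0, 0.1, (16, 3, 3, 3)),
    rng.normal(0.0, 0.1, (16, 16, 3, 3)),
]
features = shallow_encoder(random_shift(obs, pad=4, rng=rng), weights)
```

In a setup of this kind, the augmentation would be applied only to observations during training, and the resulting features would be fed to whichever policy or value network the chosen control algorithm uses.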