Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives
This work addresses the limited deployability and planning horizon of goal-conditioned visuomotor control in robotics. It offers a more efficient approach that needs neither predefined control primitives nor test-time demonstrations, though it appears incremental, since it builds on existing conditioning ideas.
The paper tackles the problem of achieving versatile skill primitives in visuomotor control. It proposes an end-to-end conditioning scheme that predicts action sequences from raw images, using a dynamic image representation of the robot motion and the distance to a given target observation. It reports significant improvements in task success over representative MPC and IL baselines and demonstrates generalisation to unseen tasks featuring visual noise, cluttered scenes and unseen object geometries.
Visuomotor control (VMC) is an effective means of achieving basic manipulation tasks such as pushing or pick-and-place from raw images. Conditioning VMC on desired goal states is a promising way of achieving versatile skill primitives. However, common conditioning schemes rely either on task-specific fine-tuning, e.g. using one-shot imitation learning (IL), or on sampling approaches using a forward model of scene dynamics, i.e. model-predictive control (MPC), leaving deployability and planning horizon severely limited. In this paper we propose a conditioning scheme which avoids these pitfalls by learning the controller and its conditioning in an end-to-end manner. Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion and the distance to a given target observation. In contrast to related works, this enables our approach to efficiently perform complex manipulation tasks from raw image observations without predefined control primitives or test-time demonstrations. We report significant improvements in task success over representative MPC and IL baselines. We also demonstrate our model's generalisation capabilities in challenging, unseen tasks featuring visual noise, cluttered scenes and unseen object geometries.
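To make the two conditioning inputs named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the dynamic image is formed by a linearly weighted sum of a short frame buffer (the simplified approximate rank pooling weights, alpha_t = 2t - T - 1), and that the distance to the target is an element-wise difference between feature vectors of the target and current observations. The abstract specifies neither detail, and the names dynamic_image and target_distance are hypothetical.

```python
import numpy as np


def dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, C) frame buffer into a single (H, W, C) image.

    Assumption: linear temporal weights alpha_t = 2t - T - 1 for t = 1..T,
    so later frames count positively and earlier frames negatively.
    The weights sum to zero, so a static scene maps to an all-zero image.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    t = np.arange(1, num_frames + 1)
    alpha = 2 * t - num_frames - 1
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=([0], [0]))


def target_distance(current_feat: np.ndarray, target_feat: np.ndarray) -> np.ndarray:
    """Hypothetical goal signal: element-wise difference between the
    target and current feature vectors (zero once the goal is reached)."""
    current = np.asarray(current_feat, dtype=np.float64)
    target = np.asarray(target_feat, dtype=np.float64)
    return target - current


# Three 2x2 single-channel frames whose brightness rises by one per step.
frames = np.stack([np.full((2, 2, 1), float(v)) for v in (0.0, 1.0, 2.0)])
motion = dynamic_image(frames)
goal = target_distance(np.array([0.2, 0.5]), np.array([1.0, 0.5]))
```

In this sketch a controller would receive both motion and goal as inputs; how the paper's network actually encodes and fuses them is not stated in the abstract.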