Skill Transformer: A Monolithic Policy for Mobile Manipulation
The work addresses mobile manipulation in robotics by improving task planning and control in new scenarios, though the contribution is incremental, since it builds on existing transformer and modular-skill methods.
The paper tackles long-horizon robotic tasks by introducing Skill Transformer, a monolithic policy that combines conditional sequence modeling with skill modularity to predict high-level skills and low-level actions end-to-end. It achieves a 2.5x higher success rate than baselines on hard rearrangement problems.
We present Skill Transformer, an approach for solving long-horizon robotic tasks by combining conditional sequence modeling and skill modularity. Conditioned on egocentric and proprioceptive observations of a robot, Skill Transformer is trained end-to-end to predict both a high-level skill (e.g., navigation, picking, placing) and a whole-body low-level action (e.g., base and arm motion), using a transformer architecture and demonstration trajectories that solve the full task. It retains the composability and modularity of the overall task through a skill predictor module while reasoning about low-level actions and avoiding the hand-off errors common in modular approaches. We test Skill Transformer on an embodied rearrangement benchmark and find that it performs robust task planning and low-level control in new scenarios, achieving a 2.5x higher success rate than baselines on hard rearrangement problems.
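The two-level prediction described above can be illustrated with a minimal sketch of the inference-time data flow. This is not the authors' implementation: the learned transformer modules are replaced by hand-written toy rules, and every name here (`Observation`, `predict_skill`, `predict_action`, `policy_step`), the feature layout, and the three-number action format are hypothetical. The sketch also assumes that the low-level action is conditioned on the predicted skill, which the abstract implies but does not state in detail.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Skill names taken from the examples in the abstract.
SKILLS = ("navigate", "pick", "place")


@dataclass
class Observation:
    # Hypothetical feature layout, for illustration only.
    egocentric: Sequence[float]      # [0]: proximity to the target, in [0, 1]
    proprioceptive: Sequence[float]  # [0]: 1.0 if an object is held, else 0.0


def predict_skill(history: List[Observation]) -> int:
    """Toy stand-in for the skill predictor (a learned module in the paper)."""
    latest = history[-1]
    near_target = latest.egocentric[0] > 0.5
    holding = latest.proprioceptive[0] > 0.5
    if not near_target:
        return SKILLS.index("navigate")
    return SKILLS.index("place") if holding else SKILLS.index("pick")


def predict_action(history: List[Observation], skill_id: int) -> List[float]:
    """Toy stand-in for the action predictor, conditioned on the skill.

    Returns a whole-body action [base_forward, base_turn, gripper].
    """
    skill = SKILLS[skill_id]
    if skill == "navigate":
        return [1.0, 0.0, 0.0]   # drive forward, leave the gripper alone
    if skill == "pick":
        return [0.0, 0.0, 1.0]   # stop the base, close the gripper
    return [0.0, 0.0, -1.0]      # stop the base, open the gripper


def policy_step(history: List[Observation]) -> Tuple[str, List[float]]:
    """One policy call: choose a skill, then a skill-conditioned action."""
    skill_id = predict_skill(history)
    return SKILLS[skill_id], predict_action(history, skill_id)
```

Because a single policy emits both outputs at every step, no separate controller has to take over when the skill changes, which is the situation in which modular pipelines incur hand-off errors.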