The Ingredients for Robotic Diffusion Transformers
This work addresses the problem of designing high-capacity diffusion transformer policies for robot control, with the goal of solving more general tasks on dexterous hardware without extensive per-setup tuning, though it appears incremental, since it builds on existing Transformer and diffusion model improvements.
The paper tackles the challenge of combining Transformer architectures with diffusion models for robotic control by identifying and improving key architectural design decisions. The resulting novel architecture significantly outperforms state-of-the-art methods on long-horizon (1500+ time-step) dexterous tasks on a bi-manual ALOHA robot and shows improved scaling when trained on 10 hours of highly multi-modal, language-annotated demonstration data.
In recent years, roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high-capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study, and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named \method, that significantly outperforms the state of the art in solving long-horizon ($1500+$ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language-annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that combine the efficiency of generative diffusion modeling with the scalability of large-scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io