RoboSSM: Scalable In-context Imitation Learning via State-Space Models
This work addresses the scalability of few-shot adaptation in robot learning, offering a more efficient alternative to Transformers. The contribution is incremental, however: it builds on existing ICIL methods and changes only the backbone model.
The paper tackles two weaknesses of Transformer-based in-context imitation learning (ICIL) for robots: computational cost and poor extrapolation to prompts longer than those seen during training. It introduces RoboSSM, which uses a state-space model (Longhorn) as the backbone to achieve linear-time inference and robust performance on long-context prompts. On the LIBERO benchmark, the method performs well on unseen tasks and in long-horizon scenarios.
In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform on prompts longer than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSMs). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. We evaluate our approach on the LIBERO benchmark and compare it against strong Transformer-based ICIL baselines. Experiments show that RoboSSM extrapolates effectively to varying numbers of in-context demonstrations, yields high performance on unseen tasks, and remains robust in long-horizon scenarios. These results highlight the potential of SSMs as an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.
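To make the linear-time claim concrete, the sketch below runs a generic diagonal linear recurrence over a prompt. It is an illustration only: it is not Longhorn's update rule and is not taken from the RoboSSM code, and the names `ssm_scan`, `decay`, and `gain` are hypothetical. What it shows is the property the abstract relies on: each token is folded into a fixed-size state, so total work grows linearly with prompt length and memory does not grow at all.

```python
from typing import List, Sequence


def ssm_scan(
    tokens: Sequence[Sequence[float]],
    decay: Sequence[float],
    gain: Sequence[float],
) -> List[List[float]]:
    """Apply h_t = decay * h_{t-1} + gain * x_t elementwise over a prompt.

    Generic diagonal recurrence for illustration; NOT the Longhorn update.
    """
    state = [0.0] * len(decay)  # fixed-size state, independent of prompt length
    outputs: List[List[float]] = []
    for x in tokens:  # one constant-cost update per token => linear time overall
        state = [a * h + b * xi for a, h, b, xi in zip(decay, state, gain, x)]
        outputs.append(list(state))
    return outputs


# Tiny example: one state channel, three identical tokens.
short = ssm_scan([[1.0], [1.0], [1.0]], decay=[0.5], gain=[1.0])

# A much longer prompt still ends in a state of the same size.
long_prompt = ssm_scan([[1.0]] * 1000, decay=[0.5], gain=[1.0])
```

By contrast, self-attention compares every token with every earlier token, so its cost grows quadratically with prompt length and its key-value cache grows with every added demonstration.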