SENTINEL: A Fully End-to-End Language-Action Model for Humanoid Whole Body Control
This work addresses the challenge of tight language-action alignment for humanoid robots, though it appears incremental: it mainly integrates existing techniques such as flow matching and residual refinement.
The paper tackles the problem of aligning language commands with physical behaviors in humanoid control. It introduces SENTINEL, an end-to-end language-action model that maps language and proprioceptive inputs directly to low-level actions, and it reports strong semantic understanding and stable execution in both simulation and real-world deployment.
Existing humanoid control systems often rely on teleoperation or modular generation pipelines that separate language understanding from physical execution. However, the former is entirely human-driven, and the latter lacks tight alignment between language commands and physical behaviors. In this paper, we present SENTINEL, a fully end-to-end language-action model for humanoid whole-body control. We construct a large-scale dataset by tracking human motions in simulation using a pretrained whole-body controller and pairing the tracked motions with their text annotations. The model directly maps language commands and proprioceptive inputs to low-level actions without any intermediate representation. The model generates action chunks using flow matching, which can be subsequently refined by a residual action head for real-world deployment. Our method exhibits strong semantic understanding and stable execution on humanoid robots in both simulation and real-world deployment, and also supports multi-modal extensions by converting inputs into text.
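The abstract's two-stage action generation (a flow-matching sampler that produces an action chunk, followed by a residual correction) can be illustrated with a toy sketch. This is not the paper's implementation. The chunk length, action dimension, Euler integrator, and step count are assumptions chosen for illustration. The function `toy_velocity` is a hypothetical analytic stand-in for the learned velocity network, which in the actual model would be conditioned on language and proprioception. The constant residual is likewise a placeholder for the output of a residual action head.

```python
import random

CHUNK_LEN = 4   # hypothetical number of timesteps in one action chunk
ACTION_DIM = 3  # hypothetical number of actuated joints


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity field v(x, t | language, proprioception).

    On the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity that
    carries the current point x to the endpoint `target` is (target - x) / (1 - t).
    A trained network would predict this without access to `target`.
    """
    return [[(g - xi) / (1.0 - t) for xi, g in zip(row, g_row)]
            for row, g_row in zip(x, target)]


def sample_chunk(velocity_fn, target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) towards t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)] for _ in range(CHUNK_LEN)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # the last evaluated t is (num_steps - 1) / num_steps, so 1 - t > 0
        v = velocity_fn(x, t, target)
        x = [[xi + dt * vi for xi, vi in zip(row, v_row)] for row, v_row in zip(x, v)]
    return x


def apply_residual(chunk, residual):
    """Add a per-step, per-joint correction to the sampled chunk."""
    return [[a + r for a, r in zip(row, r_row)] for row, r_row in zip(chunk, residual)]


# Toy "ground-truth" chunk that the stand-in velocity field flows towards.
target = [[0.1 * i + 0.01 * j for j in range(ACTION_DIM)] for i in range(CHUNK_LEN)]
chunk = sample_chunk(toy_velocity, target)

# Placeholder for the residual head's output: a small constant offset.
residual = [[0.05] * ACTION_DIM for _ in range(CHUNK_LEN)]
refined = apply_residual(chunk, residual)
```

With this analytic velocity field the final Euler step lands on the target up to floating-point error, which makes the sketch easy to check. A learned field would only approximate the target, and the residual term is one place where a sim-to-real correction could be applied.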