Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind
This work addresses a key challenge in human-robot collaboration by enabling more robust, context-aware following of spoken instructions, though it builds incrementally on existing vision-language models (VLMs).
The paper tackles the problem of robots following spoken human instructions during noisy real-world collaboration. It introduces SIFToM, a neurosymbolic model that uses theory of mind to interpret instructions pragmatically. SIFToM significantly improves a lightweight base VLM, outperforms state-of-the-art VLMs, and approaches human-level accuracy.
Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.
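The idea of using collaborative context to interpret a noisy instruction can be illustrated with a small Bayesian sketch. This is not SIFToM's actual inference procedure, which the abstract does not specify. It is a minimal illustration that assumes the robot holds a context-derived prior over candidate instructions and scores each candidate against what it heard. The function names, the string-similarity noise model, and the prior values are all hypothetical.

```python
from difflib import SequenceMatcher


def transcript_likelihood(heard: str, candidate: str) -> float:
    """Toy noise model (an assumption): string similarity in [0, 1]
    stands in for P(heard transcript | intended instruction)."""
    return SequenceMatcher(None, heard.lower(), candidate.lower()).ratio()


def infer_instruction(heard: str, context_prior: dict[str, float]) -> dict[str, float]:
    """Return a normalized posterior over candidate instructions.

    context_prior maps each candidate instruction to how plausible it is
    given the observed collaborative context (hypothetical values here).
    """
    scores = {
        candidate: prior * transcript_likelihood(heard, candidate)
        for candidate, prior in context_prior.items()
    }
    total = sum(scores.values())
    if total == 0:
        # Nothing matched at all: fall back to a uniform distribution.
        return {candidate: 1.0 / len(scores) for candidate in scores}
    return {candidate: score / total for candidate, score in scores.items()}


# Hypothetical scene: the human is setting a dinner table, so asking for
# a fork is far more plausible than asking for pork or a book.
prior = {
    "bring me the fork": 0.60,
    "bring me the book": 0.35,
    "bring me the pork": 0.05,
}

# The speech recognizer misheard "fork" as "pork".
posterior = infer_instruction("bring me the pork", prior)
best_guess = max(posterior, key=posterior.get)
print(best_guess)
```

In this toy setup the context prior outweighs the exact transcript match, so the inferred instruction differs from the literal transcript. A real system would replace both the prior and the noise model with learned or model-based components.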