Enhancing Speech Instruction Understanding and Disambiguation in Robotics via Speech Prosody
This work addresses a challenge in human-robot communication: disambiguating ambiguous speech instructions. The contribution is incremental, since it builds on existing methods, but it integrates them in a novel way.
The paper tackles the problem of robots accurately interpreting spoken instructions by using speech prosody to infer intent. It reports 95.79% accuracy in detecting referent intents and 71.96% accuracy in determining task plans for ambiguous instructions.
Enabling robots to accurately interpret and execute spoken language instructions is essential for effective human-robot collaboration. Traditional methods rely on speech recognition to transcribe speech into text, often discarding crucial prosodic cues needed for disambiguating intent. We propose a novel approach that directly leverages speech prosody to infer and resolve instruction intent. Predicted intents are integrated into large language models via in-context learning to disambiguate and select appropriate task plans. Additionally, we present the first ambiguous speech dataset for robotics, designed to advance research in speech disambiguation. Our method achieves 95.79% accuracy in detecting referent intents within an utterance and determines the intended task plan of ambiguous instructions with 71.96% accuracy, demonstrating its potential to significantly improve human-robot communication.
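The abstract does not specify how predicted intents are presented to the language model, so the following is only a minimal sketch of one way in-context learning could be set up. It assumes a prosody model has already flagged certain words as referent intents, marks those words in the transcript, and assembles a few-shot prompt that asks for the intended plan. All names (`Example`, `mark_emphasis`, `format_case`, `build_prompt`), the asterisk marking scheme, the prompt layout, and the sample instructions are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    """One in-context demonstration (hypothetical format, not from the paper)."""
    transcript: str
    emphasized: List[str]  # words assumed to be flagged by a prosody model
    plans: List[str]       # candidate task plans for the instruction
    answer: int            # zero-based index of the intended plan


def mark_emphasis(transcript: str, emphasized: List[str]) -> str:
    """Wrap flagged words in asterisks so the text carries the prosodic cue."""
    targets = {w.lower() for w in emphasized}
    out = []
    for token in transcript.split():
        core = token.strip(".,!?")  # compare without trailing punctuation
        if core and core.lower() in targets:
            token = token.replace(core, f"*{core}*")
        out.append(token)
    return " ".join(out)


def format_case(transcript: str, emphasized: List[str], plans: List[str]) -> str:
    """Render one instruction and its numbered candidate plans."""
    lines = [f"Instruction: {mark_emphasis(transcript, emphasized)}"]
    lines += [f"Plan {i}: {p}" for i, p in enumerate(plans, start=1)]
    return "\n".join(lines)


def build_prompt(examples: List[Example], transcript: str,
                 emphasized: List[str], plans: List[str]) -> str:
    """Assemble solved demonstrations followed by the unsolved query."""
    blocks = [
        format_case(ex.transcript, ex.emphasized, ex.plans)
        + f"\nIntended plan: {ex.answer + 1}"
        for ex in examples
    ]
    blocks.append(format_case(transcript, emphasized, plans) + "\nIntended plan:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    # Made-up demonstration and query, for illustration only.
    demos = [Example("Hand me the green cup.", ["cup"],
                     ["hand over the green cup", "hand over the green bowl"], 0)]
    print(build_prompt(demos, "Move the blue box.", ["blue"],
                       ["move the blue box", "move the red box"]))
```

The resulting string would be sent to a language model of choice, and the number it returns would select the task plan. Marking emphasis inline is just one design option; the predicted intents could equally be passed as a separate field in the prompt.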