RO AI CV, Nov 22, 2021

Talk-to-Resolve: Combining scene understanding and spatial dialogue to resolve granular task ambiguity for a collocated robot

Pradip Pramanick, Chayan Sarkar, Snehasis Banerjee, Brojeshwar Bhowmick

arXiv:2111.11099v2 11.621 citations

Originality: Incremental advance

AI Analysis

This work addresses the challenge of intuitive human-robot interaction for collocated robots. It is an incremental advance, since it builds on existing methods for scene understanding and dialogue.

The paper tackles the problem of robots facing unforeseen circumstances while executing tasks given as natural language instructions. It introduces Talk-to-Resolve (TTR), a system that uses scene understanding and spatial dialogue to resolve ambiguities. TTR identifies and resolves stalemates with 82% accuracy, and a user study rated its questions as more natural than those of a state-of-the-art system.

The utility of collocated robots largely depends on an easy and intuitive mechanism for interacting with humans. If a robot accepts task instructions in natural language, it first has to understand the user's intention by decoding the instruction. However, while executing the task, the robot may face unforeseeable circumstances due to variations in the observed scene and may therefore require further user intervention. In this article, we present a system called Talk-to-Resolve (TTR) that enables a robot to resolve the impasse by observing the scene visually and initiating a coherent dialogue exchange with the instructor. Through dialogue, it finds either a cue to move forward in the original plan, an acceptable alternative to the original plan, or affirmation to abort the task altogether. To recognize a possible stalemate, we use the dense captions of the observed scene and the given instruction jointly to compute the robot's next action. We evaluate our system on a data set of initial instruction and situational scene pairs. Our system can identify stalemates and resolve them through appropriate dialogue exchange with 82% accuracy. Additionally, a user study reveals that the questions from our system are more natural (4.02 on average on a scale of 1 to 5) than those of a state-of-the-art system (3.08 on average).
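The idea of computing the next action jointly from the instruction and the scene captions can be illustrated with a toy sketch. This is not the authors' implementation: the `Caption` type, the `Outcome` labels, the exact-match rule on object names, and the question wording are all assumptions made here for illustration, and a real system would work on dense-captioning model output rather than hand-written captions.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for this sketch, not taken from the paper.
    PROCEED = "proceed"                  # exactly one candidate: no dialogue needed
    ASK_WHICH = "ask_which"              # several candidates: ask the user to pick
    ASK_ALTERNATIVE = "ask_alternative"  # no candidate: ask for an alternative or abort


@dataclass
class Caption:
    obj: str          # object category, e.g. "cup"
    description: str  # caption text, e.g. "red cup on the table"


def next_action(target: str, captions: list[Caption]) -> tuple[Outcome, str]:
    """Pick the next action from the instructed target and the scene captions.

    Returns the outcome and the question to ask (empty when none is needed).
    """
    matches = [c for c in captions if c.obj == target]
    if len(matches) == 1:
        return Outcome.PROCEED, ""
    if len(matches) > 1:
        # Use the captions' attributes and spatial relations to phrase the question.
        options = " or ".join(f"the {c.description}" for c in matches)
        return Outcome.ASK_WHICH, f"I see more than one {target}. Do you mean {options}?"
    return (
        Outcome.ASK_ALTERNATIVE,
        f"I cannot see any {target}. Should I look for something else or stop?",
    )


scene = [
    Caption("cup", "red cup on the table"),
    Caption("cup", "blue cup on the shelf"),
    Caption("book", "book on the chair"),
]
outcome, question = next_action("cup", scene)
print(outcome.name, "|", question)
```

In this sketch, the three branches loosely mirror the cases named in the abstract: continuing the original plan, asking for a cue to disambiguate, and asking whether to accept an alternative or abort.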
