Few-shot Object Grounding and Mapping for Natural Language Robot Instruction Following
This work addresses robot instruction following in environments that contain previously unseen objects, an incremental advance in few-shot learning for robotics.
The paper enables robots to follow natural language instructions involving new objects by introducing a few-shot object grounding method and a learned map representation. On a physical quadcopter control task, the approach significantly outperforms the prior state of the art when new objects are present.
We study the problem of learning a robot policy that follows natural language instructions and can be easily extended to reason about new objects. We introduce a few-shot language-conditioned object grounding method trained from augmented reality data that uses exemplars to identify objects and align them to their mentions in instructions. We present a learned map representation that encodes object locations and their instructed use, and construct it from our few-shot grounding output. We integrate this mapping approach into an instruction-following policy, thereby allowing it to reason about previously unseen objects at test time by simply adding exemplars. We evaluate on the task of learning to map raw observations and instructions to continuous control of a physical quadcopter. Our approach significantly outperforms the prior state of the art in the presence of new objects, even when the prior approach observes all objects during training.
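To make the idea of extending to new objects "by simply adding exemplars" concrete, the following is a minimal sketch of exemplar-based matching, not the method used in this work. It assumes that image regions and exemplars have already been embedded as feature vectors, and it swaps in plain cosine-similarity nearest-exemplar matching with a fixed threshold. The names `cosine`, `ground_regions`, `threshold`, the object names, and the toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground_regions(regions, exemplars, threshold=0.5):
    """Label each region with the object whose exemplar is most similar.

    regions:   list of feature vectors (assumed embeddings of image regions)
    exemplars: dict mapping object name -> list of exemplar feature vectors
    Returns one label per region, or None when no exemplar exceeds the threshold.
    """
    labels = []
    for region in regions:
        best_name, best_sim = None, threshold
        for name, vectors in exemplars.items():
            # Score an object by its best-matching exemplar.
            sim = max((cosine(region, v) for v in vectors), default=-1.0)
            if sim > best_sim:
                best_name, best_sim = name, sim
        labels.append(best_name)
    return labels


# Toy data: two known objects, two observed regions.
exemplars = {"globe": [[1.0, 0.0, 0.0]], "plant": [[0.0, 1.0, 0.0]]}
regions = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]

before = ground_regions(regions, exemplars)

# Adding an exemplar for a new object requires no retraining in this sketch.
exemplars["lamp"] = [[0.0, 0.0, 1.0]]
after = ground_regions(regions, exemplars)
```

In this toy setup the second region matches nothing until the hypothetical "lamp" exemplar is added, after which it is labeled without changing any other part of the pipeline. The approach described above additionally aligns objects to their mentions in the instruction and builds a learned map from the grounding output, which this sketch does not attempt.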