Modular Framework for Visuomotor Language Grounding
This work addresses data collection challenges in robotics and grounded language research, though it appears incremental, since it builds on existing modular approaches.
The authors tackle data inefficiency in natural language instruction-following tasks by proposing a modular framework, LAV, that separates language, action, and vision into independently trainable modules. They report a preliminary evaluation of the framework on the ALFRED task.
Natural language instruction-following tasks serve as a valuable test-bed for grounded language and robotics research. However, data collection for these tasks is expensive, and end-to-end approaches suffer from data inefficiency. We propose structuring language, action, and vision tasks into separate modules that can be trained independently. The Language, Action, and Vision (LAV) framework removes the dependence of the action and vision modules on instruction-following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.
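To make the modular separation concrete, the following is a minimal sketch of how three independently replaceable modules might be wired together. It is not the authors' implementation: the interface names (`LanguageModule`, `VisionModule`, `ActionModule`), the `Subgoal` type, the toy rule-based modules, and `run_pipeline` are all hypothetical and chosen only for illustration. In a real system each module would be a learned model trained on its own data.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple

Position = Tuple[int, int]


@dataclass(frozen=True)
class Subgoal:
    """Hypothetical intermediate representation passed between modules."""
    verb: str
    target: str


class LanguageModule(Protocol):
    def parse(self, instruction: str) -> List[Subgoal]: ...


class VisionModule(Protocol):
    def detect(self, observation: Dict[str, Position]) -> Dict[str, Position]: ...


class ActionModule(Protocol):
    def act(self, subgoal: Subgoal, detections: Dict[str, Position]) -> str: ...


class KeywordLanguage:
    """Toy stand-in: splits on ' then ', first word is the verb, last the target."""

    def parse(self, instruction: str) -> List[Subgoal]:
        subgoals = []
        for clause in instruction.lower().split(" then "):
            words = clause.split()
            subgoals.append(Subgoal(verb=words[0], target=words[-1]))
        return subgoals


class PassthroughVision:
    """Toy stand-in: the 'observation' already lists visible objects."""

    def detect(self, observation: Dict[str, Position]) -> Dict[str, Position]:
        return dict(observation)


class RuleAction:
    """Toy stand-in: interact if the target is visible, otherwise explore."""

    def act(self, subgoal: Subgoal, detections: Dict[str, Position]) -> str:
        if subgoal.target in detections:
            return f"{subgoal.verb}({subgoal.target})"
        return "explore"


def run_pipeline(
    instruction: str,
    observation: Dict[str, Position],
    language: LanguageModule,
    vision: VisionModule,
    action: ActionModule,
) -> List[str]:
    # Only the language module ever sees the instruction text, so the
    # vision and action modules need no instruction-following data.
    detections = vision.detect(observation)
    return [action.act(sg, detections) for sg in language.parse(instruction)]


if __name__ == "__main__":
    actions = run_pipeline(
        "pickup mug then goto sink",
        {"mug": (1, 2)},
        KeywordLanguage(),
        PassthroughVision(),
        RuleAction(),
    )
    print(actions)
```

Because the modules communicate only through the shared `Subgoal` and detection types, any one of them can be retrained or swapped without touching the others, which is the property the abstract attributes to the modular design.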