Cross-Lingual Vision-Language Navigation
This work addresses the dominance of English in VLN so that agents can serve multilingual users, though it is incremental in that it extends an existing benchmark and method.
The paper tackles vision-language navigation (VLN) beyond English by introducing a bilingual dataset (BL-R2R) with Chinese instructions and studying zero-shot cross-lingual navigation. The model, trained only on English data, achieves results competitive with models that have full access to target-language training data. The paper also investigates how well the model transfers when given a limited amount of target-language data.
Commanding a robot to navigate with natural language instructions is a long-term goal for grounded language understanding and robotics. However, English is the dominant language in previous studies on vision-language navigation (VLN). To go beyond English and serve people speaking different languages, we collect a bilingual Room-to-Room (BL-R2R) dataset, extending the original benchmark with new Chinese instructions. Based on this newly introduced dataset, we study how an agent can be trained on existing English instructions yet navigate effectively in another language under a zero-shot learning scenario. Without any training data in the target language, our model shows competitive results even when compared to a model with full access to the target-language training data. Moreover, we investigate the transfer ability of our model when it is given a certain amount of target-language training data.