CV · Nov 26, 2020

A Recurrent Vision-and-Language BERT for Navigation

arXiv:2011.13922v2 · 446 citations
AI Analysis

This work provides a more efficient and generalizable transformer-based solution for vision-and-language navigation tasks, benefiting researchers and developers working on embodied AI agents.

This paper proposes a recurrent BERT model for vision-and-language navigation (VLN) that addresses the challenge of history-dependent attention and decision-making in partially observable Markov decision processes. The model achieves state-of-the-art results on R2R and REVERIE datasets, outperforming more complex encoder-decoder models.

The accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application to the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty of adapting the BERT architecture to the partially observable Markov decision process present in VLN, which requires history-dependent attention and decision making. In this paper we propose a time-aware recurrent BERT model for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is capable of solving navigation and referring expression tasks simultaneously.
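The idea of a recurrent function that carries cross-modal state between navigation steps can be illustrated with a toy sketch. This is not the authors' architecture: it assumes a single state vector, one scaled dot-product attention per modality, and a tanh update, and every name (`attend`, `recurrent_step`, `navigate`) is hypothetical. Reading action probabilities from the state's attention over visual candidates and initialising the state from the mean instruction embedding are likewise assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors.

    Returns the attention weights and the weighted sum of the keys.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    context = [
        sum(w * k[i] for w, k in zip(weights, keys))
        for i in range(len(query))
    ]
    return weights, context


def recurrent_step(state, language_tokens, visual_candidates):
    """One time step: the state attends to both modalities and is updated.

    Assumption: the attention weights over the visual candidates double
    as the action distribution for this step.
    """
    _, lang_ctx = attend(state, language_tokens)
    action_probs, vis_ctx = attend(state, visual_candidates)
    new_state = [
        math.tanh(s + l + v) for s, l, v in zip(state, lang_ctx, vis_ctx)
    ]
    return new_state, action_probs


def navigate(language_tokens, observations):
    """Roll the recurrent step over a sequence of per-step observations."""
    dim = len(language_tokens[0])
    # Assumption: initialise the state as the mean instruction embedding.
    state = [
        sum(tok[i] for tok in language_tokens) / len(language_tokens)
        for i in range(dim)
    ]
    actions = []
    for visual_candidates in observations:
        state, probs = recurrent_step(state, language_tokens, visual_candidates)
        # Greedy choice: index of the most attended candidate.
        actions.append(max(range(len(probs)), key=probs.__getitem__))
    return state, actions
```

Because the updated state is fed back in at each step, the action chosen at step t depends on everything the agent has seen so far, which is the history dependence a partially observable setting calls for.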

Code Implementations: 1 repo