CV, CL, RO | Oct 6, 2022

Iterative Vision-and-Language Navigation

UW
arXiv:2210.03087v3 | 42 citations | h-index: 44
AI Analysis

The paper addresses a disparity between benchmark evaluation and robotics deployment, where agents operate in the same environment over time. The contribution is incremental in that it builds on existing VLN benchmarks.

The paper tackles the evaluation of language-guided agents in persistent environments by introducing the Iterative Vision-and-Language Navigation (IVLN) paradigm, which extends existing benchmarks so that agents maintain memory across tours of up to 100 ordered episodes. It finds that agents that build maps can benefit from this persistence, whereas extending the implicit memory of transformer-based agents is not sufficient.

We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes. We find that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN.
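The difference between the two evaluation regimes described in the abstract can be sketched as a change in where the agent's memory is reset. The following is a minimal illustration, not the authors' code: the `Episode`, `Tour`, and `CountingAgent` classes, the node ids, and the functions `run_vln` and `run_ivln` are all hypothetical names, and the sketch assumes memory is cleared at the start of each tour in the iterative setting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Episode:
    instruction: str          # natural-language instruction
    target_path: List[str]    # ordered node ids (hypothetical format)


@dataclass
class Tour:
    scene_id: str
    episodes: List[Episode]   # ordered; the abstract says up to 100 per tour


@dataclass
class CountingAgent:
    """Toy agent whose 'memory' is just the set of nodes seen so far."""
    memory: set = field(default_factory=set)

    def reset_memory(self) -> None:
        self.memory.clear()

    def run_episode(self, episode: Episode) -> int:
        # Return how many target-path nodes were already in memory,
        # then remember every node on this path.
        known = sum(node in self.memory for node in episode.target_path)
        self.memory.update(episode.target_path)
        return known


def run_vln(agent: CountingAgent, tours: List[Tour]) -> List[int]:
    """Standard VLN: memory is erased at the start of every episode."""
    results = []
    for tour in tours:
        for episode in tour.episodes:
            agent.reset_memory()
            results.append(agent.run_episode(episode))
    return results


def run_ivln(agent: CountingAgent, tours: List[Tour]) -> List[int]:
    """Iterative setting (assumed): memory is erased once per tour
    and persists across the ordered episodes within it."""
    results = []
    for tour in tours:
        agent.reset_memory()
        for episode in tour.episodes:
            results.append(agent.run_episode(episode))
    return results


tours = [Tour("scene_a", [
    Episode("go to the kitchen", ["n1", "n2", "n3"]),
    Episode("go to the hallway", ["n3", "n2", "n4"]),
])]
vln_known = run_vln(CountingAgent(), tours)    # cold start every episode
ivln_known = run_ivln(CountingAgent(), tours)  # memory carries over
```

In this toy setup the second episode revisits two nodes from the first, so only the iterative loop sees them as already known. A real agent would replace the node set with an implicit memory or an explicit map.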

