CL · CV · May 31, 2019

Multi-modal Discriminative Model for Vision-and-Language Navigation

arXiv:1905.13358v11106 citations
Originality: Incremental advance
AI Analysis

This work addresses data scarcity in VLN for AI agents, but it is incremental as it builds on existing augmentation methods.

The paper addresses limited generalization in Vision-and-Language Navigation (VLN), which stems from the cost of collecting paired vision-language data. It develops a multi-modal discriminator that scores how well an instruction aligns with a path. A VLN agent warm-started with components pre-trained in the discriminator achieves a 10% relative increase in success rate on previously unseen environments.

Vision-and-Language Navigation (VLN) is a natural language grounding task in which agents must interpret natural language instructions in the context of visual scenes in a dynamic environment to achieve prescribed navigation goals. Successful agents must be able to parse natural language of varying linguistic styles, ground it in potentially unfamiliar scenes, and plan and react under ambiguous environmental feedback. Generalization ability is limited by the amount of human-annotated data. In particular, paired vision-language sequence data is expensive to collect. We develop a discriminator that evaluates how well an instruction explains a given path in the VLN task using multi-modal alignment. Our study reveals that only a small fraction of the high-quality augmented data from Fried et al. (2018), as scored by our discriminator, is useful for training VLN agents with similar performance on previously unseen environments. We also show that a VLN agent warm-started with pre-trained components from the discriminator outperforms the benchmark success rate of 35.5 by 10% (relative) on previously unseen environments.
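The two quantitative ideas in the abstract, keeping only the best-scored fraction of augmented data and reporting a relative gain over a baseline, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `keep_top_fraction`, `relative_gain`, the `Pair` type, and the scoring callable are hypothetical names introduced here, and the scorer merely stands in for whatever alignment score a trained discriminator would produce.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical placeholder type: (instruction text, path identifier).
Pair = Tuple[str, str]


def keep_top_fraction(
    pairs: Sequence[Pair],
    score: Callable[[Pair], float],
    fraction: float,
) -> List[Pair]:
    """Rank instruction-path pairs by score and keep the best `fraction`.

    `score` is assumed to return higher values for better-aligned pairs;
    in the paper this role is played by the discriminator.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(pairs, key=score, reverse=True)
    # Keep at least one pair whenever the input is non-empty.
    n_keep = max(1, int(len(ranked) * fraction))
    return ranked[:n_keep]


def relative_gain(baseline: float, relative: float) -> float:
    """Value reached by improving `baseline` by a relative amount."""
    return baseline * (1.0 + relative)
```

Under these assumptions, `relative_gain(35.5, 0.10)` gives roughly 39.05, which is only the arithmetic implied by "10% relative" over a 35.5 baseline; the abstract itself does not state the resulting figure.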