CL Sep 4, 2018

Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction

Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, Yoav Artzi

arXiv:1809.00786v233.01165 citations Has Code

Originality Highly original

AI Analysis

This work addresses instruction following for agents in navigation and household domains, presenting a novel decomposition approach with new benchmarks.

The paper tackles instruction following in 3D environments by decomposing it into visual goal prediction and action generation. LINGUNET, a language-conditioned image generation network, maps raw visual observations to goals, and the model is trained from demonstrations alone, without external resources. The paper also introduces two benchmarks, LANI (navigation) and CHAI (household instructions); the evaluation shows the advantages of the decomposition and the challenges these benchmarks pose.

We propose to decompose instruction execution into goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only, without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions. Our evaluation demonstrates the advantages of our model decomposition, and illustrates the challenges posed by our new benchmarks.
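The two-stage decomposition described above can be illustrated with a toy sketch. This is not the authors' implementation: the score grid stands in for whatever a language-conditioned network such as LINGUNET would produce, and the function names (`predict_goal`, `generate_actions`), the grid world, and the action names are all hypothetical choices made for illustration.

```python
import math


def predict_goal(scores):
    """Stage 1 (toy): turn a 2D grid of scores into a probability
    distribution with a softmax and return the most likely cell.

    `scores` is a hypothetical stand-in for the output of a
    language-conditioned goal predictor; it is not learned here.
    """
    flat = [s for row in scores for s in row]
    peak = max(flat)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in flat]
    total = sum(exps)
    probs = [e / total for e in exps]
    width = len(scores[0])
    best = max(range(len(probs)), key=probs.__getitem__)
    return (best // width, best % width), probs


def generate_actions(start, goal):
    """Stage 2 (toy): greedily move on a grid until the goal cell is
    reached, then emit STOP. Action names are assumptions."""
    row, col = start
    actions = []
    while (row, col) != goal:
        if row < goal[0]:
            row += 1
            actions.append("DOWN")
        elif row > goal[0]:
            row -= 1
            actions.append("UP")
        elif col < goal[1]:
            col += 1
            actions.append("RIGHT")
        else:
            col -= 1
            actions.append("LEFT")
    actions.append("STOP")
    return actions


# Hand-written scores: the cell at row 1, column 1 scores highest.
scores = [[0.1, 0.2, 0.0],
          [0.0, 2.5, 0.3]]
goal, probs = predict_goal(scores)
actions = generate_actions((0, 0), goal)
print(goal, actions)
```

The point of the sketch is the interface between the stages: the first stage only has to say where the goal is, and the second stage only has to reach it, so each can be inspected or replaced independently.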

