Transformer-based Localization from Embodied Dialog with Large-scale Pre-training
This work addresses the problem of agent localization in unknown environments for embodied AI and represents an incremental improvement over prior methods.
The paper tackles the task of Localization via Embodied Dialog (LED) by developing a novel LED-Bert architecture with a graph-based scene representation, which outperforms previous baselines.
We address the challenging task of Localization via Embodied Dialog (LED). Given a dialog between two agents, an Observer navigating through an unknown environment and a Locator who is attempting to identify the Observer's location, the goal is to predict the Observer's final location on a map. We develop a novel LED-Bert architecture and present an effective pretraining strategy. We show that a graph-based scene representation is more effective than the top-down 2D maps used in prior works. Our approach outperforms previous baselines.
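To make the task setup concrete, the following toy sketch frames LED as choosing one node from a discrete set of candidate viewpoints, which is the general shape a graph-based scene representation gives the problem. It is not LED-Bert: the bag-of-words scorer stands in for a learned model, graph edges are omitted, and the node names, node descriptions, and dialog are all hypothetical values invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, strip basic punctuation, and count tokens."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(t for t in tokens if t)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def localize(dialog, node_descriptions):
    """Score every candidate node against the dialog and return the best one.

    A learned model would replace this similarity; the interface
    (dialog in, one node out) is the part being illustrated.
    """
    query = bag_of_words(" ".join(dialog))
    scores = {
        node: cosine(query, bag_of_words(desc))
        for node, desc in node_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical scene: each node is a viewpoint with a short text description.
NODES = {
    "kitchen_1": "kitchen with a white fridge and a marble counter",
    "hall_2": "narrow hallway with a red rug and a wooden door",
    "bedroom_3": "bedroom with a large bed and a blue lamp",
}

# Hypothetical Locator/Observer exchange.
DIALOG = [
    "Locator: What do you see around you?",
    "Observer: I am standing on a red rug next to a wooden door.",
]

predicted, scores = localize(DIALOG, NODES)
print(predicted)
```

With a top-down 2D map, the output would instead be a pixel or coordinate on the map image; predicting over nodes restricts the answer to places the Observer could actually stand.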