ENTL: Embodied Navigation Trajectory Learner
This work addresses the challenge of learning efficient, generalizable representations for embodied AI tasks. It appears incremental, however, because it builds on existing sequence prediction and pre-training techniques.
The paper tackles embodied navigation by proposing ENTL, a method that unifies world modeling, localization, and imitation learning into a single sequence prediction task. It achieves competitive performance on navigation tasks with significantly less data than baselines.
We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long sequence representations for embodied navigation. Our approach unifies world modeling, localization and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables sharing of the spatio-temporal sequence encoder for multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines while performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments.
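To make the training signal concrete, the following is a minimal toy sketch of how vector-quantized future-state targets could be built from a trajectory. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the quantizer, the codebook, or the sequence encoder, so the fixed `codebook`, the nearest-neighbor `quantize` function, and the `build_training_pairs` helper are all hypothetical names introduced here. In this sketch each frame feature is mapped to the index of its nearest codebook vector, and each (current state code, action) pair is matched with the code of the next frame as a classification target.

```python
from typing import List, Sequence, Tuple


def quantize(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector nearest to `feature` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(feature, codebook[k]))


def build_training_pairs(
    frames: Sequence[Sequence[float]],
    actions: Sequence[str],
    codebook: Sequence[Sequence[float]],
) -> List[Tuple[Tuple[int, str], int]]:
    """Pair each (current state code, action) context with the next frame's code.

    Assumes actions[t] is the action taken between frames[t] and frames[t + 1].
    """
    if len(actions) < len(frames) - 1:
        raise ValueError("need one action per transition between frames")
    codes = [quantize(f, codebook) for f in frames]
    return [((codes[t], actions[t]), codes[t + 1]) for t in range(len(codes) - 1)]


# Hypothetical toy data: 2-D "frame features" and a 3-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 1.2), (0.1, 0.8)]
actions = ["forward", "turn_left"]

pairs = build_training_pairs(frames, actions, codebook)
print(pairs)
```

Because the targets are discrete code indices, a sequence model trained on such pairs can use an ordinary classification loss over the codebook, and no reward signal is needed. In a real system the codebook would be learned and the features would come from an image encoder, both of which are outside what the abstract describes.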