CVSep 28, 2023

End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon

Guillaume Bono, Leonid Antsfeld, Boris Chidlovskii, Philippe Weinzaepfel, Christian Wolf

arXiv:2309.16634v112.620 citationsh-index: 35

Originality Highly original

AI Analysis

This work solves the problem of image-based navigation in unseen environments for robotics and AI systems, representing an incremental advance by improving perception modules through novel training strategies.

The paper tackles the challenge of goal-oriented visual navigation when the goal is given as an exemplar image, by addressing the underlying visual correspondence problem through pretext tasks for relative pose estimation and visibility prediction. It achieves state-of-the-art performance on ImageNav and Instance-ImageNav benchmarks, with significant improvements in navigation success rates.

Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy requiring to solve an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottleneck in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.

View on arXiv PDF

Similar