RO | CV | Mar 12, 2024

VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training

Mohammad Nazeri, Junzhe Wang, Amirreza Payandeh, Xuesu Xiao

arXiv:2403.08109v34.112 citations | h-index: 8 | Has Code | IROS

Originality: Incremental advance

AI Analysis

This work addresses the challenge of efficient and effective robot navigation by reducing reliance on large labeled datasets and heavy computation, though its contribution is incremental: it applies existing self-supervised learning techniques to this domain.

The paper tackles the problem of robotic visual navigation by proposing a self-supervised vision-action pre-training model (VANP) that learns to focus on visual regions relevant to navigation, achieving comparable performance to end-to-end models with half the training time and to ImageNet-trained models using only 0.08% of the data.

Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects -- not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the information shared between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match human navigation intuition. VANP achieves performance comparable to that of models learned end-to-end, with half the training time, and to that of models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% of the data.
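To make the idea of maximizing mutual information between paired embeddings more concrete, here is a minimal sketch in plain Python. The abstract does not say which objective VANP uses, so this sketch substitutes InfoNCE, one common contrastive objective whose minimization raises a lower bound on mutual information; it should not be read as the authors' loss. The names `info_nce`, `l2_normalize`, and `temperature`, and the toy two-dimensional embeddings, are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(vision_emb, action_emb, temperature=0.1):
    """InfoNCE loss for a batch of paired embeddings (illustrative stand-in).

    vision_emb[i] and action_emb[i] are treated as a positive pair; every
    other action embedding in the batch acts as a negative for sample i.
    For a batch of N pairs, log(N) minus this loss is a lower bound on the
    mutual information between the two embedding sets.
    """
    if not vision_emb or len(vision_emb) != len(action_emb):
        raise ValueError("need two non-empty batches of equal size")
    v = [l2_normalize(x) for x in vision_emb]
    a = [l2_normalize(x) for x in action_emb]
    n = len(v)
    total = 0.0
    for i in range(n):
        # Cosine similarity of sample i against every action embedding.
        logits = [
            sum(p * q for p, q in zip(v[i], a[j])) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching pair (index i) as the target.
        total += log_denom - logits[i]
    return total / n


# Toy embeddings (hypothetical): aligned pairs should score a lower loss
# than the same embeddings paired with the wrong partner.
vision = [[1.0, 0.0], [0.0, 1.0]]
aligned_actions = [[1.0, 0.0], [0.0, 1.0]]
shuffled_actions = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(vision, aligned_actions)
loss_shuffled = info_nce(vision, shuffled_actions)
```

In this toy setup `loss_aligned` is close to zero while `loss_shuffled` is large, which is the behavior a pre-training objective of this kind relies on: embeddings of observations and of the actions that actually followed them are pulled together, and mismatched pairs are pushed apart.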

