Visual Pre-training for Navigation: What Can We Learn from Noise?
This work addresses the data inefficiency of end-to-end visual navigation for robotic agents, presenting an incremental improvement over existing methods.
The paper tackles the data inefficiency of end-to-end visual navigation systems by proposing a self-supervised method that learns representations from synthetic noise images. These representations transfer to natural images and enable efficient policy learning with minimal interaction data.
One powerful paradigm in visual navigation is to predict actions directly from observations. Training such an end-to-end system allows representations useful for downstream tasks to emerge automatically. However, the lack of inductive bias makes such a system data-inefficient. We hypothesize that a representation of the current view and the goal view sufficient for a navigation policy can be learned by predicting the location and size of a crop of the current view that corresponds to the goal. We further show that training such random crop prediction in a self-supervised fashion purely on synthetic noise images transfers well to natural home images. The learned representation can then be bootstrapped to learn a navigation policy efficiently with little interaction data. The code is available at https://yanweiw.github.io/noise2ptz
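To make the random crop prediction task concrete, the following is a minimal sketch of how one self-supervised training example might be generated. It is not the authors' implementation: the function name `sample_crop_example`, the use of uniform RGB noise, the square crop shape, the nearest-neighbour resize, and all parameter values are assumptions chosen for illustration. The regression target encodes the crop's location and size, which is what the abstract says the model is trained to predict.

```python
import numpy as np

def sample_crop_example(rng, size=64, out=32, min_frac=0.3):
    """Return (current_view, goal_view, target) for one training example.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    # Assumption: uniform RGB noise stands in for the synthetic noise images.
    current = rng.random((size, size, 3), dtype=np.float32)

    # Sample a square crop (side length and top-left corner) that fits in the image.
    side = int(rng.integers(int(min_frac * size), size + 1))
    top = int(rng.integers(0, size - side + 1))
    left = int(rng.integers(0, size - side + 1))
    crop = current[top:top + side, left:left + side]

    # Nearest-neighbour resize of the crop to a fixed goal-view resolution.
    idx = np.arange(out) * side // out
    goal = crop[idx][:, idx]

    # Regression target: crop centre (x, y) and side length, normalised by image size.
    target = np.array(
        [(left + side / 2) / size, (top + side / 2) / size, side / size],
        dtype=np.float32,
    )
    return current, goal, target

rng = np.random.default_rng(0)
current, goal, target = sample_crop_example(rng)
print(current.shape, goal.shape, target.shape)
```

A network trained on such triples would take `current` and `goal` as input and regress `target`. Because the labels come for free from the sampling procedure, no human annotation or environment interaction is needed at this stage.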