Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
This work addresses the challenge of continuously updating maps for autonomous navigation, though the contribution is incremental because it builds on existing cross-modal retrieval techniques.
The paper tackles the problem of inferring urban street map topology from ego-view images, a capability needed to keep maps for self-driving vehicles up to date. Using the Argoverse dataset, it demonstrates that the Pix2Map method can accurately retrieve street maps for both seen and unseen roads from image data alone.
Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs.
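The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes that an image encoder and a graph encoder (neither shown here) have already mapped their inputs into a shared embedding space. The toy vectors, the graph identifiers, the function names, and the choice of cosine similarity as the ranking score are all illustrative assumptions and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_graphs(image_embedding, graph_library, top_k=1):
    """Rank stored street-map graphs by similarity to an image embedding.

    graph_library maps a (hypothetical) graph identifier to the embedding
    that a graph encoder would have produced for that map.
    """
    scored = [
        (cosine_similarity(image_embedding, emb), graph_id)
        for graph_id, emb in graph_library.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(graph_id, score) for score, graph_id in scored[:top_k]]


# Toy embeddings standing in for encoder outputs (assumed, not real data).
graph_library = {
    "straight_road": [1.0, 0.0, 0.0],
    "t_junction": [0.0, 1.0, 0.0],
    "four_way_intersection": [0.0, 0.0, 1.0],
}
query = [0.1, 0.9, 0.2]  # embedding of an ego-view image (assumed)

results = retrieve_graphs(query, graph_library, top_k=2)
for graph_id, score in results:
    print(f"{graph_id}: {score:.3f}")
```

In this sketch, the map graph whose embedding lies closest to the image embedding is returned first. Retrieving from a library of existing graphs avoids generating a graph from scratch, which is the reformulation the abstract presents as its main insight.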