Cross-view Self-localization from Synthesized Scene-graphs
This work addresses visual place recognition from sparse database viewpoints for robotics or AR applications. It is an incremental improvement that combines existing techniques to mitigate the quality and storage problems of synthesized database images.
The paper tackles cross-view self-localization by proposing a hybrid scene model that fuses view-invariant appearance features from raw images with view-dependent spatial-semantic features from synthesized images, achieving improved performance on a novel dataset generated with the photorealistic Habitat simulator.
Cross-view self-localization is a challenging scenario of visual place recognition in which database images are available only from sparse viewpoints. Recently, an approach that synthesizes database images from unseen viewpoints using neural radiance fields (NeRF) has emerged with impressive performance. However, the synthesized images are often of lower quality than the original images, and they significantly increase the storage cost of the database. In this study, we explore a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images. The two feature types are fused into scene graphs, which are then compactly learned and recognized by a graph neural network. We verified the effectiveness of the proposed method on a novel cross-view self-localization dataset with many unseen views, generated using the photorealistic Habitat simulator.
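To make the fusion step more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each scene-graph node carries one appearance vector and one spatial-semantic vector, fuses them by simple concatenation, and replaces the learned graph neural network with a single round of unweighted mean aggregation followed by mean pooling. All names (`fuse`, `message_pass`, `describe`, `localize`), the toy feature values, and the place labels are hypothetical.

```python
import math

def fuse(appearance, spatial_semantic):
    """Concatenate the two feature types into one node feature (assumed fusion)."""
    return list(appearance) + list(spatial_semantic)

def message_pass(nodes, edges):
    """One round of mean aggregation over each node and its neighbors.

    This stands in for a learned GNN layer; it has no trainable weights.
    """
    neighbors = {i: [i] for i in range(len(nodes))}  # self-loop keeps own feature
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dim = len(nodes[0])
    return [
        [sum(nodes[j][k] for j in neighbors[i]) / len(neighbors[i]) for k in range(dim)]
        for i in range(len(nodes))
    ]

def readout(nodes):
    """Mean-pool node features into a single graph-level descriptor."""
    dim = len(nodes[0])
    return [sum(n[k] for n in nodes) / len(nodes) for k in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def describe(graph):
    """Turn a scene graph into one descriptor: fuse, propagate, pool."""
    nodes = [fuse(app, sem) for app, sem in graph["nodes"]]
    return readout(message_pass(nodes, graph["edges"]))

def localize(query, database):
    """Return the database place whose descriptor is most similar to the query."""
    q = describe(query)
    return max(database, key=lambda place: cosine(q, describe(database[place])))

# Toy data: each node is (appearance features, spatial-semantic features).
database = {
    "kitchen": {"nodes": [([1.0, 0.0], [0.9, 0.1]), ([0.8, 0.2], [0.7, 0.3])],
                "edges": [(0, 1)]},
    "bedroom": {"nodes": [([0.0, 1.0], [0.1, 0.9]), ([0.2, 0.8], [0.3, 0.7])],
                "edges": [(0, 1)]},
}
query = {"nodes": [([0.9, 0.1], [0.8, 0.2]), ([0.7, 0.3], [0.6, 0.4])],
         "edges": [(0, 1)]}

print(localize(query, database))
```

In this toy setup the query is matched to the database entry with the highest cosine similarity. A real system would learn the aggregation weights and would extract the node features from raw and synthesized images, neither of which is shown here.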