Tracking and Planning with Spatial World Models
The work addresses navigation for robotics and autonomous systems, but the contribution is incremental: it adapts existing techniques, such as differentiable rendering and pose estimation, to a new setting.
The paper tackles real-time vision-based navigation by planning in a learned 3D spatial world model with differentiable rendering, achieving up to 92% navigation success rate at 15 Hz in simulated environments.
We introduce a method for real-time navigation and tracking with differentiably rendered world models. Learning models for control has led to impressive results in robotics and computer games, but this success has yet to be extended to vision-based navigation. To address this, we transfer advances in the emergent field of differentiable rendering to model-based control. We do this by planning in a learned 3D spatial world model, combined with a pose estimation algorithm previously used in the context of TSDF fusion, but now tailored to our setting and improved to incorporate agent dynamics. We evaluate over six simulated environments based on complex human-designed floor plans and provide quantitative results. We achieve up to 92% navigation success rate at a frequency of 15 Hz using only image and depth observations under stochastic, continuous dynamics.