RO · Apr 23

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

arXiv:2604.21914 · 45.1
Predicted impact: top 5% in RO · last 90 days
Originality: Incremental advance
AI Analysis

VistaBot makes robot manipulation policies robust to camera viewpoint changes without requiring camera calibration at test time, which makes deployment more practical.

VistaBot integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop robot manipulation, improving the View Generalization Score (VGS) by 2.79× and 2.63× over the ACT and π0 policies, respectively.

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($\pi_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $\pi_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.

