RO · CV · Oct 2, 2025

Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning

arXiv:2510.02268v1 · 3 citations · h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses robust robot manipulation under varying camera viewpoints, a requirement for real-world deployment. The contribution is incremental because it builds on existing policy architectures.

The paper tackles view-invariant imitation learning by conditioning policies on camera extrinsics. This conditioning significantly improves generalization across viewpoints for standard behavior cloning policies such as ACT, Diffusion Policy, and SmolVLA. It also restores performance on tasks where policies without extrinsics fail because they rely on static background cues.

We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at https://ripl.github.io/know_your_camera/ .
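To make the idea of a per-pixel ray embedding concrete, here is a minimal NumPy sketch of one common way to build a Plücker ray map from camera parameters. It is not taken from the paper's code release. The function name `plucker_ray_map`, the (direction, moment) channel order, the half-pixel offset, and the camera-to-world pose convention are all assumptions made for illustration.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plucker coordinates (d, o x d), one ray per pixel.

    Assumed conventions (not confirmed by the source): K is a 3x3 pinhole
    intrinsics matrix, cam_to_world is a 4x4 camera-to-world pose, and rays
    pass through pixel centers.
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    R, o = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment m = o x d, where o is the camera center in world coordinates.
    moment = np.cross(np.broadcast_to(o, dirs_world.shape), dirs_world)
    return np.concatenate([dirs_world, moment], axis=-1)
```

A map like this has the same spatial size as the image, so one plausible way to use it is to concatenate its six channels with the RGB channels before the policy's visual encoder. How the paper actually injects the embedding into ACT, Diffusion Policy, or SmolVLA is not stated in the abstract.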

