RO · CV · LG · Jun 2

PointAction: 3D Points as Universal Action Representations for Robot Control

arXiv:2606.03943 · 95.9
Predicted impact: top 5% in RO · last 90 days · Originality: Incremental advance
AI Analysis

For robot manipulation, PointAction reduces action grounding ambiguity from RGB-only video by using metric 3D point dynamics as an embodiment-agnostic interface, enabling transfer across tasks and embodiments with limited action supervision.

PointAction bridges video predictions to robot actions via explicit 3D point dynamics, achieving state-of-the-art 4D generation quality and outperforming baselines in simulation while generalizing to unseen real robot arms.

Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
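To make the idea of metric 3D point dynamics concrete, the following is a minimal sketch and not the paper's implementation. It assumes a pinhole camera with known intrinsics and a set of points already tracked across two frames; the helper names `unproject`, `displacements`, and `mean_translation` are hypothetical. PointAction itself predicts pointmaps with a fine-tuned video model and maps them to actions with a learned diffusion decoder; the mean translation here is only a hand-written stand-in to show what kind of signal such point motion carries.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # metric XYZ in the camera frame (metres)


def unproject(u: float, v: float, depth_m: float,
              fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth to a 3D point using a pinhole model.

    Assumption: intrinsics (fx, fy, cx, cy) are known; the paper does not
    state how its pointmaps are parameterised.
    """
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


def displacements(frame_a: Sequence[Point],
                  frame_b: Sequence[Point]) -> List[Point]:
    """Per-point 3D motion between two frames.

    Assumption: index i refers to the same physical point in both frames.
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must contain the same tracked points")
    return [(bx - ax, by - ay, bz - az)
            for (ax, ay, az), (bx, by, bz) in zip(frame_a, frame_b)]


def mean_translation(deltas: Sequence[Point]) -> Point:
    """Average displacement: a crude, embodiment-independent motion summary."""
    if not deltas:
        raise ValueError("need at least one displacement")
    n = len(deltas)
    return (sum(d[0] for d in deltas) / n,
            sum(d[1] for d in deltas) / n,
            sum(d[2] for d in deltas) / n)


if __name__ == "__main__":
    # Hypothetical intrinsics and two tracked pixels on an object.
    fx = fy = 500.0
    cx, cy = 320.0, 240.0
    frame0 = [unproject(320, 240, 1.0, fx, fy, cx, cy),
              unproject(420, 240, 1.0, fx, fy, cx, cy)]
    # Same points one step later: shifted right in the image and closer.
    frame1 = [unproject(345, 240, 0.9, fx, fy, cx, cy),
              unproject(445, 240, 0.9, fx, fy, cx, cy)]
    print(mean_translation(displacements(frame0, frame1)))
```

Because the displacements are expressed in metres rather than pixels, the same motion description applies regardless of which robot arm executes it, which is the property the abstract relies on when it calls the interface embodiment-agnostic.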
