RO · CV · Feb 3

MVP-LAM: Learning Action-Centric Latent Action via Cross-Viewpoint Reconstruction

arXiv:2602.03668v1 · 3 citations · h-index: 5
Originality: Incremental advance
AI Analysis

MVP-LAM addresses the challenge of scaling robot learning by making vision-language-action (VLA) model pretraining more effective. The advance is incremental, since it builds on existing latent action methods.

The paper addresses learning latent actions from human videos for robot learning. It proposes MVP-LAM, which uses cross-viewpoint reconstruction to make latent actions more action-centric. The reported results are higher mutual information with ground-truth actions and better downstream manipulation performance on benchmarks such as SIMPLER and LIBERO-Long.
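To make the "mutual information with ground-truth actions" criterion concrete, here is a minimal sketch of a plug-in mutual information estimate between discrete latent action codes and discretized ground-truth actions. It is an illustration only: the paper's actual estimator is not described on this page, and the function name `mutual_information` and the assumption that actions have already been binned into integer labels are hypothetical.

```python
import numpy as np

def mutual_information(codes, actions):
    """Plug-in estimate of I(codes; actions) in nats.

    Both inputs are 1-D arrays of integer labels (assumption: continuous
    robot actions were discretized into bins beforehand).
    """
    codes = np.asarray(codes).ravel()
    actions = np.asarray(actions).ravel()
    # Re-index labels to 0..K-1 so they can address a count table.
    _, ci = np.unique(codes, return_inverse=True)
    _, ai = np.unique(actions, return_inverse=True)
    joint = np.zeros((ci.max() + 1, ai.max() + 1))
    np.add.at(joint, (ci, ai), 1.0)          # co-occurrence counts
    joint /= joint.sum()                      # empirical joint distribution
    p_codes = joint.sum(axis=1, keepdims=True)
    p_actions = joint.sum(axis=0, keepdims=True)
    indep = p_codes @ p_actions               # product of marginals
    nz = joint > 0                            # skip empty cells (0 * log 0 = 0)
    return float((joint[nz] * np.log(joint[nz] / indep[nz])).sum())
```

A higher value means the latent codes carry more information about the action labels. The plug-in estimate is biased upward on small samples, so it is best used to compare models evaluated on the same data.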

Learning latent actions from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent's actions despite the absence of ground-truth labels. We propose Multi-ViewPoint Latent Action Model (MVP-LAM), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a cross-viewpoint reconstruction objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
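The cross-viewpoint reconstruction idea can be sketched in a few lines. The following is a toy illustration under strong assumptions, not the authors' implementation: random linear maps stand in for the encoder and decoder networks, observations are small feature vectors rather than images, the quantizer is a plain nearest-neighbour codebook lookup, and the loss is symmetrized over the two views. All names and sizes (`W_enc`, `W_dec`, `codebook`, `cross_view_loss`, `D_OBS`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_LAT, N_CODES = 8, 4, 16  # hypothetical toy sizes

# Hypothetical stand-ins for learned networks and the discrete codebook.
W_enc = rng.normal(size=(2 * D_OBS, D_LAT))
W_dec = rng.normal(size=(D_OBS + D_LAT, D_OBS))
codebook = rng.normal(size=(N_CODES, D_LAT))

def encode_latent_action(obs_t, obs_next):
    """Infer a continuous latent action from a frame pair of ONE view."""
    return np.concatenate([obs_t, obs_next], axis=-1) @ W_enc

def quantize(z):
    """Snap each latent to its nearest codebook entry (discrete latent action)."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def decode_next(obs_t, z_q):
    """Predict the next observation of a view from its current frame and a latent action."""
    return np.concatenate([obs_t, z_q], axis=-1) @ W_dec

def cross_view_loss(a_t, a_next, b_t, b_next):
    """Latent action inferred from view A must explain view B's future, and vice versa."""
    z_a, _ = quantize(encode_latent_action(a_t, a_next))
    z_b, _ = quantize(encode_latent_action(b_t, b_next))
    loss_a_to_b = np.mean((decode_next(b_t, z_a) - b_next) ** 2)
    loss_b_to_a = np.mean((decode_next(a_t, z_b) - a_next) ** 2)
    return 0.5 * (loss_a_to_b + loss_b_to_a)
```

Because the decoder for view B only receives view B's current frame plus a latent computed from view A, any viewpoint-specific detail stored in the latent does not help reconstruction. This is the intuition behind the reduced reliance on viewpoint-specific cues described above. A trainable version would also need a gradient path through the quantizer, which this sketch omits.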

