RO · CV · May 13

SCAR: Self-Supervised Continuous Action Representation Learning

arXiv:2605.16412 · 88.3
Predicted impact: top 11% in RO · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the challenge of learning transferable action representations for world models that generalize across embodiments under limited data, offering a potential interface for more generalizable embodied AI.

SCAR learns unified action representations across embodiments from visual transitions using a joint inverse-forward dynamics framework with adversarial invariance, achieving improved cross-embodiment low-data adaptation and cross-task transfer on Procgen and Robotwin datasets.

Despite the central role of action in embodied intelligence, learning transferable action representations from visual transitions remains a fundamental challenge, particularly when world models must generalize across embodiments under limited data. We argue that action is not merely an auxiliary conditioning signal, but a distinct representational factor that decouples controllable change from embodiment-specific actuation. In this work, we propose SCAR, a joint inverse-forward dynamics framework for learning unified action representations across embodiments from visual transitions. Built on a pretrained generative backbone, SCAR uses an inverse dynamics model (IDM) to infer latent actions from latent observation pairs and a forward dynamics model (FDM) to predict future dynamics conditioned on them. To make the latent space transferable rather than a generic visual bottleneck, we regularize the latent action posterior toward a standard Gaussian prior to limit arbitrary visual encoding, and introduce adversarial invariance to suppress embodiment- and environment-specific nuisance factors. Experiments on the Procgen and Robotwin datasets show that the learned unified latent action representation serves as a stronger conditioning interface for world modeling than embodiment-specific raw actions, yielding improved cross-embodiment low-data adaptation and cross-task transfer. Taken together, these results suggest that action can be learned as a shared representation of controllable change across embodiments, providing an interface for more transferable and generalizable world models.
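The abstract names three ingredients: an IDM/FDM pair, a pull of the latent action posterior toward a standard Gaussian, and an adversarial invariance term. The NumPy sketch below shows one way such an encoder-side objective could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the diagonal-Gaussian posterior, the squared-error forward loss, the embodiment classifier used as the adversary, the weights `beta` and `lam`, and every function name are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), one value per batch row."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]


def encoder_objective(z_t, z_next, embodiment, idm, fdm, critic, rng,
                      beta=1e-3, lam=0.1):
    """Assumed encoder-side loss: forward error + beta * KL - lam * adversary loss.

    idm(z_t, z_next) -> (mu, log_var) of the latent action posterior
    fdm(z_t, action) -> predicted next latent observation
    critic(action)   -> embodiment logits (the adversary, hypothetical)
    """
    mu, log_var = idm(z_t, z_next)
    # Reparameterized sample from the latent action posterior.
    action = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    pred = fdm(z_t, action)
    recon = np.mean(np.sum((pred - z_next) ** 2, axis=-1))
    kl = np.mean(kl_to_standard_normal(mu, log_var))
    # The encoder is rewarded when the adversary cannot recover the embodiment.
    adv = np.mean(softmax_cross_entropy(critic(action), embodiment))
    total = recon + beta * kl - lam * adv
    return {"total": total, "recon": recon, "kl": kl, "adv": adv}


# Toy usage with random linear maps standing in for the real networks.
rng = np.random.default_rng(0)
latent_dim, action_dim, n_embodiments, batch = 8, 4, 2, 16
w_idm = rng.standard_normal((latent_dim, action_dim)) * 0.1
w_fdm = rng.standard_normal((action_dim, latent_dim)) * 0.1
w_critic = rng.standard_normal((action_dim, n_embodiments)) * 0.1

z_t = rng.standard_normal((batch, latent_dim))
z_next = z_t + 0.1 * rng.standard_normal((batch, latent_dim))
embodiment = rng.integers(0, n_embodiments, size=batch)

losses = encoder_objective(
    z_t, z_next, embodiment,
    idm=lambda a, b: ((b - a) @ w_idm, np.zeros((len(a), action_dim))),
    fdm=lambda z, act: z + act @ w_fdm,
    critic=lambda act: act @ w_critic,
    rng=rng,
)
print({name: float(value) for name, value in losses.items()})
```

In a real training loop the adversary would be updated separately to minimize its own cross-entropy, while the encoder receives the negated term (or a gradient-reversal layer); this sketch only evaluates the encoder-side scalar.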
