RO, May 16

LACE: Latent Visual Representation for Cross-Embodiment Learning

Yoo Sung Jang, Kanchana Ranasinghe, Cristina Mata, Yichi Zhang, Jorge Mendez-Mendez, Michael S. Ryoo

arXiv:2605.16743

AI Analysis

For robot learning, LACE addresses the visual gap between human and robot embodiments, allowing effective cross-embodiment learning with minimal robot data.

LACE aligns human and robot visual representations in the latent space of SSL backbones using sparse supervision from shared body parts, enabling robot policies to leverage human demonstration data. In zero-shot transfer, policies using LACE-DINO outperform those using DINO by 65%.

Cross-embodiment learning from human demonstrations is hindered by the visual gap between human and robot embodiments. While self-supervised learning (SSL) backbones encode rich inter-class semantics of general objects, we show they fail to establish correspondence between human and robot hands. We propose LACE, a framework that aligns human and robot visual representations in the latent space of these backbones by leveraging correspondences between shared body parts across embodiments as sparse supervision. These annotations can be obtained automatically via forward kinematics, and a single robot demonstration is sufficient to train the model. Our semantic alignment loss matches the distributions induced by corresponding features, lifting patch-level supervision to semantic-level alignment, while a Gram loss preserves pretrained feature quality. This alignment enables robot policies to leverage abundant human data when robot demonstrations are scarce: in zero-shot transfer, policies using LACE-DINO outperform those using DINO by a large margin (65%), with consistent gains in low-data regimes and out-of-distribution environments.
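
The abstract names the two loss terms but gives no formulas, so the following is only a rough sketch of how such terms might look, not the authors' implementation. It assumes that the "distribution" of a feature is a softmax over its cosine similarities to a shared set of anchor features, that the alignment term is a KL divergence between the human and robot distributions, and that the Gram term is a mean squared difference between patch-similarity matrices of the tuned and frozen backbones. All function names, the anchor set, and the temperature are hypothetical. Plain Python lists are used for readability; a real training loop would use a tensor library.

```python
import math

def l2_normalize(v):
    # Scale a feature vector to unit length.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature):
    # Numerically stable softmax with a temperature.
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def similarity_distribution(feature, anchors, temperature=0.1):
    # ASSUMPTION: the distribution induced by a feature is a softmax over
    # its cosine similarities to a shared set of anchor features.
    f = l2_normalize(feature)
    sims = [dot(f, l2_normalize(a)) for a in anchors]
    return softmax(sims, temperature)

def alignment_loss(human_feat, robot_feat, anchors, temperature=0.1):
    # ASSUMPTION: KL(p_human || p_robot) for one corresponding patch pair,
    # e.g. a human fingertip patch and the matching gripper patch.
    p = similarity_distribution(human_feat, anchors, temperature)
    q = similarity_distribution(robot_feat, anchors, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def gram_matrix(patches):
    # Pairwise cosine similarities between all patch features of one image.
    normed = [l2_normalize(p) for p in patches]
    return [[dot(a, b) for b in normed] for a in normed]

def gram_loss(tuned_patches, frozen_patches):
    # ASSUMPTION: mean squared difference between the Gram matrices of the
    # tuned backbone and the frozen pretrained backbone on the same image.
    g_t = gram_matrix(tuned_patches)
    g_f = gram_matrix(frozen_patches)
    n = len(g_t)
    total = sum((g_t[i][j] - g_f[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)
```

Under these assumptions, the alignment term is zero when the human and robot features relate to the anchors identically, and the Gram term is zero whenever the tuned backbone keeps the pretrained patch-to-patch similarity structure, even if individual features are rescaled.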
