Embedding Morphology into Transformers for Cross-Robot Policy Learning
Aimed at robot learning researchers, this work seeks to improve the robustness and performance of robot policies across different embodiments; its contribution is an incremental improvement.
This paper addresses cross-robot policy learning by proposing an embodiment-aware transformer policy. The policy integrates morphology through three mechanisms: kinematic tokens, a topology-aware attention bias, and joint-attribute conditioning. Across a range of robot embodiments, it consistently improves performance over a vanilla pi0.5 VLA baseline.
Cross-robot policy learning -- training a single policy to perform well across multiple embodiments -- remains a central challenge in robot learning. Transformer-based policies, such as vision-language-action (VLA) models, are typically embodiment-agnostic and must infer kinematic structure purely from observations, which can reduce robustness across embodiments and even limit performance within a single embodiment. We propose an embodiment-aware transformer policy that injects morphology via three mechanisms: (1) kinematic tokens that factorize actions across joints and compress time through per-joint temporal chunking; (2) a topology-aware attention bias that encodes kinematic topology as an inductive bias in self-attention, encouraging message passing along kinematic edges; and (3) joint-attribute conditioning that augments topology with per-joint descriptors to capture semantics beyond connectivity. Across a range of embodiments, this structured integration consistently improves performance over a vanilla pi0.5 VLA baseline, indicating improved robustness both within an embodiment and across embodiments.
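The abstract names the mechanisms but does not give their functional form, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that (a) the topology-aware bias is an additive penalty on attention logits proportional to hop distance in the kinematic graph, and (b) per-joint temporal chunking regroups a (time, joint) action sequence into fixed-length chunks per joint. The function names `hop_distances`, `topology_bias`, `biased_attention_weights`, and `chunk_per_joint`, as well as the `slope` and `chunk` parameters, are hypothetical and chosen for illustration.

```python
import math
from collections import deque


def hop_distances(num_joints, edges):
    """All-pairs hop counts on an undirected kinematic graph, via BFS."""
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[math.inf] * num_joints for _ in range(num_joints)]
    for src in range(num_joints):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] == math.inf:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def topology_bias(dist, slope=1.0):
    """Assumed bias form: 0 on the diagonal, more negative with hop count.

    Joints with no connecting path get -inf, which masks them out entirely.
    """
    if slope <= 0:
        raise ValueError("slope must be positive")
    return [[-slope * d for d in row] for row in dist]


def biased_attention_weights(scores, bias):
    """Row-wise softmax of (scores + bias) over joint tokens."""
    weights = []
    for score_row, bias_row in zip(scores, bias):
        logits = [s + b for s, b in zip(score_row, bias_row)]
        peak = max(logits)  # finite, because the diagonal bias is 0
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def chunk_per_joint(actions, chunk):
    """Regroup a (T, J) action sequence into per-joint temporal chunks.

    result[j][c] holds `chunk` consecutive values of joint j, so each
    (joint, chunk) pair can be embedded as one token.
    """
    steps, joints = len(actions), len(actions[0])
    if steps % chunk != 0:
        raise ValueError("sequence length must be divisible by chunk")
    return [
        [[actions[t][j] for t in range(c * chunk, (c + 1) * chunk)]
         for c in range(steps // chunk)]
        for j in range(joints)
    ]
```

For a four-joint serial chain with edges `[(0, 1), (1, 2), (2, 3)]` and uniform raw scores, the weights from `biased_attention_weights` decrease monotonically with distance along the chain, which is one way to encourage message passing along kinematic edges. Joint-attribute conditioning is not sketched here because the abstract does not say which per-joint descriptors are used.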