RO | AI | CV | Apr 14, 2025

Toward Aligning Human and Robot Actions via Multi-Modal Demonstration Learning

Azizul Zahid, Jie Fan, Farong Wang, Ashton Dy, Sai Swaminathan, Fei Liu

arXiv:2504.11493v1 | h-index: 2 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses action alignment in human-robot collaboration, particularly for manipulation tasks. It is an incremental advance, building on existing multimodal and transformer-based methods.

The paper proposes a multimodal demonstration learning framework for aligning human and robot actions, aimed at better collaboration and imitation learning. On a pick-and-place task, it reports 71.67% accuracy for human intention modeling and 71.8% for robot action prediction.

Understanding action correspondence between humans and robots is essential for evaluating alignment in decision-making, particularly in human-robot collaboration and imitation learning within unstructured environments. We propose a multimodal demonstration learning framework that explicitly models human demonstrations from RGB video with robot demonstrations in voxelized RGB-D space. Focusing on the "pick and place" task from the RH20T dataset, we utilize data from 5 users across 10 diverse scenes. Our approach combines ResNet-based visual encoding for human intention modeling and a Perceiver Transformer for voxel-based robot action prediction. After 2000 training epochs, the human model reaches 71.67% accuracy, and the robot model achieves 71.8% accuracy, demonstrating the framework's potential for aligning complex, multimodal human and robot behaviors in manipulation tasks.
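
The "voxelized RGB-D space" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the RGB-D frame has already been converted to a colored point cloud, and it averages the color of all points that fall into each cell of a cubic grid. The function name `voxelize_rgbd`, the bounds, and the grid size are hypothetical choices for illustration; the paper's actual preprocessing may differ.

```python
from collections import defaultdict


def voxelize_rgbd(points, colors, bounds_min, bounds_max, grid_size):
    """Map colored 3D points to a sparse voxel grid (hypothetical helper).

    Returns a dict from (i, j, k) voxel index to the mean RGB of the
    points inside that voxel. Voxels that contain no points are omitted.
    """
    # Per-voxel accumulator: [sum_r, sum_g, sum_b, point_count]
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])

    for point, color in zip(points, colors):
        # Skip points outside the workspace bounds.
        if any(not (lo <= p < hi)
               for p, lo, hi in zip(point, bounds_min, bounds_max)):
            continue

        # Scale each coordinate to [0, grid_size) and truncate to an index.
        index = tuple(
            min(int((p - lo) / (hi - lo) * grid_size), grid_size - 1)
            for p, lo, hi in zip(point, bounds_min, bounds_max)
        )

        acc = sums[index]
        for channel in range(3):
            acc[channel] += color[channel]
        acc[3] += 1

    # Convert accumulated sums to mean colors.
    return {
        index: tuple(acc[channel] / acc[3] for channel in range(3))
        for index, acc in sums.items()
    }


# Toy example: four points in a unit cube, split into a 2 x 2 x 2 grid.
points = [(0.1, 0.1, 0.1), (0.15, 0.1, 0.1), (0.9, 0.9, 0.9), (2.0, 2.0, 2.0)]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
voxels = voxelize_rgbd(points, colors, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 2)
```

In this toy example the first two points share voxel `(0, 0, 0)`, the third lands in `(1, 1, 1)`, and the fourth lies outside the bounds and is dropped. A dense occupancy-plus-color tensor, which a voxel-based model would typically consume, can be built from this sparse dictionary.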
