Learning Actions from Human Demonstration Video for Robotic Manipulation
This work addresses the challenge of enabling robots to interpret human demonstrations for manipulation tasks; its contribution is incremental, since it builds on existing video captioning methods.
The paper tackles the problem of learning robotic manipulation actions from human demonstration videos by improving video captioning to focus on objects of interest, resulting in more accurate commands and robust grasping performance on a UR5 robotic arm.
Learning actions from human demonstration is an emerging trend in the design of intelligent robotic systems, and is often referred to as video-to-command. The performance of this approach relies heavily on the quality of video captioning. However, general video captioning methods focus on understanding the full frame and give little consideration to the specific objects of interest in robotic manipulation. We propose a novel deep model to learn actions from human demonstration video for robotic manipulation. It consists of two deep networks: a grasp detection network (GNet) and a video captioning network (CNet). GNet performs two functions: providing grasp solutions and extracting local features for the objects of interest in robotic manipulation. CNet outputs the captioning results by fusing the features of both full frames and local objects. Experimental results on a UR5 robotic arm show that our method produces more accurate commands from video demonstration than state-of-the-art work, thereby leading to more robust grasping performance.
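The fusion step in CNet can be illustrated with a small sketch. The abstract does not state how the full-frame and local object features are combined, so the per-frame concatenation below is an assumption made purely for illustration, and the names `fuse_frame_features` and `fuse_video_features` are hypothetical rather than taken from the paper. Plain Python lists stand in for the feature vectors that the two networks would produce.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_frame_features(global_feat: Sequence[float],
                        local_feat: Sequence[float]) -> Vector:
    """Fuse one frame's features by concatenation.

    Assumption: concatenation is used only as a simple stand-in;
    the paper's actual fusion operator is not given in the abstract.
    """
    return list(global_feat) + list(local_feat)


def fuse_video_features(global_feats: Sequence[Sequence[float]],
                        local_feats: Sequence[Sequence[float]]) -> List[Vector]:
    """Fuse full-frame and object-level features for every frame of a video."""
    if len(global_feats) != len(local_feats):
        raise ValueError("expected one local feature vector per frame")
    return [fuse_frame_features(g, l) for g, l in zip(global_feats, local_feats)]


# Toy input: 2 frames, 3-D full-frame features, 2-D object features.
frame_features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
object_features = [[1.0, 2.0], [3.0, 4.0]]

fused = fuse_video_features(frame_features, object_features)
```

In this sketch each fused vector has the combined length of its two inputs, and the resulting sequence would then be passed to a caption decoder. Whether the actual model concatenates, sums, or attends over the two feature streams would need to be checked against the full paper.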