Action-conditioned Deep Visual Prediction with RoAM, a new Indoor Human Motion Dataset for Autonomous Robots
This work addresses visual prediction for autonomous robots that collaborate with humans in indoor settings. The contribution is incremental: it builds on existing video prediction methods, adding a new dataset and action conditioning.
The authors aim to enable robots to predict future visual scenes in indoor environments. They introduce the RoAM dataset, which includes ego-vision videos, LiDAR scans, and robot actions. They benchmark it with ACPNet, a novel deep visual prediction framework that conditions its predictions on robot actions, and report competitive results on this new benchmark.
With the increasing adoption of robots across industries, it is crucial to develop advanced algorithms that enable robots to anticipate, comprehend, and plan their actions effectively in collaboration with humans. We introduce the Robot Autonomous Motion (RoAM) video dataset, collected with a custom-made TurtleBot3 Burger robot in a variety of indoor environments, recording various human motions from the robot's ego-vision. The dataset also includes synchronized records of the LiDAR scan and all control actions taken by the robot as it navigates around static and moving human agents. This dataset provides an opportunity to develop and benchmark new visual prediction frameworks that predict future image frames based on the action taken by the recording agent, in partially observable scenarios or in cases where the imaging sensor is mounted on a moving platform. We have benchmarked the dataset on our novel deep visual prediction framework, ACPNet, in which the approximated future image frames are also conditioned on the action taken by the robot, and demonstrated its potential for incorporating robot dynamics into the video prediction paradigm for mobile robotics and autonomous navigation research.
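To make the action-conditioned setup concrete, the sketch below shows one way a synchronized frame and action sequence could be cut into training samples: a window of past frames and actions as context, the actions over the prediction horizon as conditioning input, and the corresponding future frames as targets. It is an illustration only, not the RoAM file format or the ACPNet data pipeline. The names `Step`, `Sample`, and `make_samples`, the (linear velocity, angular velocity) action layout, and the choice to give the predictor the future actions are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Action = Tuple[float, float]  # assumed layout: (linear velocity, angular velocity)


@dataclass
class Step:
    """One synchronized time step (hypothetical layout, not RoAM's format)."""
    frame: str      # stand-in for an image, e.g. a file path
    action: Action  # control command recorded at the same time step


@dataclass
class Sample:
    """One action-conditioned training sample."""
    past_frames: List[str]
    past_actions: List[Action]
    future_actions: List[Action]  # conditioning input over the horizon (assumed)
    future_frames: List[str]      # prediction targets


def make_samples(steps: Sequence[Step], n_past: int, n_future: int,
                 stride: int = 1) -> List[Sample]:
    """Slide a window of n_past + n_future steps over the sequence."""
    if n_past < 1 or n_future < 1 or stride < 1:
        raise ValueError("n_past, n_future and stride must be positive")
    window = n_past + n_future
    samples: List[Sample] = []
    # An empty range (sequence shorter than one window) yields no samples.
    for start in range(0, len(steps) - window + 1, stride):
        past = steps[start:start + n_past]
        future = steps[start + n_past:start + window]
        samples.append(Sample(
            past_frames=[s.frame for s in past],
            past_actions=[s.action for s in past],
            future_actions=[s.action for s in future],
            future_frames=[s.frame for s in future],
        ))
    return samples
```

A model trained on such samples would receive `past_frames`, `past_actions`, and `future_actions` and be scored against `future_frames`. Dropping the two action fields gives the unconditioned video prediction setting, which is the comparison the abstract's claim about incorporating robot dynamics rests on.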