Iori Yanokura

RO
3 papers
57 citations
Novelty 47%
AI Score 23

3 Papers

CV Dec 9, 2020
Understanding Action Sequences based on Video Captioning for Learning-from-Observation

Iori Yanokura, Naoki Wake, Kazuhiro Sasabuchi et al.

Learning actions from human demonstration video is promising for intelligent robotic systems. Extracting the exact section and re-observing the extracted video section in detail is important for imitating complex skills because human motions give valuable hints for robots. However, general video understanding methods focus on understanding the full frame and give little consideration to extracting accurate sections and aligning them with the human's intent. We propose a Learning-from-Observation framework that splits and understands a video of a human demonstration with verbal instructions to extract accurate action sequences. The splitting is based on local minimum points of the hand velocity, which align human daily-life actions with the object-centered face contact transitions required for generating robot motion. Then, we extract motion descriptions from the split videos using video captioning techniques trained on our new daily-life action video dataset. Finally, we match the motion descriptions with the verbal instructions to understand the correct human intent and ignore the unintended actions inside the video. We evaluate the validity of hand velocity-based video splitting and demonstrate that it is effective. The experimental results on our new video captioning dataset focusing on daily-life human actions demonstrate the effectiveness of the proposed method. The source code, trained models, and the dataset will be made available.
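The splitting step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`hand_speeds`, `local_minima`, `split_segments`), the finite-difference speed estimate, and the absence of any smoothing or thresholding are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def hand_speeds(positions: Sequence[Point], dt: float) -> List[float]:
    """Finite-difference hand speed between consecutive 3D positions.

    Hypothetical helper: assumes positions are sampled every `dt` seconds.
    """
    speeds = []
    for p, q in zip(positions, positions[1:]):
        dist = sum((b - a) ** 2 for a, b in zip(p, q)) ** 0.5
        speeds.append(dist / dt)
    return speeds


def local_minima(speeds: Sequence[float]) -> List[int]:
    """Indices where the speed is lower than the previous sample and
    not higher than the next one (a simple local-minimum test)."""
    return [
        i
        for i in range(1, len(speeds) - 1)
        if speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]
    ]


def split_segments(n_samples: int, cuts: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn cut indices into half-open (start, end) segments over the samples."""
    bounds = [0] + list(cuts) + [n_samples]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


if __name__ == "__main__":
    # Made-up speed profile: two slow-downs separate three motions.
    speeds = [0.1, 0.5, 0.9, 0.4, 0.05, 0.3, 0.8, 0.2, 0.02, 0.6]
    cuts = local_minima(speeds)
    print(cuts)
    print(split_segments(len(speeds), cuts))
```

Each resulting segment would then be passed to a captioning model; real hand trajectories are noisy, so a practical version would likely smooth the speed signal before looking for minima.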

RO Aug 4, 2020
A Learning-from-Observation Framework: One-Shot Robot Teaching for Grasp-Manipulation-Release Household Operations

Naoki Wake, Riku Arakawa, Iori Yanokura et al.

A household robot is expected to perform various manipulative operations with an understanding of the purpose of the task. To this end, a desirable robotic application should provide an on-site robot teaching framework for non-experts. Here we propose a Learning-from-Observation (LfO) framework for grasp-manipulation-release class household operations (GMR-operations). The framework maps human demonstrations to predefined task models through one-shot teaching. Each task model contains both high-level knowledge regarding the geometric constraints and low-level knowledge related to human postures. The key idea is to design a task model that 1) covers various GMR-operations and 2) includes human postures to achieve tasks. We verify the applicability of our framework by testing an operational LfO system with a real robot. In addition, we quantify the coverage of the task model by analyzing online videos of household operations. In the context of one-shot robot teaching, the contribution of this study is a framework that 1) covers various GMR-operations and 2) mimics human postures during the operations.

RO Jul 17, 2020
Verbal Focus-of-Attention System for Learning-from-Observation

Naoki Wake, Iori Yanokura, Kazuhiro Sasabuchi et al.

The learning-from-observation (LfO) framework aims to map human demonstrations to a robot to reduce programming effort. To this end, an LfO system encodes a human demonstration into a series of execution units for a robot, which are referred to as task models. Although previous research has proposed successful task-model encoders, there has been little discussion on how to guide a task-model encoder in a scene with spatio-temporal noises, such as cluttered objects or unrelated human body movements. Inspired by the function of verbal instructions guiding an observer's visual attention, we propose a verbal focus-of-attention (FoA) system (i.e., spatio-temporal filters) to guide a task-model encoder. For object manipulation, the system first recognizes the name of a target object and its attributes from verbal instructions. The information serves as a where-to-look FoA filter to confine the areas in which the target object existed in the demonstration. The system then detects the timings of grasp and release that occurred in the filtered areas. The timings serve as a when-to-look FoA filter to confine the period of object manipulation. Finally, a task-model encoder recognizes the task models by employing FoA filters. We demonstrate the robustness of the verbal FoA in attenuating spatio-temporal noises by comparing it with an existing action localization network. The contributions of this study are as follows: (1) to propose a verbal FoA for LfO, (2) to design an algorithm to calculate FoA filters from verbal input, and (3) to demonstrate the effectiveness of a verbal FoA in localizing an action by comparing it with a state-of-the-art vision system.
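The two focus-of-attention filters described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the `Detection` and `HandEvent` records, the bounding-box representation, and the rule of pairing the first grasp with the first later release are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed image coordinates


@dataclass(frozen=True)
class Detection:
    """Hypothetical object detection in one frame."""
    frame: int
    label: str
    attribute: str
    box: Box


@dataclass(frozen=True)
class HandEvent:
    """Hypothetical grasp or release event at an image position."""
    frame: int
    kind: str  # "grasp" or "release"
    x: float
    y: float


def where_to_look(detections: Sequence[Detection], label: str, attribute: str) -> List[Box]:
    """Spatial filter: keep only areas where the named target object appeared."""
    return [d.box for d in detections if d.label == label and d.attribute == attribute]


def inside(box: Box, x: float, y: float) -> bool:
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def when_to_look(events: Sequence[HandEvent], boxes: Sequence[Box]) -> Optional[Tuple[int, int]]:
    """Temporal filter: first grasp inside the filtered areas and the first
    release inside them at or after that grasp, or None if either is missing."""
    hits = [e for e in events if any(inside(b, e.x, e.y) for b in boxes)]
    grasps = [e.frame for e in hits if e.kind == "grasp"]
    if not grasps:
        return None
    start = min(grasps)
    releases = [e.frame for e in hits if e.kind == "release" and e.frame >= start]
    return (start, min(releases)) if releases else None


if __name__ == "__main__":
    detections = [
        Detection(0, "cup", "red", (0, 0, 10, 10)),
        Detection(0, "cup", "blue", (50, 50, 60, 60)),
        Detection(5, "cup", "red", (20, 0, 30, 10)),
    ]
    events = [
        HandEvent(2, "grasp", 55, 55),   # unrelated: touches the blue cup
        HandEvent(3, "grasp", 5, 5),
        HandEvent(8, "release", 25, 5),
        HandEvent(9, "release", 55, 55),
    ]
    boxes = where_to_look(detections, "cup", "red")
    print(when_to_look(events, boxes))
```

The frame interval returned by `when_to_look`, together with the filtered areas, is what a downstream task-model encoder would be restricted to; events on the other object fall outside both filters and are ignored.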