39.2ROJun 1
Co-training with Ego-centric Video and Demonstration for Robot Navigation TaskShoya Kuno, Yumo Ouchi, Kanata Suzuki
Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.
ROSep 26, 2023
Learning Multimodal Attention for Manipulating Deformable Objects with Changing StatesNamiko Saito, Mayu Tatsumi, Ayuna Kubo et al.
To support humans in their daily lives, robots are required to autonomously learn, adapt to objects and environments, and perform the appropriate actions. We tackled on the task of cooking scrambled eggs using real ingredients, in which the robot needs to perceive the states of the egg and adjust stirring movement in real time, while the egg is heated and the state changes continuously. In previous works, handling changing objects was found to be challenging because sensory information includes dynamical, both important or noisy information, and the modality which should be focused on changes every time, making it difficult to realize both perception and motion generation in real time. We propose a predictive recurrent neural network with an attention mechanism that can weigh the sensor input, distinguishing how important and reliable each modality is, that realize quick and efficient perception and motion generation. The model is trained with learning from the demonstration, and allows the robot to acquire human-like skills. We validated the proposed technique using the robot, Dry-AIREC, and with our learning model, it could perform cooking eggs with unknown ingredients. The robot could change the method of stirring and direction depending on the status of the egg, as in the beginning it stirs in the whole pot, then subsequently, after the egg started being heated, it starts flipping and splitting motion targeting specific areas, although we did not explicitly indicate them.
ROAug 30, 2023
Interactively Robot Action Planning with Uncertainty Analysis and Active Questioning by Large Language ModelKazuki Hori, Kanata Suzuki, Tetsuya Ogata
The application of the Large Language Model (LLM) to robot action planning has been actively studied. The instructions given to the LLM by natural language may include ambiguity and lack of information depending on the task context. It is possible to adjust the output of LLM by making the instruction input more detailed; however, the design cost is high. In this paper, we propose the interactive robot action planning method that allows the LLM to analyze and gather missing information by asking questions to humans. The method can minimize the design cost of generating precise robot instructions. We demonstrated the effectiveness of our method through concrete examples in cooking tasks. However, our experiments also revealed challenges in robot action planning with LLM, such as asking unimportant questions and assuming crucial information without asking. Shedding light on these issues provides valuable insights for future research on utilizing LLM for robotics.
ROMar 8, 2022
Learning Bidirectional Translation between Descriptions and Actions with Small Paired DataMinori Toyoda, Kanata Suzuki, Yoshihiko Hayashi et al.
This study achieved bidirectional translation between descriptions and actions using small paired data from different modalities. The ability to mutually generate descriptions and actions is essential for robots to collaborate with humans in their daily lives, which generally requires a large dataset that maintains comprehensive pairs of both modality data. However, a paired dataset is expensive to construct and difficult to collect. To address this issue, this study proposes a two-stage training method for bidirectional translation. In the proposed method, we train recurrent autoencoders (RAEs) for descriptions and actions with a large amount of non-paired data. Then, we finetune the entire model to bind their intermediate representations using small paired data. Because the data used for pre-training do not require pairing, behavior-only data or a large language corpus can be used. We experimentally evaluated our method using a paired dataset consisting of motion-captured actions and descriptions. The results showed that our method performed well, even when the amount of paired data to train was small. The visualization of the intermediate representations of each RAE showed that similar actions were encoded in a clustered position and the corresponding feature vectors were well aligned.
12.2ROMay 8
How to utilize failure demo data?: Effective data selection for imitation learning using distribution differences in attention mechanismKana Miyamoto, Kanata Suzuki, Tetsuya Ogata
Imitation learning for robotic tasks has relied primarily on policies trained only on successful demonstrations, although failures are unavoidable during human data collection. Many existing approaches for exploiting failure data require additional data processing or iterative policy updates through autonomous rollouts, making it difficult to directly and stably utilize failure data accumulated during data collection. In this work, we propose a method that learns latent representations of success-failure discrepancies and incorporates them into the attention mechanism. During inference, an appropriate latent mode is selected from the initial observation to improve action stability. Furthermore, we introduce a post-training metric that quantifies the attention discrepancy between each failure sample and successful demonstrations to select failure data. Simulation results show that the proposed method improves task success rates when trained with failure data and that the proposed metric identifies failure samples that are beneficial for learning when combined with successful demonstrations. These results suggest that the proposed method can support more efficient use of collected demonstrations in robotic data collection pipelines.
CVOct 30, 2021
Three approaches to facilitate DNN generalization to objects in out-of-distribution orientations and illuminationsAkira Sakai, Taro Sunagawa, Spandan Madan et al.
The training data distribution is often biased towards objects in certain orientations and illumination conditions. While humans have a remarkable capability of recognizing objects in out-of-distribution (OoD) orientations and illuminations, Deep Neural Networks (DNNs) severely suffer in this case, even when large amounts of training examples are available. In this paper, we investigate three different approaches to improve DNNs in recognizing objects in OoD orientations and illuminations. Namely, these are (i) training much longer after convergence of the in-distribution (InD) validation accuracy, i.e., late-stopping, (ii) tuning the momentum parameter of the batch normalization layers, and (iii) enforcing invariance of the neural activity in an intermediate layer to orientation and illumination conditions. Each of these approaches substantially improves the DNN's OoD accuracy (more than 20% in some cases). We report results in four datasets: two datasets are modified from the MNIST and iLab datasets, and the other two are novel (one of 3D rendered cars and another of objects taken from various controlled orientations and illumination conditions). These datasets allow to study the effects of different amounts of bias and are challenging as DNNs perform poorly in OoD conditions. Finally, we demonstrate that even though the three approaches focus on different aspects of DNNs, they all tend to lead to the same underlying neural mechanism to enable OoD accuracy gains --individual neurons in the intermediate layers become more selective to a category and also invariant to OoD orientations and illuminations. We anticipate this study to be a basis for further improvement of deep neural networks' OoD generalization performance, which is highly demanded to achieve safe and fair AI applications.
ROOct 3, 2021
Annotation Cost Reduction of Stream-based Active Learning by Automated Weak Labeling using a Robot ArmKanata Suzuki, Taro Sunagawa, Tomotake Sasaki et al.
Stream-based active learning (AL) is an efficient training data collection method, and it is used to reduce human annotation cost required in machine learning. However, it is difficult to say that the human cost is low enough because most previous studies have assumed that an oracle is a human with domain knowledge. In this study, we propose a method to replace a part of the oracle's work in stream-based AL by self-training with weak labeling using a robot arm. A camera attached to a robot arm takes a series of image data related to a streamed object, which should have the same label. We use this information as a weak label to connect a pseudo-label (estimated class label) and a target instance. Our method selects two data from a series of image data; high confidence data for correcting pseudo-labels and low confidence data for improving the performance of the classifier. We paired a pseudo-label provided to high confidence data with a target instance (low confidence data). By using this technique, we mitigate the inefficiency in self-training, that is, difficulty in creating pseudo-labeled training data with a high impact on the target classifier. In the experiments, we employed the proposed method in the classification task of objects on a belt conveyor. We evaluated the performance against human cost on multiple scenarios considering the temporal variation of data. The proposed method achieves the same or better performance as the conventional methods while reducing human cost.
ROApr 17, 2021
Embodying Pre-Trained Word Embeddings Through Robot ActionsMinori Toyoda, Kanata Suzuki, Hiroki Mori et al.
We propose a promising neural network model with which to acquire a grounded representation of robot actions and the linguistic descriptions thereof. Properly responding to various linguistic expressions, including polysemous words, is an important ability for robots that interact with people via linguistic dialogue. Previous studies have shown that robots can use words that are not included in the action-description paired datasets by using pre-trained word embeddings. However, the word embeddings trained under the distributional hypothesis are not grounded, as they are derived purely from a text corpus. In this letter, we transform the pre-trained word embeddings to embodied ones by using the robot's sensory-motor experiences. We extend a bidirectional translation model for actions and descriptions by incorporating non-linear layers that retrofit the word embeddings. By training the retrofit layer and the bidirectional translation model alternately, our proposed model is able to transform the pre-trained word embeddings to adapt to a paired action-description dataset. Our results demonstrate that the embeddings of synonyms form a semantic cluster by reflecting the experiences (actions and environments) of a robot. These embeddings allow the robot to properly generate actions from unseen words that are not paired with actions in a dataset.
ROMar 17, 2021
In-air Knotting of Rope using Dual-Arm Robot based on Deep LearningKanata Suzuki, Momomi Kanamura, Yuki Suga et al.
In this study, we report the successful execution of in-air knotting of rope using a dual-arm two-finger robot based on deep learning. Owing to its flexibility, the state of the rope was in constant flux during the operation of the robot. This required the robot control system to dynamically correspond to the state of the object at all times. However, a manual description of appropriate robot motions corresponding to all object states is difficult to be prepared in advance. To resolve this issue, we constructed a model that instructed the robot to perform bowknots and overhand knots based on two deep neural networks trained using the data gathered from its sensorimotor, including visual and proximity sensors. The resultant model was verified to be capable of predicting the appropriate robot motions based on the sensory information available online. In addition, we designed certain task motions based on the Ian knot method using the dual-arm two-fingers robot. The designed knotting motions do not require a dedicated workbench or robot hand, thereby enhancing the versatility of the proposed method. Finally, experiments were performed to estimate the knotting performance of the real robot while executing overhand knots and bowknots on rope and its success rate. The experimental results established the effectiveness and high performance of the proposed method.
LGJan 18, 2021
Stable deep reinforcement learning method by predicting uncertainty in rewards as a subtaskKanata Suzuki, Tetsuya Ogata
In recent years, a variety of tasks have been accomplished by deep reinforcement learning (DRL). However, when applying DRL to tasks in a real-world environment, designing an appropriate reward is difficult. Rewards obtained via actual hardware sensors may include noise, misinterpretation, or failed observations. The learning instability caused by these unstable signals is a problem that remains to be solved in DRL. In this work, we propose an approach that extends existing DRL models by adding a subtask to directly estimate the variance contained in the reward signal. The model then takes the feature map learned by the subtask in a critic network and sends it to the actor network. This enables stable learning that is robust to the effects of potential noise. The results of experiments in the Atari game domain with unstable reward signals show that our method stabilizes training convergence. We also discuss the extensibility of the model by visualizing feature maps. This approach has the potential to make DRL more practical for use in noisy, real-world scenarios.
ROMar 31, 2020
HATSUKI : An anime character like robot figure platform with anime-style expressions and imitation learning based action generationPin-Chu Yang, Mohammed Al-Sada, Chang-Chieh Chiu et al.
Japanese character figurines are popular and have pivot position in Otaku culture. Although numerous robots have been developed, less have focused on otaku-culture or on embodying the anime character figurine. Therefore, we take the first steps to bridge this gap by developing Hatsuki, which is a humanoid robot platform with anime based design. Hatsuki's novelty lies in aesthetic design, 2D facial expressions, and anime-style behaviors that allows it to deliver rich interaction experiences resembling anime-characters. We explain our design implementation process of Hatsuki, followed by our evaluations. In order to explore user impressions and opinions towards Hatsuki, we conducted a questionnaire in the world's largest anime-figurine event. The results indicate that participants were generally very satisfied with Hatsuki's design, and proposed various use case scenarios and deployment contexts for Hatsuki. The second evaluation focused on imitation learning, as such method can provide better interaction ability in the real world and generate rich, context-adaptive behavior in different situations. We made Hatsuki learn 11 actions, combining voice, facial expressions and motions, through neuron network based policy model with our proposed interface. Results show our approach was successfully able to generate the actions through self-organized contexts, which shows the potential for generalizing our approach in further actions under different contexts. Lastly, we present our future research direction for Hatsuki, and provide our conclusion.
ROMar 12, 2020
A Multi-task Learning Framework for Grasping-Position Detection and Few-Shot ClassificationYasuto Yokota, Kanata Suzuki, Yuzi Kanazawa et al.
It is a big problem that a model of deep learning for a picking robot needs many labeled images. Operating costs of retraining a model becomes very expensive because the object shape of a product or a part often is changed in a factory. It is important to reduce the amount of labeled images required to train a model for a picking robot. In this study, we propose a multi-task learning framework for few-shot classification using feature vectors from an intermediate layer of a model that detects grasping positions. In the field of manufacturing, multitask for shape classification and grasping-position detection is often required for picking robots. Prior multi-task learning studies include methods to learn one task with feature vectors from a deep neural network (DNN) learned for another task. However, the DNN that was used to detect grasping positions has two problems with respect to extracting feature vectors from a layer for shape classification: (1) Because each layer of the grasping position detection DNN is activated by all objects in the input image, it is necessary to refine the features for each grasping position. (2) It is necessary to select a layer to extract the features suitable for shape classification. To tackle these issues, we propose a method to refine the features for each grasping position and to select features from the optimal layer of the DNN. We then evaluated the shape classification accuracy using these features from the grasping positions. Our results confirm that our proposed framework can classify object shapes even when the input image includes multiple objects and the number of images available for training is small.
ROMar 10, 2020
Compensation for undefined behaviors during robot task execution by switching controllers depending on embedded dynamics in RNNKanata Suzuki, Hiroki Mori, Tetsuya Ogata
Robotic applications require both correct task performance and compensation for undefined behaviors. Although deep learning is a promising approach to perform complex tasks, the response to undefined behaviors that are not reflected in the training dataset remains challenging. In a human-robot collaborative task, the robot may adopt an unexpected posture due to collisions and other unexpected events. Therefore, robots should be able to recover from disturbances for completing the execution of the intended task. We propose a compensation method for undefined behaviors by switching between two controllers. Specifically, the proposed method switches between learning-based and model-based controllers depending on the internal representation of a recurrent neural network that learns task dynamics. We applied the proposed method to a pick-and-place task and evaluated the compensation for undefined behaviors. Experimental results from simulations and on a real robot demonstrate the effectiveness and high performance of the proposed method.
ROMar 8, 2020
Online Self-Supervised Learning for Object Picking: Detecting Optimum Grasping Position using a Metric Learning ApproachKanata Suzuki, Yasuto Yokota, Yuzi Kanazawa et al.
Self-supervised learning methods are attractive candidates for automatic object picking. However, the trial samples lack the complete ground truth because the observable parts of the agent are limited. That is, the information contained in the trial samples is often insufficient to learn the specific grasping position of each object. Consequently, the training falls into a local solution, and the grasp positions learned by the robot are independent of the state of the object. In this study, the optimal grasping position of an individual object is determined from the grasping score, defined as the distance in the feature space obtained using metric learning. The closeness of the solution to the pre-designed optimal grasping position was evaluated in trials. The proposed method incorporates two types of feedback control: one feedback enlarges the grasping score when the grasping position approaches the optimum; the other reduces the negative feedback of the potential grasping positions among the grasping candidates. The proposed online self-supervised learning method employs two deep neural networks. : SSD that detects the grasping position of an object, and Siamese networks (SNs) that evaluate the trial sample using the similarity of two input data in the feature space. Our method embeds the relation of each grasping position as feature vectors by training the trial samples and a few pre-samples indicating the optimum grasping position. By incorporating the grasping score based on the feature space of SNs into the SSD training process, the method preferentially trains the optimum grasping position. In the experiment, the proposed method achieved a higher success rate than the baseline method using simple teaching signals. And the grasping scores in the feature space of the SNs accurately represented the grasping positions of the objects.
RODec 14, 2017
Motion Switching with Sensory and Instruction Signals by designing Dynamical Systems using Deep Neural NetworkKanata Suzuki, Hiroki Mori, Tetsuya Ogata
To ensure that a robot is able to accomplish an extensive range of tasks, it is necessary to achieve a flexible combination of multiple behaviors. This is because the design of task motions suited to each situation would become increasingly difficult as the number of situations and the types of tasks performed by them increase. To handle the switching and combination of multiple behaviors, we propose a method to design dynamical systems based on point attractors that accept (i) "instruction signals" for instruction-driven switching. We incorporate the (ii) "instruction phase" to form a point attractor and divide the target task into multiple subtasks. By forming an instruction phase that consists of point attractors, the model embeds a subtask in the form of trajectory dynamics that can be manipulated using sensory and instruction signals. Our model comprises two deep neural networks: a convolutional autoencoder and a multiple time-scale recurrent neural network. In this study, we apply the proposed method to manipulate soft materials. To evaluate our model, we design a cloth-folding task that consists of four subtasks and three patterns of instruction signals, which indicate the direction of motion. The results depict that the robot can perform the required task by combining subtasks based on sensory and instruction signals. And, our model determined the relations among these signals using its internal dynamics.