RO Jun 29, 2022
Deep Active Visual Attention for Real-time Robot Motion Generation: Emergence of Tool-body Assimilation and Adaptive Tool-use
Hyogo Hiruma, Hiroshi Ito, Hiroki Mori et al.
Sufficiently perceiving the environment is a critical factor in robot motion generation. Although the introduction of deep visual processing models has helped extend this ability, existing methods lack the ability to actively modify what to perceive, something humans do internally during visual cognitive processes. This paper addresses the issue by proposing a novel robot motion generation model inspired by a human cognitive structure. The model incorporates a state-driven active top-down visual attention module, which acquires attention that can actively change targets based on task states. We term such attention role-based attention, since the acquired attention is directed to targets that share a coherent role throughout the motion. The model was trained on a robot tool-use task, in which the role-based attention perceived the robot grippers and the tool as identical end-effectors during object-picking and object-dragging motions, respectively. This is analogous to a biological phenomenon called tool-body assimilation, in which one regards a handled tool as an extension of one's body. The results suggest improved flexibility in the model's visual perception, which sustained stable attention and motion even when it was given untrained tools or exposed to the experimenter's distractions.
RO Dec 27, 2023
Visual Spatial Attention and Proprioceptive Data-Driven Reinforcement Learning for Robust Peg-in-Hole Task Under Variable Conditions
André Yuji Yasutomi, Hideyuki Ichiwara, Hiroshi Ito et al.
Anchor-bolt insertion is a peg-in-hole task performed in the construction field for holes in concrete. Efforts have been made to automate this task, but the variable lighting and hole surface conditions, as well as the requirements for short setup and task execution time make the automation challenging. In this study, we introduce a vision and proprioceptive data-driven robot control model for this task that is robust to challenging lighting and hole surface conditions. This model consists of a spatial attention point network (SAP) and a deep reinforcement learning (DRL) policy that are trained jointly end-to-end to control the robot. The model is trained in an offline manner, with a sample-efficient framework designed to reduce training time and minimize the reality gap when transferring the model to the physical world. Through evaluations with an industrial robot performing the task in 12 unknown holes, starting from 16 different initial positions, and under three different lighting conditions (two with misleading shadows), we demonstrate that SAP can generate relevant attention points of the image even in challenging lighting conditions. We also show that the proposed model enables task execution with higher success rate and shorter task completion time than various baselines. Due to the proposed model's high effectiveness even in severe lighting, initial positions, and hole conditions, and the offline training framework's high sample-efficiency and short training time, this approach can be easily applied to construction.
RO Mar 29, 2024
A Peg-in-hole Task Strategy for Holes in Concrete
André Yuji Yasutomi, Hiroki Mori, Tetsuya Ogata
A method that enables an industrial robot to accomplish the peg-in-hole task for holes in concrete is proposed. The proposed method involves slightly detaching the peg from the wall when moving between search positions, to avoid the negative influence of the concrete's high friction coefficient. It uses a deep neural network (DNN), trained via reinforcement learning, to effectively find holes with variable shape and surface finish (due to the brittle nature of concrete) without analytical modeling or control parameter tuning. The method uses the displacement of the peg toward the wall surface, in addition to force and torque, as one of the inputs of the DNN. Since the displacement increases as the peg gets closer to the hole (due to the chamfered shape of holes in concrete), it is a useful input to the DNN. The proposed method was evaluated by training the DNN on a hole 500 times and attempting to find 12 unknown holes. The results of the evaluation show that the DNN enabled a robot to find the unknown holes with an average success rate of 96.1% and an average execution time of 12.5 seconds. Additional evaluations with random initial positions and a different type of peg demonstrate that the trained DNN can generalize well to different conditions. Analyses of the influence of the peg displacement input showed that the success rate of the DNN is increased by utilizing this parameter. These results validate the proposed method in terms of its effectiveness and applicability to the construction industry.
CL Dec 6, 2024
Who Speaks Next? Multi-party AI Discussion Leveraging the Systematics of Turn-taking in Murder Mystery Games
Ryota Nonomura, Hiroki Mori
Multi-agent systems utilizing large language models (LLMs) have shown great promise in achieving natural dialogue. However, smooth dialogue control and autonomous decision making among agents still remain challenges. In this study, we focus on conversational norms such as adjacency pairs and turn-taking found in conversation analysis and propose a new framework called "Murder Mystery Agents" that applies these norms to AI agents' dialogue control. As an evaluation target, we employed the "Murder Mystery" game, a reasoning-type table-top role-playing game that requires complex social reasoning and information manipulation. In this game, players need to unravel the truth of the case based on fragmentary information through cooperation and bargaining. The proposed framework integrates next speaker selection based on adjacency pairs and a self-selection mechanism that takes agents' internal states into account to achieve more natural and strategic dialogue. To verify the effectiveness of this new approach, we analyzed utterances that led to dialogue breakdowns and conducted automatic evaluation using LLMs, as well as human evaluation using evaluation criteria developed for the Murder Mystery game. Experimental results showed that the implementation of the next speaker selection mechanism significantly reduced dialogue breakdowns and improved the ability of agents to share information and perform logical reasoning. The results of this study demonstrate that the systematics of turn-taking in human conversation are also effective in controlling dialogue among AI agents, and provide design guidelines for more advanced multi-agent dialogue systems.
RO Oct 11, 2025
A3RNN: Bi-directional Fusion of Bottom-up and Top-down Process for Developmental Visual Attention in Robots
Hyogo Hiruma, Hiroshi Ito, Hiroki Mori et al.
This study investigates the developmental interaction between top-down (TD) and bottom-up (BU) visual attention in robotic learning. Our goal is to understand how structured, human-like attentional behavior emerges through the mutual adaptation of TD and BU mechanisms over time. To this end, we propose a novel attention model $A^3 RNN$ that integrates predictive TD signals and saliency-based BU cues through a bi-directional attention architecture. We evaluate our model in robotic manipulation tasks using imitation learning. Experimental results show that attention behaviors evolve throughout training, from saliency-driven exploration to prediction-driven direction. Initially, BU attention highlights visually salient regions, which guide TD processes, while as learning progresses, TD attention stabilizes and begins to reshape what is perceived as salient. This trajectory reflects principles from cognitive science and the free-energy framework, suggesting the importance of self-organizing attention through interaction between perception and internal prediction. Although not explicitly optimized for stability, our model exhibits more coherent and interpretable attention patterns than baselines, supporting the idea that developmental mechanisms contribute to robust attention formation.
RO Feb 26, 2022
Learning-based Collision-free Planning on Arbitrary Optimization Criteria in the Latent Space through cGANs
Tomoki Ando, Hiroto Iino, Hiroki Mori et al.
We propose a new method for collision-free planning using Conditional Generative Adversarial Networks (cGANs) to transform between the robot's joint space and a latent space that captures only collision-free areas of the joint space, conditioned by an obstacle map. Generating multiple plausible trajectories is convenient in applications such as the manipulation of a robot arm, because it enables the selection of trajectories that avoid collision with the robot or the surrounding environment. In the proposed method, various trajectories that avoid obstacles can be generated by connecting the start and goal states with arbitrary line segments in this generated latent space. Our method provides this collision-free latent space, after which any planner, using any optimization conditions, can be used to generate the most suitable paths on the fly. We successfully verified this method with a simulated and an actual UR5e 6-DoF robotic arm. We confirmed that different trajectories could be generated depending on optimization conditions.
RO Feb 21, 2022
Guided Visual Attention Model Based on Interactions Between Top-down and Bottom-up Information for Robot Pose Prediction
Hyogo Hiruma, Hiroki Mori, Hiroshi Ito et al.
Deep robot vision models are widely used for recognizing objects from camera images, but show poor performance when detecting objects at untrained positions. Although this problem can be alleviated by training with large datasets, the dataset collection cost cannot be ignored. Existing visual attention models tackle the problem by employing a data-efficient structure that learns to extract task-relevant image areas. However, since these models cannot modify attention targets after training, they are difficult to apply to dynamically changing tasks. This paper proposes a novel Key-Query-Value formulated visual attention model. The model is capable of switching attention targets by externally modifying the Query representations, namely top-down attention. The proposed model is evaluated in a simulator and in a real-world environment. In the simulator experiments, the model was compared to existing end-to-end robot vision models and showed higher performance and data efficiency. In the real-world robot experiments, the model showed high precision along with scalability and extendibility.
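The idea of switching attention targets by changing only the Query can be illustrated with generic scaled dot-product attention. The following is a minimal sketch, not the paper's model: the function names, the two-region toy data, and the use of plain scaled dot-product scoring are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights over the keys and the weighted sum of values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical toy data: two image regions, each described by a key and a value.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.2, 0.8], [0.9, 0.1]]

# Keys and values stay fixed; only the externally supplied query changes.
weights_a, _ = attend([4.0, 0.0], keys, values)  # query aligned with region 0
weights_b, _ = attend([0.0, 4.0], keys, values)  # query aligned with region 1
```

With the keys and values held fixed, the first query puts most of its weight on region 0 and the second on region 1, which is the sense in which attention can be redirected from outside without retraining.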
RO Feb 15, 2022
Collision-free Path Planning in the Latent Space through cGANs
Tomoki Ando, Hiroki Mori, Ryota Torishima et al.
We show a new method for collision-free path planning that uses cGANs to map the latent space to only the collision-free areas of the robot joint space. Our method simply provides this collision-free latent space, after which any planner, using any optimization conditions, can be used to generate the most suitable paths on the fly. We successfully verified this method with a simulated two-link robot arm.
RO Dec 13, 2021
Contact-Rich Manipulation of a Flexible Object based on Deep Predictive Learning using Vision and Tactility
Hideyuki Ichiwara, Hiroshi Ito, Kenjiro Yamamoto et al.
We achieved contact-rich flexible object manipulation, which was difficult to control with vision alone. In the unzipping task we chose as a validation task, the gripper grasps the puller, which hides the bag state such as the direction and amount of deformation behind it, making it difficult to obtain information to perform the task by vision alone. Additionally, the flexible fabric bag state constantly changes during operation, so the robot needs to dynamically respond to the change. However, the appropriate robot behavior for all bag states is difficult to prepare in advance. To solve this problem, we developed a model that can perform contact-rich flexible object manipulation by real-time prediction of vision with tactility. We introduced a point-based attention mechanism for extracting image features, softmax transformation for predicting motions, and convolutional neural network for extracting tactile features. The results of experiments using a real robot arm revealed that our method can realize motions responding to the deformation of the bag while reducing the load on the zipper. Furthermore, using tactility improved the success rate from 56.7% to 93.3% compared with vision alone, demonstrating the effectiveness and high performance of our method.
RO Jun 4, 2021
How to select and use tools? : Active Perception of Target Objects Using Multimodal Deep Learning
Namiko Saito, Tetsuya Ogata, Satoshi Funabashi et al.
Selecting and using appropriate tools when performing daily tasks is a critical function for introducing robots for domestic applications. In previous studies, however, adaptability to target objects was limited, making it difficult to change tools and adjust actions accordingly. To manipulate various objects with tools, robots must both understand tool functions and recognize object characteristics to discern a tool-object-action relation. We focus on active perception using multimodal sensorimotor data while a robot interacts with objects, and allow the robot to recognize their extrinsic and intrinsic characteristics. We construct a deep neural network (DNN) model that learns to recognize object characteristics, acquires tool-object-action relations, and generates motions for tool selection and handling. As an example tool-use situation, the robot performs an ingredients transfer task, using a turner or ladle to transfer an ingredient from a pot to a bowl. The results confirm that the robot recognizes object characteristics and servings even when the target ingredients are unknown. We also examine the contributions of images, force, and tactile data and show that learning a variety of multimodal information results in rich perception for tool use.
RO Apr 17, 2021
Embodying Pre-Trained Word Embeddings Through Robot Actions
Minori Toyoda, Kanata Suzuki, Hiroki Mori et al.
We propose a promising neural network model with which to acquire a grounded representation of robot actions and the linguistic descriptions thereof. Properly responding to various linguistic expressions, including polysemous words, is an important ability for robots that interact with people via linguistic dialogue. Previous studies have shown that robots can use words that are not included in the action-description paired datasets by using pre-trained word embeddings. However, the word embeddings trained under the distributional hypothesis are not grounded, as they are derived purely from a text corpus. In this letter, we transform the pre-trained word embeddings to embodied ones by using the robot's sensory-motor experiences. We extend a bidirectional translation model for actions and descriptions by incorporating non-linear layers that retrofit the word embeddings. By training the retrofit layer and the bidirectional translation model alternately, our proposed model is able to transform the pre-trained word embeddings to adapt to a paired action-description dataset. Our results demonstrate that the embeddings of synonyms form a semantic cluster by reflecting the experiences (actions and environments) of a robot. These embeddings allow the robot to properly generate actions from unseen words that are not paired with actions in a dataset.
RO Mar 17, 2021
In-air Knotting of Rope using Dual-Arm Robot based on Deep Learning
Kanata Suzuki, Momomi Kanamura, Yuki Suga et al.
In this study, we report the successful execution of in-air knotting of rope using a dual-arm two-finger robot based on deep learning. Owing to its flexibility, the state of the rope was in constant flux during the operation of the robot. This required the robot control system to respond dynamically to the state of the object at all times. However, a manual description of appropriate robot motions corresponding to all object states is difficult to prepare in advance. To resolve this issue, we constructed a model that instructed the robot to perform bowknots and overhand knots based on two deep neural networks trained using the robot's sensorimotor data, including data from visual and proximity sensors. The resultant model was verified to be capable of predicting the appropriate robot motions based on the sensory information available online. In addition, we designed certain task motions based on the Ian knot method using the dual-arm two-finger robot. The designed knotting motions do not require a dedicated workbench or robot hand, thereby enhancing the versatility of the proposed method. Finally, experiments were performed to estimate the knotting performance and success rate of the real robot while executing overhand knots and bowknots on rope. The experimental results established the effectiveness and high performance of the proposed method.
RO Mar 2, 2021
Spatial Attention Point Network for Deep-learning-based Robust Autonomous Robot Motion Generation
Hideyuki Ichiwara, Hiroshi Ito, Kenjiro Yamamoto et al.
Deep learning provides a powerful framework for automated acquisition of complex robotic motions. However, despite a certain degree of generalization, the need for vast amounts of training data depending on the work-object position is an obstacle to industrial applications. Therefore, a robot motion-generation model that can respond to a variety of work-object positions with a small amount of training data is necessary. In this paper, we propose a method robust to changes in object position by automatically extracting spatial attention points in the image for the robot task and generating motions on the basis of their positions. We demonstrate our method with an LBR iiwa 7R1400 robot arm on a picking task and a pick-and-place task at various positions in various situations. In each task, the spatial attention points are obtained for the work objects that are important to the task. Our method is robust to changes in object position. Further, it is robust to changes in background, lighting, and obstacles that are not important to the task because it only focuses on positions that are important to the task.
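One common way to turn an image feature map into a point coordinate is a spatial softmax (soft-argmax). The abstract does not state how the attention points are computed, so the following is only an illustrative stand-in under that assumption; the function name, the `temperature` parameter, and the 3x3 toy map are hypothetical.

```python
import math


def soft_argmax_2d(feature_map, temperature=1.0):
    """Return the softmax-weighted (x, y) position of a 2D feature map.

    Coordinates are normalized to [0, 1], with x along columns and y along rows.
    The map must have at least two rows and two columns.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    flat = [value / temperature for row in feature_map for value in row]
    peak = max(flat)
    exps = [math.exp(v - peak) for v in flat]
    total = sum(exps)
    x = 0.0
    y = 0.0
    for index, e in enumerate(exps):
        row, col = divmod(index, width)
        prob = e / total
        x += prob * col / (width - 1)
        y += prob * row / (height - 1)
    return x, y


# Hypothetical feature map with one strong activation in the top-right corner.
toy_map = [
    [0.0, 0.0, 20.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
point_x, point_y = soft_argmax_2d(toy_map)
```

Because the output is a coordinate rather than a full image feature, a downstream motion model that consumes such points sees the same kind of input wherever the object appears, which is one plausible reading of why point-based attention helps with changes in object position.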
RO Mar 10, 2020
Compensation for undefined behaviors during robot task execution by switching controllers depending on embedded dynamics in RNN
Kanata Suzuki, Hiroki Mori, Tetsuya Ogata
Robotic applications require both correct task performance and compensation for undefined behaviors. Although deep learning is a promising approach to perform complex tasks, the response to undefined behaviors that are not reflected in the training dataset remains challenging. In a human-robot collaborative task, the robot may adopt an unexpected posture due to collisions and other unexpected events. Therefore, robots should be able to recover from disturbances for completing the execution of the intended task. We propose a compensation method for undefined behaviors by switching between two controllers. Specifically, the proposed method switches between learning-based and model-based controllers depending on the internal representation of a recurrent neural network that learns task dynamics. We applied the proposed method to a pick-and-place task and evaluated the compensation for undefined behaviors. Experimental results from simulations and on a real robot demonstrate the effectiveness and high performance of the proposed method.
RO Dec 14, 2017
Motion Switching with Sensory and Instruction Signals by designing Dynamical Systems using Deep Neural Network
Kanata Suzuki, Hiroki Mori, Tetsuya Ogata
To ensure that a robot is able to accomplish an extensive range of tasks, it is necessary to achieve a flexible combination of multiple behaviors. This is because the design of task motions suited to each situation becomes increasingly difficult as the number of situations and the types of tasks to be performed increase. To handle the switching and combination of multiple behaviors, we propose a method to design dynamical systems based on point attractors that accept (i) "instruction signals" for instruction-driven switching. We incorporate the (ii) "instruction phase" to form a point attractor and divide the target task into multiple subtasks. By forming an instruction phase that consists of point attractors, the model embeds a subtask in the form of trajectory dynamics that can be manipulated using sensory and instruction signals. Our model comprises two deep neural networks: a convolutional autoencoder and a multiple time-scale recurrent neural network. In this study, we apply the proposed method to manipulate soft materials. To evaluate our model, we design a cloth-folding task that consists of four subtasks and three patterns of instruction signals, which indicate the direction of motion. The results show that the robot can perform the required task by combining subtasks based on sensory and instruction signals. Moreover, our model determined the relations among these signals using its internal dynamics.
ME Dec 12, 2017
Causal Patterns: Extraction of multiple causal relationships by Mixture of Probabilistic Partial Canonical Correlation Analysis
Hiroki Mori, Keisuke Kawano, Hiroki Yokoyama
In this paper, we propose a mixture of probabilistic partial canonical correlation analysis (MPPCCA) that extracts the Causal Patterns from two multivariate time series. Causal patterns refer to the signal patterns within interactions of two elements having multiple types of mutually causal relationships, rather than a mixture of simultaneous correlations or the absence or presence of a causal relationship between the elements. In multivariate statistics, partial canonical correlation analysis (PCCA) evaluates the correlation between two multivariates after subtracting the effect of a third multivariate. PCCA can calculate the Granger Causality Index (which tests whether a time series can be predicted from another time series), but is not applicable to data containing multiple partial canonical correlations. After introducing the MPPCCA, we propose an expectation-maximization (EM) algorithm that estimates the parameters and latent variables of the MPPCCA. The MPPCCA is expected to extract multiple partial canonical correlations from data series without any supervised signals to split the data into clusters. The method was then evaluated in synthetic data experiments. In the synthetic dataset, our method estimated the multiple partial canonical correlations more accurately than the existing method. To determine the types of patterns detectable by the method, experiments were also conducted on real datasets. The method estimated the communication patterns in motion-capture data. The MPPCCA is applicable to various types of signals such as brain signals, human communication, and nonlinear complex multibody systems.
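The notion of a correlation "after subtracting the effect of a third" variable can be illustrated in the simplest scalar case with the standard first-order partial correlation formula. This is a deliberately reduced sketch and not PCCA or MPPCCA themselves, which operate on multivariate data; the function names and the toy series are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z.

    Uses r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy series: x and y both follow z, plus unrelated perturbations.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

raw = pearson(x, y)
partial = partial_correlation(x, y, z)
```

Here `x` and `y` look strongly correlated on their own, but most of that correlation disappears once the shared driver `z` is accounted for. PCCA generalizes this idea from scalars to sets of variables.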