ROMar 18, 2022
Goal-conditioned dual-action imitation learning for dexterous dual-arm robot manipulationHeecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi
Long-horizon dexterous robot manipulation of deformable objects, such as banana peeling, is a problematic task because of the difficulties in object modeling and a lack of knowledge about stable and dexterous manipulation skills. This paper presents a goal-conditioned dual-action (GC-DA) deep imitation learning (DIL) approach that can learn dexterous manipulation skills using human demonstration data. Previous DIL methods map the current sensory input and reactive action, which often fails because of compounding errors in imitation learning caused by the recurrent computation of actions. The method predicts reactive action only when the precise manipulation of the target object is required (local action) and generates the entire trajectory when precise manipulation is not required (global action). This dual-action formulation effectively prevents compounding error in the imitation learning using the trajectory-based global action while responding to unexpected changes in the target object during the reactive local action. The proposed method was tested in a real dual-arm robot and successfully accomplished the banana-peeling task. Data from this and related works are available at: https://sites.google.com/view/multi-task-fine.
ROMay 5
RLDX-1 Technical ReportDongyoung Kim, Huiwon Jang, Myungkyu Koo et al.
While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and language-conditioned generalization) inherited from pre-trained Vision-Language Models, they still struggle with complex real-world tasks requiring broader functional capabilities (e.g. motion awareness, memory-aware decision making, and physical sensing). To address this, we introduce RLDX-1, a general-purpose robotic policy for dexterous manipulation built on the Multi-Stream Action Transformer (MSAT), an architecture that unifies these capabilities by integrating heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. RLDX-1 further combines this architecture with system-level design choices, including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. Through empirical evaluation, we show that RLDX-1 consistently outperforms recent frontier VLAs (e.g. $π_{0.5}$ and GR00T N1.6) across both simulation benchmarks and real-world tasks that require broad functional capabilities beyond general versatility. In particular, RLDX-1 shows superiority in ALLEX humanoid tasks by achieving success rates of 86.8% while $π_{0.5}$ and GR00T N1.6 achieve around 40%, highlighting the ability of RLDX-1 to control a high-DoF humanoid robot under diverse functional demands. Together, these results position RLDX-1 as a promising step toward reliable VLAs for complex, contact-rich, and dynamic real-world dexterous manipulation.
ROJan 15, 2024
Multi-task real-robot data with gaze attention for dual-arm fine manipulationHeecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi
In the field of robotic manipulation, deep imitation learning is recognized as a promising approach for acquiring manipulation skills. Additionally, learning from diverse robot datasets is considered a viable method to achieve versatility and adaptability. In such research, by learning various tasks, robots achieved generality across multiple objects. However, such multi-task robot datasets have mainly focused on single-arm tasks that are relatively imprecise, not addressing the fine-grained object manipulation that robots are expected to perform in the real world. This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation. To this end, we have generated dataset with 224k episodes (150 hours, 1,104 language instructions) which includes dual-arm fine tasks such as bowl-moving, pencil-case opening or banana-peeling, and this data is publicly available. Additionally, this dataset includes visual attention signals as well as dual-action labels, a signal that separates actions into a robust reaching trajectory and precise interaction with objects, and language instructions to achieve robust and precise object manipulation. We applied the dataset to our Dual-Action and Attention (DAA), a model designed for fine-grained dual arm manipulation tasks and robust against covariate shifts. The model was tested with over 7k total trials in real robot manipulation tasks, demonstrating its capability in fine manipulation.
ROFeb 19, 2022
Training Robots without Robots: Deep Imitation Learning for Master-to-Robot Policy TransferHeecheol Kim, Yoshiyuki Ohmura, Akihiko Nagakubo et al.
Deep imitation learning is promising for robot manipulation because it only requires demonstration samples. In this study, deep imitation learning is applied to tasks that require force feedback. However, existing demonstration methods have deficiencies; bilateral teleoperation requires a complex control scheme and is expensive, and kinesthetic teaching suffers from visual distractions from human intervention. This research proposes a new master-to-robot (M2R) policy transfer system that does not require robots for teaching force feedback-based manipulation tasks. The human directly demonstrates a task using a controller. This controller resembles the kinematic parameters of the robot arm and uses the same end-effector with force/torque (F/T) sensors to measure the force feedback. Using this controller, the operator can feel force feedback without a bilateral system. The proposed method can overcome domain gaps between the master and robot using gaze-based imitation learning and a simple calibration method. Furthermore, a Transformer is applied to infer policy from F/T sensory input. The proposed system was evaluated on a bottle-cap-opening task that requires force feedback.
ROFeb 10, 2022
Memory-based gaze prediction in deep imitation learning for robot manipulationHeecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi
Deep imitation learning is a promising approach that does not require hard-coded control rules in autonomous robot manipulation. The current applications of deep imitation learning to robot manipulation have been limited to reactive control based on the states at the current time step. However, future robots will also be required to solve tasks utilizing their memory obtained by experience in complicated environments (e.g., when the robot is asked to find a previously used object on a shelf). In such a situation, simple deep imitation learning may fail because of distractions caused by complicated environments. We propose that gaze prediction from sequential visual input enables the robot to perform a manipulation task that requires memory. The proposed algorithm uses a Transformer-based self-attention architecture for the gaze estimation based on sequential data to implement memory. The proposed method was evaluated with a real robot multi-object manipulation task that requires memory of the previous states.
ROAug 1, 2021
Transformer-based deep imitation learning for dual-arm robot manipulationHeecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi
Deep imitation learning is promising for solving dexterous manipulation tasks because it does not require an environment model and pre-programmed robot behavior. However, its application to dual-arm manipulation tasks remains challenging. In a dual-arm manipulation setup, the increased number of state dimensions caused by the additional robot manipulators causes distractions and results in poor performance of the neural networks. We address this issue using a self-attention mechanism that computes dependencies between elements in a sequential input and focuses on important elements. A Transformer, a variant of self-attention architecture, is applied to deep imitation learning to solve dual-arm manipulation tasks in the real world. The proposed method has been tested on dual-arm manipulation tasks using a real robot. The experimental results demonstrated that the Transformer-based deep imitation learning architecture can attend to the important features among the sensory inputs, therefore reducing distractions and improving manipulation performance when compared with the baseline architecture without the self-attention mechanisms. Data from this and related works are available at: https://sites.google.com/view/multi-task-fine.
ROFeb 2, 2021
Gaze-based dual resolution deep imitation learning for high-precision dexterous robot manipulationHeecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi
A high-precision manipulation task, such as needle threading, is challenging. Physiological studies have proposed connecting low-resolution peripheral vision and fast movement to transport the hand into the vicinity of an object, and using high-resolution foveated vision to achieve the accurate homing of the hand to the object. The results of this study demonstrate that a deep imitation learning based method, inspired by the gaze-based dual resolution visuomotor control system in humans, can solve the needle threading task. First, we recorded the gaze movements of a human operator who was teleoperating a robot. Then, we used only a high-resolution image around the gaze to precisely control the thread position when it was close to the target. We used a low-resolution peripheral image to reach the vicinity of the target. The experimental results obtained in this study demonstrate that the proposed method enables precise manipulation tasks using a general-purpose robot manipulator and improves computational efficiency. Data from this and related works are available at: https://sites.google.com/view/multi-task-fine.
LGMar 22, 2019
Macro Action Reinforcement Learning with Sequence Disentanglement using Variational AutoencoderHeecheol Kim, Masanori Yamada, Kosuke Miyoshi et al.
One problem in the application of reinforcement learning to real-world problems is the curse of dimensionality on the action space. Macro actions, a sequence of primitive actions, have been studied to diminish the dimensionality of the action space with regard to the time axis. However, previous studies relied on humans defining macro actions or assumed macro actions as repetitions of the same primitive actions. We present Factorized Macro Action Reinforcement Learning (FaMARL) which autonomously learns disentangled factor representation of a sequence of actions to generate macro actions that can be directly applied to general reinforcement learning algorithms. FaMARL exhibits higher scores than other reinforcement learning algorithms on environments that require an extensive amount of search.
MLFeb 22, 2019
FAVAE: Sequence Disentanglement using Information Bottleneck PrincipleMasanori Yamada, Heecheol Kim, Kosuke Miyoshi et al.
We propose the factorized action variational autoencoder (FAVAE), a state-of-the-art generative model for learning disentangled and interpretable representations from sequential data via the information bottleneck without supervision. The purpose of disentangled representation learning is to obtain interpretable and transferable representations from data. We focused on the disentangled representation of sequential data since there is a wide range of potential applications if disentanglement representation is extended to sequential data such as video, speech, and stock market. Sequential data are characterized by dynamic and static factors: dynamic factors are time dependent, and static factors are independent of time. Previous models disentangle static and dynamic factors by explicitly modeling the priors of latent variables to distinguish between these factors. However, these models cannot disentangle representations between dynamic factors, such as disentangling "picking up" and "throwing" in robotic tasks. FAVAE can disentangle multiple dynamic factors. Since it does not require modeling priors, it can disentangle "between" dynamic factors. We conducted experiments to show that FAVAE can extract disentangled dynamic factors.