RO Nov 20, 2023
GPT-4V(ision) for Robotics: Multimodal Task Planning from Human Demonstration
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi et al.
We introduce a pipeline that enhances a general-purpose Vision Language Model, GPT-4V(ision), to facilitate one-shot visual teaching for robotic manipulation. This system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances. The process begins with GPT-4V analyzing the videos to obtain textual explanations of environmental and action details. A GPT-4-based task planner then encodes these details into a symbolic task plan. Subsequently, vision systems spatially and temporally ground the task plan in the videos. Objects are identified using an open-vocabulary object detector, and hand-object interactions are analyzed to pinpoint moments of grasping and releasing. This spatiotemporal grounding allows for the gathering of affordance information (e.g., grasp types, waypoints, and body postures) critical for robot execution. Experiments across various scenarios demonstrate the method's efficacy in enabling real robots to operate from one-shot human demonstrations. Meanwhile, quantitative tests have revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline. The prompts of GPT-4V/GPT-4 are available at this project page: https://microsoft.github.io/GPT4Vision-Robot-Manipulation-Prompts/
RO Oct 18, 2023
Bias in Emotion Recognition with ChatGPT
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi et al.
This technical report explores ChatGPT's ability to recognize emotions from text, which can be the basis of various applications such as interactive chatbots, data annotation, and mental health analysis. While prior research has shown ChatGPT's basic ability in sentiment analysis, its performance in more nuanced emotion recognition has not yet been explored. Here, we conducted experiments to evaluate its emotion recognition performance across different datasets and emotion labels. Our findings indicate a reasonable level of reproducibility in its performance, with noticeable improvement through fine-tuning. However, the performance varies with different emotion labels and datasets, highlighting an inherent instability and possible bias. The choice of dataset and emotion labels significantly impacts ChatGPT's emotion recognition performance. This paper sheds light on the importance of dataset and label selection, and the potential of fine-tuning in enhancing ChatGPT's emotion recognition capabilities, providing a groundwork for better integration of emotion analysis in applications using ChatGPT.
CV Aug 30, 2024 (Code)
Open-Vocabulary Action Localization with Iterative Visual Prompting
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi et al.
Video action localization aims to find the timings of specific actions from a long video. Although existing learning-based approaches have been successful, they require annotating videos, which comes with a considerable labor cost. This paper proposes a training-free, open-vocabulary approach based on emerging off-the-shelf vision-language models (VLMs). The challenge stems from the fact that VLMs are neither designed to process long videos nor tailored for finding actions. We overcome these problems by extending an iterative visual prompting technique. Specifically, we sample video frames and create a concatenated image with frame index labels, allowing a VLM to identify the frames that most likely correspond to the start and end of the action. By iteratively narrowing the sampling window around the selected frames, the estimation gradually converges to more precise temporal boundaries. We demonstrate that this technique yields reasonable performance, achieving results comparable to state-of-the-art zero-shot action localization. These results support the use of VLMs as a practical tool for understanding videos. Sample code is available at https://microsoft.github.io/VLM-Video-Action-Localization/
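The iterative narrowing described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sample_indices`, `localize`, and `make_oracle` are hypothetical, and the VLM call is replaced by a stand-in `pick_frame` callable that simply returns the sampled frame closest to a known ground-truth index.

```python
from typing import Callable, List


def sample_indices(lo: int, hi: int, n: int) -> List[int]:
    """Return up to n evenly spaced frame indices in [lo, hi]."""
    if hi <= lo:
        return [lo]
    step = (hi - lo) / (n - 1)
    return sorted({round(lo + i * step) for i in range(n)})


def localize(num_frames: int,
             pick_frame: Callable[[List[int]], int],
             n_samples: int = 8,
             iterations: int = 6) -> int:
    """Iteratively narrow the sampling window around the picked frame.

    pick_frame stands in for the VLM: it receives the sampled frame
    indices (in the paper, a concatenated image with index labels)
    and returns the index it judges closest to the action boundary.
    """
    lo, hi = 0, num_frames - 1
    best = lo
    for _ in range(iterations):
        candidates = sample_indices(lo, hi, n_samples)
        best = pick_frame(candidates)
        # Halve the window width and re-center it on the selection.
        half = max(1, (hi - lo) // 4)
        lo = max(0, best - half)
        hi = min(num_frames - 1, best + half)
    return best


def make_oracle(true_frame: int) -> Callable[[List[int]], int]:
    """Hypothetical stand-in for the VLM: picks the nearest candidate."""
    return lambda candidates: min(candidates, key=lambda i: abs(i - true_frame))


# Localize the start and end of an action separately.
start = localize(1000, make_oracle(437))
end = localize(1000, make_oracle(612))
print(start, end)
```

Running the search once for the start and once for the end keeps the sketch simple; a real system would replace `make_oracle` with a query to a vision-language model.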
CV Apr 11, 2023
Efficiently Collecting Training Dataset for 2D Object Detection by Online Visual Feedback
Takuya Kiyokawa, Naoki Shirakura, Hiroki Katayama et al.
Training deep-learning-based vision systems requires the manual annotation of a significant number of images. Such manual annotation is highly time-consuming and labor-intensive. Although previous studies have attempted to eliminate the effort required for annotation, the effort required for image collection was retained. To address this, we propose a human-in-the-loop dataset collection method that uses a web application. To balance workload against performance and to keep collectors motivated by making the collection of multi-view object image datasets enjoyable, we propose three types of online visual feedback features that track the progress of the collection. Our experiments thoroughly investigated the impact of each feature on collection performance and quality of operation. The results suggested the feasibility of annotation and object detection.
RO May 10, 2023 (Code)
GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi et al.
This technical paper introduces a chatting robot system that utilizes recent advancements in large-scale language models (LLMs) such as GPT-3 and ChatGPT. The system is integrated with a co-speech gesture generation system, which selects appropriate gestures based on the conceptual meaning of speech. Our motivation is to explore ways of utilizing the recent progress in LLMs for practical robotic applications, which benefits the development of both chatbots and LLMs. Specifically, it enables the development of highly responsive chatbot systems by leveraging LLMs and adds visual effects to the user interface of LLMs as an additional value. The source code for the system is available on GitHub for our in-house robot (https://github.com/microsoft/LabanotationSuite/tree/master/MSRAbotChatSimulation) and GitHub for Toyota HSR (https://github.com/microsoft/GPT-Enabled-HSR-CoSpeechGestures).
RO Jan 7, 2025
VLM-driven Behavior Tree for Context-aware Task Planning
Naoki Wake, Atsushi Kanehira, Jun Takamatsu et al.
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
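A behavior tree with free-form visual condition nodes, as described in this abstract, might look like the following minimal sketch. All class and function names here are hypothetical, nodes return only success or failure (no RUNNING state), and the VLM that evaluates a condition against an image is replaced by a stub `check` callable that looks the condition text up in a dictionary of facts.

```python
from dataclasses import dataclass
from typing import Callable, List

# check(condition_text, image) -> bool; a stand-in for the VLM evaluation.
Checker = Callable[[str, dict], bool]


@dataclass
class Condition:
    text: str  # free-form visual condition, e.g. written by the VLM

    def tick(self, image: dict, check: Checker, log: List[str]) -> bool:
        return check(self.text, image)


@dataclass
class Action:
    name: str

    def tick(self, image: dict, check: Checker, log: List[str]) -> bool:
        log.append(self.name)  # record the action instead of moving a robot
        return True


@dataclass
class Sequence:
    children: list

    def tick(self, image: dict, check: Checker, log: List[str]) -> bool:
        # Succeeds only if every child succeeds; stops at the first failure.
        return all(c.tick(image, check, log) for c in self.children)


@dataclass
class Fallback:
    children: list

    def tick(self, image: dict, check: Checker, log: List[str]) -> bool:
        # Succeeds as soon as one child succeeds.
        return any(c.tick(image, check, log) for c in self.children)


def stub_check(text: str, image: dict) -> bool:
    """Hypothetical VLM stub: the 'image' is a dict of known facts."""
    return image.get(text, False)


tree = Fallback([
    Sequence([Condition("the cup on the table is empty"), Action("collect_cup")]),
    Action("wait"),
])

log_empty: List[str] = []
tree.tick({"the cup on the table is empty": True}, stub_check, log_empty)

log_full: List[str] = []
tree.tick({}, stub_check, log_full)
print(log_empty, log_full)
```

Keeping the condition as plain text means the same string can be placed directly into the evaluating model's prompt at execution time, which is the property the abstract highlights.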
RO Apr 1, 2025
Plan-and-Act using Large Language Models for Interactive Agreement
Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira et al.
Recent large language models (LLMs) are capable of planning robot actions. In this paper, we explore how LLMs can be used for planning actions with tasks involving situational human-robot interaction (HRI). A key problem of applying LLMs in situational HRI is balancing between "respecting the current human's activity" and "prioritizing the robot's task," as well as understanding the timing of when to use the LLM to generate an action plan. In this paper, we propose a necessary plan-and-act skill design to solve the above problems. We show that a critical factor for enabling a robot to switch between passive / active interaction behavior is to provide the LLM with an action text about the current robot's action. We also show that a second-stage question to the LLM (about the next timing to call the LLM) is necessary for planning actions at an appropriate timing. The skill design is applied to an Engage skill and is tested on four distinct interaction scenarios. We show that by using the skill design, LLMs can be leveraged to easily scale to different HRI scenarios with a reasonable success rate reaching 90% on the test scenarios.
RO Dec 15, 2024
Modality-Driven Design for Multi-Step Dexterous Manipulation: Insights from Neuroscience
Naoki Wake, Atsushi Kanehira, Daichi Saito et al.
Multi-step dexterous manipulation is a fundamental skill in household scenarios, yet remains an underexplored area in robotics. This paper proposes a modular approach, where each step of the manipulation process is addressed with dedicated policies based on effective modality input, rather than relying on a single end-to-end model. To demonstrate this, a dexterous robotic hand performs a manipulation task involving picking up and rotating a box. Guided by insights from neuroscience, the task is decomposed into three sub-skills, 1) reaching, 2) grasping and lifting, and 3) in-hand rotation, based on the dominant sensory modalities employed in the human brain. Each sub-skill is addressed using distinct methods from a practical perspective: a classical controller, a Vision-Language-Action model, and a reinforcement learning policy with force feedback, respectively. We tested the pipeline on a real robot to demonstrate the feasibility of our approach. The key contribution of this study lies in presenting a neuroscience-inspired, modality-driven methodology for multi-step dexterous manipulation.
HC Jan 7, 2025
Agreeing to Interact in Human-Robot Interaction using Large Language Models and Vision Language Models
Kazuhiro Sasabuchi, Naoki Wake, Atsushi Kanehira et al.
In human-robot interaction (HRI), the beginning of an interaction is often complex. Whether the robot should communicate with the human depends on several situational factors (e.g., the human's current activity and the urgency of the interaction). We test whether large language models (LLMs) and vision language models (VLMs) can provide solutions to this problem. We compare four different system-design patterns using LLMs and VLMs, and evaluate them on a test set containing 84 human-robot situations. The test set mixes several publicly available datasets and also includes situations where the appropriate action to take is open-ended. Our results using the GPT-4o and Phi-3 Vision models indicate that LLMs and VLMs are capable of handling interaction beginnings when the desired actions are clear; however, challenges remain in open-ended situations where the model must balance between the human's and the robot's situations.
RO May 1, 2025
IK Seed Generator for Dual-Arm Human-like Physicality Robot with Mobile Base
Jun Takamatsu, Atsushi Kanehira, Kazuhiro Sasabuchi et al.
Robots are widely expected to take over tasks currently performed by humans. If a robot has a human-like physicality, the range of human tasks it can take over increases. Household service robots should be of human-like size so that they can coexist with humans in their operating environment without becoming excessively large. However, robots with size limitations tend to have difficulty solving inverse kinematics (IK) due to mechanical limitations, such as joint angle limits. Conversely, if the difficulty arising from these limitations could be mitigated, such robots would become more valuable. In a numerical IK solver, which is commonly used for robots with higher degrees of freedom (DOF), the solvability of IK depends on the initial guess given to the solver. Thus, this paper proposes a method for generating a good initial guess for a numerical IK solver given the target hand configuration. For this purpose, we define the goodness of an initial guess using the scaled Jacobian matrix, which can calculate the manipulability index considering the joint limits. These two factors are related to the difficulty of solving IK. We generate the initial guess by optimizing the goodness using a genetic algorithm (GA). To enumerate many possible IK solutions, we use a reachability map that represents the reachable area of the robot hand in the arm-base coordinate system. We conduct a quantitative evaluation and show that using an initial guess judged to be better by the goodness value increases the probability that IK is solved. Finally, as an application of the proposed method, we show that, by generating good initial guesses for IK, a robot actually accomplishes three typical scenarios.
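To make the idea of scoring an IK initial guess concrete, here is a rough sketch for a planar two-link arm. It is not the paper's scaled-Jacobian formulation: it multiplies the standard manipulability index (for this arm, |det J| = l1 * l2 * |sin q2|) by a simple, assumed joint-limit penalty, and it picks the best seed from a small candidate list rather than running a genetic algorithm. All function names, the constant `k`, and the joint limits are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Limits = Sequence[Tuple[float, float]]


def manipulability_2link(q2: float, l1: float = 1.0, l2: float = 1.0) -> float:
    """Manipulability |det J| of a planar 2-link arm: l1 * l2 * |sin(q2)|."""
    return l1 * l2 * abs(math.sin(q2))


def joint_limit_penalty(q: Sequence[float], limits: Limits, k: float = 50.0) -> float:
    """Factor in [0, 1) that drops to 0 when any joint sits on a limit."""
    prod = 1.0
    for qi, (lo, hi) in zip(q, limits):
        prod *= (qi - lo) * (hi - qi) / (hi - lo) ** 2
    return 1.0 - math.exp(-k * max(prod, 0.0))


def goodness(q: Sequence[float], limits: Limits) -> float:
    """Assumed goodness score: manipulability scaled by the limit penalty."""
    return manipulability_2link(q[1]) * joint_limit_penalty(q, limits)


def best_seed(candidates: List[Tuple[float, float]], limits: Limits) -> Tuple[float, float]:
    """Pick the highest-scoring candidate (stand-in for the GA search)."""
    return max(candidates, key=lambda q: goodness(q, limits))


limits = [(-math.pi / 2, math.pi / 2), (0.0, math.pi)]
candidates = [
    (0.0, math.pi / 2),   # mid-range joints, elbow bent at 90 degrees
    (0.0, 0.01),          # nearly straight arm, close to a singularity
    (1.5, math.pi / 2),   # first joint close to its upper limit
]
seed = best_seed(candidates, limits)
print(seed, goodness(seed, limits))
```

The score is low both near singular configurations and near joint limits, the two situations the abstract associates with IK being hard to solve.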
RO Apr 7, 2025
A Taxonomy of Self-Handover
Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi et al.
Self-handover, transferring an object between one's own hands, is a common but understudied bimanual action. While it facilitates seamless transitions in complex tasks, the strategies underlying its execution remain largely unexplored. Here, we introduce the first systematic taxonomy of self-handover, derived from manual annotation of over 12 hours of cooking activity performed by 21 participants. Our analysis reveals that self-handover is not merely a passive transition, but a highly coordinated action involving anticipatory adjustments by both hands. As a step toward automated analysis of human manipulation, we further demonstrate the feasibility of classifying self-handover types using a state-of-the-art vision-language model. These findings offer fresh insights into bimanual coordination, underscoring the role of self-handover in enabling smooth task transitions, an ability essential for adaptive dual-arm robotics.
RO Nov 16, 2021
Active Vapor-Based Robotic Wiper
Takuya Kiyokawa, Hiroki Katayama, Jun Takamatsu et al.
This paper presents a method for estimating normals of mirrors and transparent objects challenging for cameras to recognize. We propose spraying water vapor onto mirror or transparent surfaces to create a diffuse reflective surface. Using an ultrasonic humidifier on a robotic arm, we apply water vapor to the target object's surface, forming a cross-shaped misted area. This creates partially diffuse reflective surfaces, enabling the camera to detect the target object's surface. Adjusting the gripper-mounted camera viewpoint maximizes the extracted misted area's appearance in the image, allowing normal estimation of the target surface. Experiments show the method's effectiveness, with RMSEs of azimuth estimation for mirrors and transparent glass at approximately 4.2 and 5.8 degrees, respectively. Our robot experiments demonstrated that our robotic wiper can perform contact-force-regulated wiping motions to clean a transparent window, akin to human performance.
RO Sep 15, 2021
Soft-Jig: A Flexible Sensing Jig for Simultaneously Fixing and Estimating Orientation of Assembly Parts
Tatsuya Sakuma, Takuya Kiyokawa, Jun Takamatsu et al.
For assembly tasks, it is essential to firmly fix target parts and to accurately estimate their poses. Several rigid jigs for individual parts are frequently used in assembly factories to achieve precise and time-efficient product assembly. However, providing customized jigs is time-consuming. In this study, to address the lack of versatility in the shapes the jigs can be used for, we developed a flexible jig with a soft membrane including transparent beads and oil with a tuned refractive index. The bead-based jamming transition was accomplished by discharging only oil enabling a part to be firmly fixed. Because the two cameras under the jig are able to capture membrane shape changes, we proposed a sensing method to estimate the orientation of the part based on the behaviors of markers created on the jig's inner surface. Through estimation experiments, the proposed system could estimate the orientation of a cylindrical object with a diameter larger than 50 mm and an RMSE of less than 3 degrees.
RO Apr 2, 2021
Robotic Waste Sorter with Agile Manipulation and Quickly Trainable Detector
Takuya Kiyokawa, Hiroki Katayama, Yuya Tatsuta et al.
Owing to human labor shortages, the automation of labor-intensive manual waste-sorting is needed. The goal of automating waste-sorting is to replace the human role of robust detection and agile manipulation of waste items with robots. To achieve this, we propose three methods. First, we provide a combined manipulation method using graspless push-and-drop and pick-and-release manipulation. Second, we provide a robotic system that can automatically collect object images to quickly train a deep neural-network model. Third, we provide a method to mitigate the differences in the appearance of target objects between two scenes: one for dataset collection and the other for waste sorting in a recycling factory. If such differences exist, the performance of a trained waste detector may decrease. We address differences in illumination and background by applying object scaling, histogram matching with histogram equalization, and background synthesis to the source target-object images. Via experiments in an indoor experimental workplace for waste-sorting, we confirm that the proposed methods enable quick collection of the training image sets for three classes of waste items (i.e., aluminum can, glass bottle, and plastic bottle) and detection with higher performance than methods that do not consider the differences. We also confirm that the proposed method enables the robot to quickly manipulate the objects.
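One of the appearance-adaptation steps named in this abstract is histogram matching. The sketch below shows plain CDF-based histogram matching on flat lists of 8-bit grayscale values. It is an illustrative assumption rather than the authors' pipeline: it omits the histogram equalization, object scaling, and background synthesis steps, and the function names are hypothetical.

```python
from bisect import bisect_left
from typing import List


def cdf(pixels: List[int], levels: int = 256) -> List[float]:
    """Cumulative distribution of intensity levels, in [0, 1]."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    out, acc = [], 0
    for h in hist:
        acc += h
        out.append(acc / total)
    return out


def match_histogram(source: List[int], reference: List[int],
                    levels: int = 256) -> List[int]:
    """Remap source intensities so their distribution follows the reference."""
    src_cdf = cdf(source, levels)
    ref_cdf = cdf(reference, levels)
    # Map each source level to the smallest reference level whose
    # cumulative probability is at least as large.
    mapping = [min(bisect_left(ref_cdf, c), levels - 1) for c in src_cdf]
    return [mapping[p] for p in source]


# A dark "collection scene" patch remapped toward a brighter "factory" patch.
source = [10, 10, 20, 30]
reference = [100, 150, 150, 200]
matched = match_histogram(source, reference)
print(matched)
```

For color images the same remapping would typically be applied per channel, which is one way to reduce illumination differences between the two scenes.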
RO Mar 3, 2021
Semantic constraints to represent common sense required in household actions for multi-modal Learning-from-observation robot
Katsushi Ikeuchi, Naoki Wake, Riku Arakawa et al.
The paradigm of learning-from-observation (LfO) enables a robot to learn how to perform actions by observing human-demonstrated actions. Previous research in LfO has mainly focused on the industrial domain, which consists only of the observable physical constraints between a manipulating tool and the robot's working environment. To extend this paradigm to the household domain, which includes non-observable constraints derived from human common sense, we introduce the idea of semantic constraints. The semantic constraints are represented similarly to the physical constraints by defining a contact with an imaginary semantic environment. We thoroughly investigate the necessary and sufficient set of contact states and state transitions to understand the different types of physical and semantic constraints. We then apply our constraint representation to analyze various actions in top-hit household YouTube videos and real home cooking recordings. We further categorize the frequently appearing constraint patterns into physical, semantic, and multistage task groups and verify that these groups are not only necessary but also sufficient for covering standard household actions. Finally, we conduct a preliminary experiment using textual input to explore the possibilities of combining verbal and visual input for recognizing the task groups. Our results provide promising directions for incorporating common sense in the literature of robot teaching.
RO Dec 9, 2020
Toward an Affective Touch Robot: Subjective and Physiological Evaluation of Gentle Stroke Motion Using a Human-Imitation Hand
Tomoki Ishikura, Akishige Yuguchi, Yuki Kitamura et al.
Affective touch offers positive psychological and physiological benefits such as the mitigation of stress and pain. If a robot could realize human-like affective touch, it would open up new application areas, including supporting care work. In this research, we focused on the gentle stroking motion of a robot to evoke the same emotions that human touch would evoke: in other words, an affective touch robot. We propose a robot that is able to gently stroke the back of a human using our designed human-imitation hand. To evaluate the emotional effects of this affective touch, we compared the results of a combination of two agents (the human-imitation hand and the human hand), at two stroke speeds (3 and 30 cm/s). The results of the subjective and physiological evaluations highlighted the following three findings: 1) the subjects evaluated strokes similarly with regard to the stroke speed of the human and human-imitation hand, in both the subjective and physiological evaluations; 2) the subjects felt greater pleasure and arousal at the faster stroke rate (30 cm/s rather than 3 cm/s); and 3) poorer fitting of the human-imitation hand due to the bending of the back had a negative emotional effect on the subjects.
RO Oct 21, 2020
Assembly Sequences Based on Multiple Criteria Against Products with Deformable Parts
Takuya Kiyokawa, Jun Takamatsu, Tsukasa Ogasawara
Aiming to generate easy-to-handle assembly sequences for robotic assembly, this study tackles assembly sequence generation by considering two tradeoff objectives: (1) insertion conditions and (2) degrees of constraints among assembled parts. We propose a multiobjective genetic algorithm to balance these two objectives for generating assembly sequences. Furthermore, the method of extracting part relation matrices including interference-free, insertion, and degree of constraint matrices is extended for application to 3D computer-aided design (CAD) models, including deformable parts. The interference of deformable parts with other parts can be easily investigated by scaling parts. A simulation experiment was conducted using the proposed method, and the results show the possibility of obtaining Pareto-optimal solutions of assembly sequences for a 3D CAD model with 33 parts including a deformable part. This approach can potentially be extended to handle various types of deformable parts and to explore graspable sequences during assembly operations.
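The notion of Pareto-optimal assembly sequences under two competing objectives can be illustrated with a small non-dominated filter. This sketch covers only the selection of Pareto-optimal candidates, not the multiobjective genetic algorithm itself. The sequence labels and the two cost values (an insertion cost and a constraint cost, both to be minimized) are made-up numbers for illustration.

```python
from typing import List, Sequence, Tuple

Scored = Tuple[str, Tuple[float, ...]]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and better in one.

    All objectives are treated as costs to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(scored: List[Scored]) -> List[Scored]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        (name, costs)
        for name, costs in scored
        if not any(dominates(other, costs) for _, other in scored)
    ]


# Hypothetical (insertion cost, constraint cost) for four candidate sequences.
scored = [
    ("A", (1.0, 5.0)),
    ("B", (2.0, 2.0)),
    ("C", (3.0, 3.0)),  # worse than B in both objectives
    ("D", (5.0, 1.0)),
]
front = pareto_front(scored)
print([name for name, _ in front])
```

Because the two objectives trade off against each other, the result is a set of sequences rather than a single best one, and a planner or human can then choose among them.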
RO Oct 21, 2020
Soft-Jig-Driven Assembly Operations
Takuya Kiyokawa, Tatsuya Sakuma, Jun Takamatsu et al.
To design a general-purpose assembly robot system that can handle objects of various shapes, we propose a soft jig that fits to the shapes of assembly parts. The functionality of the soft jig is based on a jamming gripper developed in the field of soft robotics. The soft jig has a bag covered with a malleable silicone membrane, which has high friction, elongation, and contraction rates for keeping parts fixed. The bag is filled with glass beads to achieve a jamming transition. We propose a method to configure parts-fixing on the soft jig based on contact relations, reachable directions, and the center of gravity of the parts that are fixed on the jig. The usability of the soft jig was evaluated in terms of the fixing performance and versatility for various shapes and postures of parts.
RO Aug 4, 2020
A Learning-from-Observation Framework: One-Shot Robot Teaching for Grasp-Manipulation-Release Household Operations
Naoki Wake, Riku Arakawa, Iori Yanokura et al.
A household robot is expected to perform various manipulative operations with an understanding of the purpose of the task. To this end, a desirable robotic application should provide an on-site robot teaching framework for non-experts. Here we propose a Learning-from-Observation (LfO) framework for grasp-manipulation-release class household operations (GMR-operations). The framework maps human demonstrations to predefined task models through one-shot teaching. Each task model contains both high-level knowledge regarding the geometric constraints and low-level knowledge related to human postures. The key idea is to design a task model that 1) covers various GMR-operations and 2) includes human postures to achieve tasks. We verify the applicability of our framework by testing an operational LfO system with a real robot. In addition, we quantify the coverage of the task model by analyzing online videos of household operations. In the context of one-shot robot teaching, the contribution of this study is a framework that 1) covers various GMR-operations and 2) mimics human postures during the operations.
RO Jul 2, 2020
Control of Walking Assist Exoskeleton with Time-delay Based on the Prediction of Plantar Force
Ming Ding, Mikihisa Nagashima, Sung-Gwi Cho et al.
Many kinds of lower-limb exoskeletons have been developed for walking assistance. However, when controlling these exoskeletons, time-delay due to computation time and communication delays is still a general problem. In this research, we propose a novel method to prevent the time-delay when controlling a walking assist exoskeleton by predicting the future plantar force and walking status. By using Long Short-Term Memory and a fully-connected network, the plantar force can be predicted using only data measured by inertial measurement unit sensors, not only during the walking period but also at the start and end of walking. From the predicted plantar force, the walking status and the desired assistance timing can also be determined. By considering the time-delay and sending the control commands beforehand, the exoskeleton can be moved precisely at the desired assistance timing. In experiments, the prediction accuracy of the plantar force and the assistance timing are confirmed. The performance of the proposed method is also evaluated by using the trained model to control the exoskeleton.
CV Nov 22, 2018
Multi-View Inpainting for RGB-D Sequence
Feiran Li, Gustavo Alfonso Garcia Ricardez, Jun Takamatsu et al.
In this work, we propose a novel approach to remove undesired objects from RGB-D sequences captured with freely moving cameras, which enables static 3D reconstruction. Our method jointly uses existing information from multiple frames and generates new information via inpainting techniques. We use balanced rules to select source frames, a local-homography-based image warping method for alignment, and a Markov random field (MRF) based approach for combining existing information. For the remaining holes, we employ an exemplar-based multi-view inpainting method to complete the color image and coherently use it as guidance to complete the corresponding depth. Experiments show that our approach is capable of removing the undesired objects and inpainting the holes.