70.8ROMay 27
Whose Is This?: Context-Aware Object Ownership Inference with Uncertainty-Guided QuestioningSaki Hashimoto, Akira Taniguchi, Shoichi Hasegawa et al.
Service robots must infer object ownership to correctly interpret instructions such as "bring me my cup." However, ownership is a latent attribute that cannot be directly observed, and existing methods often rely on limited cues such as recent usage, making them unreliable in scenarios such as temporary sharing. We propose a framework for context-aware ownership inference with uncertainty-guided interaction (COIN). The method integrates user background information and object usage history using a large language model (LLM) to estimate ownership scores. To handle uncertainty, we apply conformal prediction to construct a set of plausible owners and selectively generate user queries when the prediction is uncertain. Experiments in a simulated home environment show that the proposed method consistently outperforms baseline approaches, achieving a Subset Accuracy of 0.988 and a Mean Jaccard index of 0.991. The method also maintains high performance in scenarios involving temporary use and shared ownership. The results demonstrate that combining contextual reasoning with uncertainty-aware interaction improves both estimation accuracy and robustness. The project page is available at https://emergentsystemlabstudent.github.io/COIN/.
47.3ROMar 30
Reducing Mental Workload through On-Demand Human Assistance for Physical Action Failures in LLM-based Multi-Robot CoordinationShoichi Hasegawa, Akira Taniguchi, Lotfi El Hafi et al.
Multi-robot coordination based on large language models (LLMs) has attracted growing attention, since LLMs enable the direct translation of natural language instructions into robot action plans by decomposing tasks and generating high-level plans. However, recovering from physical execution failures remains difficult, and tasks often stagnate due to the repetition of the same unsuccessful actions. While frameworks for remote robot operation using Mixed Reality were proposed, there have been few attempts to implement remote error resolution specifically for physical failures in multi-robot environments. In this study, we propose REPAIR (Robot Execution with Planned And Interactive Recovery), a human-in-the-loop framework that integrates remote error resolution into LLM-based multi-robot planning. In this method, robots execute tasks autonomously; however, when an irrecoverable failure occurs, the LLM requests assistance from an operator, enabling task continuity through remote intervention. Evaluations using a multi-robot trash collection task in a real-world environment confirmed that REPAIR significantly improves task progress (the number of items cleared within a time limit) compared to fully autonomous methods. Furthermore, for easily collectable items, it achieved task progress equivalent to full remote control. The results also suggested that the mental workload on the operator may differ in terms of physical demand and effort. The project website is https://emergentsystemlabstudent.github.io/REPAIR/.
ROAug 22, 2025
Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View InstructionsAkira Oyama, Shoichi Hasegawa, Akira Taniguchi et al.
Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as ``Bring me that cup,'' even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), which is a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from a linguistic query with the user's skeletal data. SSL is utilized to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed results that were approximately 1.3 times better when the user was visible to the robot and 2.0 times better when the user was not visible to the robot, compared to the methods without SSL and interactive questioning. The project website is https://emergentsystemlabstudent.github.io/MIEL/.
ROSep 16, 2025
Toward Ownership Understanding of Objects: Active Question Generation with Large Language Model and Probabilistic Generative ModelSaki Hashimoto, Shoichi Hasegawa, Tomochika Ishikawa et al.
Robots operating in domestic and office environments must understand object ownership to correctly execute instructions such as ``Bring me my cup.'' However, ownership cannot be reliably inferred from visual features alone. To address this gap, we propose Active Ownership Learning (ActOwL), a framework that enables robots to actively generate and ask ownership-related questions to users. ActOwL employs a probabilistic generative model to select questions that maximize information gain, thereby acquiring ownership knowledge efficiently to improve learning efficiency. Additionally, by leveraging commonsense knowledge from Large Language Models (LLM), objects are pre-classified as either shared or owned, and only owned objects are targeted for questioning. Through experiments in a simulated home environment and a real-world laboratory setting, ActOwL achieved significantly higher ownership clustering accuracy with fewer questions than baseline methods. These findings demonstrate the effectiveness of combining active inference with LLM-guided commonsense reasoning, advancing the capability of robots to acquire ownership knowledge for practical and socially appropriate task execution.
ROSep 16, 2025
Multi-Robot Task Planning for Multi-Object Retrieval Tasks with Distributed On-Site Knowledge via Large Language ModelsKento Murata, Shoichi Hasegawa, Tomochika Ishikawa et al.
It is crucial to efficiently execute instructions such as "Find an apple and a banana" or "Get ready for a field trip," which require searching for multiple objects or understanding context-dependent commands. This study addresses the challenging problem of determining which robot should be assigned to which part of a task when each robot possesses different situational on-site knowledge-specifically, spatial concepts learned from the area designated to it by the user. We propose a task planning framework that leverages large language models (LLMs) and spatial concepts to decompose natural language instructions into subtasks and allocate them to multiple robots. We designed a novel few-shot prompting strategy that enables LLMs to infer required objects from ambiguous commands and decompose them into appropriate subtasks. In our experiments, the proposed method achieved 47/50 successful assignments, outperforming random (28/50) and commonsense-based assignment (26/50). Furthermore, we conducted qualitative evaluations using two actual mobile manipulators. The results demonstrated that our framework could handle instructions, including those involving ad hoc categories such as "Get ready for a field trip," by successfully performing task decomposition, assignment, sequential planning, and execution.
ROApr 15, 2024
Real-world Instance-specific Image Goal Navigation: Bridging Domain Gaps via Contrastive LearningTaichi Sakaguchi, Akira Taniguchi, Yoshinobu Hagiwara et al.
Improving instance-specific image goal navigation (InstanceImageNav), which locates the identical object in a real-world environment from a query image, is essential for robotic systems to assist users in finding desired objects. The challenge lies in the domain gap between low-quality images observed by the moving robot, characterized by motion blur and low-resolution, and high-quality query images provided by the user. Such domain gaps could significantly reduce the task success rate but have not been the focus of previous work. To address this, we propose a novel method called Few-shot Cross-quality Instance-aware Adaptation (CrossIA), which employs contrastive learning with an instance classifier to align features between massive low- and few high-quality images. This approach effectively reduces the domain gap by bringing the latent representations of cross-quality images closer on an instance basis. Additionally, the system integrates an object image collection with a pre-trained deblurring model to enhance the observed image quality. Our method fine-tunes the SimSiam model, pre-trained on ImageNet, using CrossIA. We evaluated our method's effectiveness through an InstanceImageNav task with 20 different types of instances, where the robot identifies the same instance in a real-world environment as a high-quality query image. Our experiments showed that our method improves the task success rate by up to three times compared to the baseline, a conventional approach based on SuperGlue. These findings highlight the potential of leveraging contrastive learning and image enhancement techniques to bridge the domain gap and improve object localization in robotic applications. The project website is https://emergentsystemlabstudent.github.io/DomainBridgingNav/.
HCJun 27, 2019
Sensitivity to Haptic-Audio Envelope AsynchronyAlfonso Balandra, Shoichi Hasegawa
We want to understand the human capabilities to perceive amplitude similarities between a haptic and an audio signal. So, four psychophysical experiments were performed. Three of them measured the asynchrony JND (Just Noticeable Difference) at the signals' attack, release and decay, while the forth experiment measured the amplitude decrease on the middle of the signal. All the experiments used a combination of the constant stimulus and staircase methods to present two stimuli, while the participants' (N=12) task was to identify which of the two stimuli was synchronized. The audiotactile stimulus was defined using an stereo audio signal with an ADSR (Attack Decay Sustain Release) envelope. The partial results reveal JNDs for temporal asynchrony of: 54ms for attack, 265ms for decay and 57ms for release. Also the results reveal an amplitude decrease JND of 25\%. Although for decay the results were to disperse, therefore we suspect that the participants were not able to the changes on the haptic signal.