Xiaohai Hu

RO
h-index11
4papers
178citations
Novelty56%
AI Score45

4 Papers

CLOct 9, 2023
GraphLLM: Boosting Graph Reasoning Ability of Large Language Model

Ziwei Chai, Tianjie Zhang, Liang Wu et al. · cmu

The advancement of Large Language Models (LLMs) has remarkably pushed the boundaries towards artificial general intelligence (AGI), with their exceptional ability on understanding diverse types of information, including but not limited to images and audio. Despite this progress, a critical gap remains in empowering LLMs to proficiently understand and reason on graph data. Recent studies underscore LLMs' underwhelming performance on fundamental graph reasoning tasks. In this paper, we endeavor to unearth the obstacles that impede LLMs in graph reasoning, pinpointing the common practice of converting graphs into natural language descriptions (Graph2Text) as a fundamental bottleneck. To overcome this impediment, we introduce GraphLLM, a pioneering end-to-end approach that synergistically integrates graph learning models with LLMs. This synergy equips LLMs with the ability to proficiently interpret and reason on graph data, harnessing the superior expressive power of graph learning models. Our empirical evaluations across four fundamental graph reasoning tasks validate the effectiveness of GraphLLM. The results exhibit a substantial average accuracy enhancement of 54.44%, alongside a noteworthy context reduction of 96.45% across various graph reasoning tasks.

ROMar 2, 2023
Learning to Detect Slip through Tactile Estimation of the Contact Force Field and its Entropy

Xiaohai Hu, Aparajit Venkatesh, Yusen Wan et al.

Detection of slip during object grasping and manipulation plays a vital role in object handling. Existing solutions primarily rely on visual information to devise a strategy for grasping. However, for robotic systems to attain a level of proficiency comparable to humans, especially in consistently handling and manipulating unfamiliar objects, integrating artificial tactile sensing is increasingly essential. We introduce a novel physics-informed, data-driven approach to detect slip continuously in real time. We employ the GelSight Mini, an optical tactile sensor, attached to custom-designed grippers to gather tactile data. Our work leverages the inhomogeneity of tactile sensor readings during slip events to develop distinctive features and formulates slip detection as a classification problem. To evaluate our approach, we test multiple data-driven models on 10 common objects under different loading conditions, textures, and materials. Our results show that the best classification algorithm achieves a high average accuracy of 95.61%. We further illustrate the practical application of our research in dynamic robotic manipulation tasks, where our real-time slip detection and prevention algorithm is implemented.

68.5ROMar 16
MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

Xunlan Zhou, Xuanlin Chen, Shaowei Zhang et al.

Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL-Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.

CVMay 22, 2024
Context and Geometry Aware Voxel Transformer for Semantic Scene Completion

Zhu Yu, Runmin Zhang, Jiacheng Ying et al.

Vision-based Semantic Scene Completion (SSC) has gained much attention due to its widespread applications in various 3D perception tasks. Existing sparse-to-dense approaches typically employ shared context-independent queries across various input images, which fails to capture distinctions among them as the focal regions of different inputs vary and may result in undirected feature aggregation of cross-attention. Additionally, the absence of depth information may lead to points projected onto the image plane sharing the same 2D position or similar sampling points in the feature map, resulting in depth ambiguity. In this paper, we present a novel context and geometry aware voxel transformer. It utilizes a context aware query generator to initialize context-dependent queries tailored to individual input images, effectively capturing their unique characteristics and aggregating information within the region of interest. Furthermore, it extend deformable cross-attention from 2D to 3D pixel space, enabling the differentiation of points with similar image coordinates based on their depth coordinates. Building upon this module, we introduce a neural network named CGFormer to achieve semantic scene completion. Simultaneously, CGFormer leverages multiple 3D representations (i.e., voxel and TPV) to boost the semantic and geometric representation abilities of the transformed 3D volume from both local and global perspectives. Experimental results demonstrate that CGFormer achieves state-of-the-art performance on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, attaining a mIoU of 16.87 and 20.05, as well as an IoU of 45.99 and 48.07, respectively. Remarkably, CGFormer even outperforms approaches employing temporal images as inputs or much larger image backbone networks.