RO | CV | Aug 24, 2023

HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks

arXiv:2308.12537v1 | 7 citations | h-index: 11 | Has Code
Originality: Synthesis-oriented
AI Analysis

This work addresses the problem of enabling robots to understand and act on human instructions in real-world scenarios, though it appears incremental as it builds on existing vision-language models for a specific domain.

The authors tackle the challenge of bridging human natural language and robot perception by proposing HuBo-VLM, a unified vision-language model for human-robot interaction tasks such as object detection and visual grounding, and they demonstrate its effectiveness on the Talk2Car benchmark.

Human-robot interaction is an exciting task that aims to guide robots to follow instructions from humans. Because a huge gap lies between human natural language and machine code, building end-to-end human-robot interaction models is fairly challenging. Furthermore, the visual information received from a robot's sensors is itself a hard language for the robot to perceive. In this work, HuBo-VLM is proposed to tackle perception tasks associated with human-robot interaction, including object detection and visual grounding, with a unified transformer-based vision-language model. Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of our approach. Code will be publicly available at https://github.com/dzcgaara/HuBo-VLM.

Code Implementations: 1 repo
