RO · Sep 22, 2021

Audio-Visual Grounding Referring Expression for Robotic Manipulation

arXiv:2109.10571v1 · 18 citations
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of precise object manipulation in human-robot interaction; the advance is incremental because it builds on existing audio-visual grounding methods.

The paper tackles the problem of enabling robots to understand referring expressions in manipulation instructions by leveraging both audio and visual information, resulting in improved performance compared to using visual data alone.

Referring expressions are commonly used in everyday dialogue to point to a specific target. In this paper, we develop a novel task of audio-visual grounding referring expression for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in a given manipulation instruction and then carries out the corresponding manipulation. To solve the proposed task, we propose an audio-visual framework for visual localization and sound recognition. We have also established a dataset containing visual data, auditory data, and manipulation instructions for evaluation. Finally, extensive offline and online experiments verify the effectiveness of the proposed framework and demonstrate that the robot performs better with audio-visual data than with visual data alone.

