RO | AI | CV | LG | May 14, 2024

Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation

arXiv:2405.08576v1 | 26 citations | h-index: 18 | ICRA
Originality: Incremental advance
AI Analysis

This work addresses low-data regimes in robotics by enabling pretraining for tactile sensing. The advance is incremental, since it adapts existing audio-visual methods to a new application.

The paper tackles the lack of large-scale pretraining for non-visual modalities in robotics. It uses contact microphones as tactile sensors, which lets it leverage audio-visual pretraining to improve robotic manipulation performance, with a reported 30% increase in success rates on contact-rich tasks.

Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach leveraging large-scale multisensory pre-training for robotic manipulation. For supplementary information including videos of real robot experiments, please see https://sites.google.com/view/hearing-touch.
