CV | RO | Mar 14, 2024

Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset

arXiv:2403.09813v3 | 9 citations | Poster Volume Ⅰ, The 2024 Twentieth International Conference on Intelligent Computing, August 5-8, 2024, Tianjin, China
Originality: Synthesis-oriented
AI Analysis

This work addresses the limited exploration of language in touch-based multimodal perception for robotics and AI, though it appears incremental, as it extends existing datasets and methods.

The authors address the lack of language integration in multimodal touch research by constructing the TLV dataset with sentence-level descriptions. They then use it to fine-tune their STLV-Align framework, which achieves effective semantic alignment while adjusting only 1% of parameters.

Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, multimodal research related to touch focuses primarily on the visual and tactile modalities, with limited exploration of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) through human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
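To make the "1%" figure concrete, the following is a minimal sketch of how the share of adjusted parameters might be computed when most of a model is frozen and only a small module is trained. The abstract does not describe the architecture of STLV-Align, so the layer names, parameter counts, and the `trainable_fraction` helper are all hypothetical and chosen only for illustration.

```python
# Illustrative only: layer names and parameter counts are assumptions,
# not values reported for STLV-Align.

def trainable_fraction(layers):
    """Return the share of parameters that are marked trainable.

    `layers` maps a layer name to a (parameter_count, is_trainable) pair.
    """
    total = sum(count for count, _ in layers.values())
    if total == 0:
        raise ValueError("model has no parameters")
    trainable = sum(
        count for count, is_trainable in layers.values() if is_trainable
    )
    return trainable / total


# Hypothetical setup: two frozen encoders plus one small trainable module.
example_layers = {
    "touch_encoder": (85_000_000, False),
    "text_encoder": (63_000_000, False),
    "alignment_module": (1_500_000, True),
}

fraction = trainable_fraction(example_layers)
print(f"trainable share: {fraction:.2%}")
```

Under these assumed sizes, the trainable module accounts for roughly one percent of all parameters, which is the scale of adjustment the abstract describes.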

