RO · May 13

ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation

arXiv:2506.15953 · 38.032 citations · h-index: 8
Predicted impact: top 7% in RO · last 90 days · Originality: Highly original
AI Analysis

This work addresses the challenge of precise, long-horizon dexterous manipulation for robotic systems by integrating visuo-tactile representation learning, achieving a substantial performance leap over existing methods.

ViTacFormer fuses vision and tactile sensing via cross-attention and autoregressive tactile prediction, achieving ~50% higher success rates than prior SOTA on dexterous manipulation benchmarks, and is the first to autonomously complete long-horizon tasks with up to 11 stages and 2.5 minutes of continuous operation.

Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
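
The two components named in the abstract, cross-attention fusion and autoregressive tactile prediction, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head attention with no learned projections, mean pooling, and a fixed linear prediction head, and every name in it (`cross_attention`, `fuse`, `rollout_tactile`, `weights`) is hypothetical. It only shows how vision tokens can attend to tactile tokens and the reverse, and how each predicted tactile reading can be fed back as input for the next step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys/values."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended


def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def fuse(vision, tactile):
    """Bidirectional cross-attention with residual adds, pooled and concatenated."""
    vis_ctx = cross_attention(vision, tactile, tactile)  # vision queries touch
    tac_ctx = cross_attention(tactile, vision, vision)   # touch queries vision
    vis = [[a + b for a, b in zip(t, c)] for t, c in zip(vision, vis_ctx)]
    tac = [[a + b for a, b in zip(t, c)] for t, c in zip(tactile, tac_ctx)]
    return mean_pool(vis) + mean_pool(tac)  # list concatenation: length 2 * dim


def rollout_tactile(vision, tactile, weights, horizon):
    """Autoregressive rollout: each predicted reading is appended to the history."""
    history = list(tactile)  # copy so the caller's list is not mutated
    predictions = []
    for _ in range(horizon):
        latent = fuse(vision, history)
        # Hypothetical linear head (stand-in for a learned predictor).
        nxt = [sum(w * x for w, x in zip(row, latent)) for row in weights]
        predictions.append(nxt)
        history.append(nxt)
    return predictions


# Toy inputs: three 2-D vision tokens, two 2-D tactile tokens.
vision = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tactile = [[0.2, 0.1], [0.4, 0.3]]
weights = [[0.25, 0.0, 0.5, 0.0], [0.0, 0.25, 0.0, 0.5]]  # maps 4-D latent to 2-D touch
future = rollout_tactile(vision, tactile, weights, horizon=3)
print(future)
```

In a trained system the attention would use learned query, key, and value projections, and the prediction head would be trained on recorded contact signals; the fixed `weights` matrix here stands in for both.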
