TacVLA: Contact-Aware Tactile Fusion for Robust Vision-Language-Action Manipulation
This work addresses robustness issues in robotic manipulation for tasks with visual limitations, offering incremental improvements to existing VLA models.
The paper addresses the weak performance of Vision-Language-Action (VLA) models on robotic manipulation tasks that involve visual occlusion, fine-grained manipulation, and physical contact. It proposes TacVLA, a fine-tuned VLA model that incorporates tactile modalities through a contact-aware gating mechanism, and reports average success-rate gains of 20% in disassembly and 60% in in-box picking, along with a 2.1x improvement in visually occluded scenarios.
Vision-Language-Action (VLA) models have demonstrated significant advantages in robotic manipulation. However, their reliance on vision and language often leads to suboptimal performance in tasks involving visual occlusion, fine-grained manipulation, and physical contact. To address these challenges, we propose TacVLA, a fine-tuned VLA model that incorporates tactile modalities into the transformer-based policy to enhance fine-grained manipulation capabilities. Specifically, we introduce a contact-aware gating mechanism that activates tactile tokens only when contact is detected, enabling adaptive multimodal fusion while avoiding irrelevant tactile interference. The fused visual, language, and tactile tokens are jointly processed within the transformer architecture to strengthen cross-modal grounding during contact-rich interaction. Extensive experiments on constraint-locked disassembly, in-box picking, and robustness evaluations demonstrate that our model outperforms baselines, improving the average success rate by 20% in disassembly and 60% in in-box picking, and achieving a 2.1x improvement in scenarios with visual occlusion. Videos are available at https://sites.google.com/view/tacvla and code will be released.
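The contact-aware gating idea can be illustrated with a minimal sketch. The abstract does not say how contact is detected or how tactile tokens are suppressed, so everything below is an assumption made for illustration: the threshold-based detector `detect_contact`, the helper `fuse_tokens`, the threshold value, and the choice to gate through an attention mask rather than by zeroing or dropping tokens. None of it is taken from the TacVLA implementation.

```python
from typing import List, Tuple

Token = List[float]  # toy stand-in for an embedding vector


def detect_contact(pressure: List[float], threshold: float = 0.5) -> bool:
    """Hypothetical detector: contact if any sensor reading exceeds the threshold."""
    return max(pressure, default=0.0) > threshold


def fuse_tokens(
    visual: List[Token],
    language: List[Token],
    tactile: List[Token],
    in_contact: bool,
) -> Tuple[List[Token], List[bool]]:
    """Concatenate all modality tokens and build a per-token attention mask.

    Visual and language tokens are always attendable (True). Tactile tokens
    are attendable only while contact is detected (assumed gating scheme).
    """
    tokens = visual + language + tactile
    mask = [True] * (len(visual) + len(language)) + [in_contact] * len(tactile)
    return tokens, mask


# Toy inputs: two visual tokens, one language token, two tactile tokens.
visual = [[0.1, 0.2], [0.3, 0.4]]
language = [[0.5, 0.6]]
tactile = [[0.7, 0.8], [0.9, 1.0]]

# Free-space motion: readings stay below the threshold, tactile tokens are masked.
_, free_mask = fuse_tokens(visual, language, tactile, detect_contact([0.0, 0.1]))

# Contact: one reading exceeds the threshold, tactile tokens become visible.
_, contact_mask = fuse_tokens(visual, language, tactile, detect_contact([0.0, 0.9]))

print(free_mask)
print(contact_mask)
```

Gating through a mask keeps the sequence length fixed regardless of contact state, which is convenient for batching. In a transformer library the mask would be passed to the attention layers, and some libraries use the opposite convention, where True marks positions to ignore.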