CV · May 26

Touch-R1: Reinforcing Touch Reasoning in MLLMs

Yingxin Lai, Yafei Zhou, Fucai Zhu, Siyu Zhu, Weihao Yuan

arXiv:2605.27154 · 84.5

Predicted impact: top 22% in CV · last 90 days
Originality: Highly original

AI Analysis

For researchers in multimodal AI and robotics, this work addresses the underexplored problem of tactile reasoning, enabling models to ground predictions in physical evidence and resolve visual-tactile conflicts.

Touch-R1 introduces a tactile reasoning MLLM trained with a tactile-grounded GRPO objective, outperforming Octopi-13B by 18.4% and GPT-4o by 24.7% on the new TouchReason-Bench benchmark.

While rule-based reinforcement learning has recently catalyzed explicit reasoning in multimodal models, tactile reasoning remains largely underexplored. Existing tactile-language models primarily rely on supervised or contrastive objectives, which limits their capacity to ground predictions in physical evidence or rectify misleading visual priors. Tactile reasoning introduces two modality-specific challenges: the ordinal nature of physical attributes (e.g., hardness, roughness) and the cross-sensor distribution shifts inherent in optical tactile hardware. In this work, we introduce TouchReason-1M, a large-scale multimodal dataset comprising over 1M synchronized tactile pairs across four distinct sensors, and TouchReason-Bench, a rigorous framework for evaluating tactile perception and visual-tactile conflict resolution. Building upon these, we propose Touch-R1, a tactile reasoning MLLM based on Qwen2.5-VL-7B. Touch-R1 is trained via a tactile-grounded GRPO objective that combines ordinal-aware accuracy, cross-sensor physical consistency, structured-format control, and an input-side tactile grounding objective. Specifically, the tactile-use reward assigns credit only when authentic tactile inputs yield superior correctness relative to counterfactual controls where the tactile stream is removed, shuffled, or noise-masked. On TouchReason-Bench, Touch-R1-7B outperforms Octopi-13B by 18.4% and GPT-4o by 24.7% on average. Its structured reasoning traces reveal emergent behaviors of probing, comparison, and revision, demonstrating that R1-style reasoning can be effectively grounded in physical contact.
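The abstract describes two of the reward components only in words, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes that ordinal-aware accuracy gives partial credit decaying linearly with the distance between predicted and true attribute levels, and that the tactile-use reward is binary and requires the real-tactile score to beat every counterfactual control. The function names, the linear decay, and the `margin` parameter are all hypothetical.

```python
from typing import Sequence


def ordinal_accuracy(pred: int, target: int, num_levels: int) -> float:
    """Partial credit for ordinal attributes such as hardness or roughness.

    Assumed form: the score falls linearly from 1.0 (exact match) to 0.0
    (maximally distant level). The paper may use a different shape.
    """
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    return 1.0 - abs(pred - target) / (num_levels - 1)


def tactile_use_reward(
    real_score: float,
    control_scores: Sequence[float],
    margin: float = 0.0,  # hypothetical slack, not taken from the paper
) -> float:
    """Credit only when real tactile input beats every counterfactual control.

    control_scores holds the correctness obtained when the tactile stream is
    removed, shuffled, or noise-masked. Requiring a win over all controls
    (rather than, say, their mean) is an assumption of this sketch.
    """
    if not control_scores:
        raise ValueError("at least one counterfactual control is required")
    beats_all = all(real_score > c + margin for c in control_scores)
    return 1.0 if beats_all else 0.0


# Example: a 5-level hardness scale with true level 3.
real = ordinal_accuracy(pred=3, target=3, num_levels=5)
controls = [
    ordinal_accuracy(pred=1, target=3, num_levels=5),  # tactile removed
    ordinal_accuracy(pred=2, target=3, num_levels=5),  # tactile shuffled
    ordinal_accuracy(pred=0, target=3, num_levels=5),  # tactile noise-masked
]
reward = tactile_use_reward(real, controls)
```

Under these assumptions, a model that answers correctly even without the tactile stream earns no tactile-use credit, which matches the stated goal of rewarding answers that actually depend on touch.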
