RO | AI | LG | Sep 27, 2025

Multi-Modal Manipulation via Multi-Modal Policy Consensus

arXiv:2509.23468v2 | 9 citations | h-index: 10
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of effectively combining modalities such as vision and touch for robotic manipulation, offering an incremental improvement over existing methods.

The paper tackles the suboptimal integration of diverse sensory modalities in robotic manipulation. It factorizes the policy into specialized diffusion models, one per modality, and uses a router network to combine them adaptively. The authors report that this significantly outperforms feature-concatenation baselines on tasks such as occluded object picking and puzzle insertion.

Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines in scenarios requiring multimodal reasoning. Our policy is also robust to physical perturbations and sensor corruption. Finally, we conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities.
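The consensus step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each per-modality diffusion model outputs a noise prediction for the same action vector, that the router emits one scalar logit per modality, and that the combination is a softmax-weighted sum. The names `consensus_noise`, `noise_preds`, `router_logits`, and `available` are hypothetical, as is the renormalization over available modalities used to mimic a missing sensor.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def consensus_noise(noise_preds, router_logits, available=None):
    """Combine per-modality noise predictions using router weights.

    noise_preds:   dict mapping modality name -> array of shape (action_dim,)
    router_logits: dict mapping modality name -> scalar logit (assumed router output)
    available:     optional collection of modality names that are present;
                   weights are renormalized over these only (an assumption)
    Returns the combined prediction and the weights that were used.
    """
    names = [n for n in noise_preds if available is None or n in available]
    if not names:
        raise ValueError("at least one modality must be available")
    logits = np.array([router_logits[n] for n in names], dtype=float)
    weights = softmax(logits)
    combined = sum(
        w * np.asarray(noise_preds[n], dtype=float)
        for w, n in zip(weights, names)
    )
    return combined, dict(zip(names, weights))


# Toy inputs: two modalities predicting noise for a 2-D action.
preds = {"vision": np.array([1.0, 0.0]), "touch": np.array([0.0, 1.0])}
logits = {"vision": 0.0, "touch": 0.0}

both, w_both = consensus_noise(preds, logits)
touch_only, w_touch = consensus_noise(preds, logits, available={"touch"})
```

With equal logits the two predictions are averaged. Restricting `available` to touch gives that modality the full weight, which is one simple way to picture dropping a modality without retraining the others.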

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
