RO CV LG

Jan 25, 2024

MResT: Multi-Resolution Sensing for Real-Time Control with Vision-Language Models

Saumya Saxena, Mohit Sharma, Oliver Kroemer

arXiv:2401.14502v18.36 citations

CoRL

Originality: Incremental advance

AI Analysis

This work addresses real-time control challenges in robotics, offering a domain-specific solution for manipulation tasks with visual and geometric variations.

The paper aims to improve robotic manipulation performance by combining sensing at multiple spatial and temporal resolutions. It reports a 2X average improvement over recent multi-task baselines across three domains: coarse, precise, and dynamic manipulation tasks.

Leveraging sensing modalities across diverse spatial and temporal resolutions can improve the performance of robotic manipulation tasks. Multi-spatial resolution sensing provides hierarchical information captured at different spatial scales and enables both coarse and precise motions. Simultaneously, multi-temporal resolution sensing enables the agent to exhibit high reactivity and real-time control. In this work, we propose a framework, MResT (Multi-Resolution Transformer), for learning generalizable language-conditioned multi-task policies that utilize sensing at different spatial and temporal resolutions, using networks of varying capacities to effectively perform real-time control of precise and reactive tasks. We leverage off-the-shelf pretrained vision-language models to operate on low-frequency global features, along with small non-pretrained models to adapt to high-frequency local feedback. Through extensive experiments in 3 domains (coarse, precise, and dynamic manipulation tasks), we show that our approach significantly improves (2X on average) over recent multi-task baselines. Further, our approach generalizes well to visual and geometric variations in target objects and to varying interaction forces.
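The split between a slow global pathway and a fast local pathway can be illustrated with a small multi-rate control loop. The following is a minimal sketch, not the authors' implementation: the class name `MultiRatePolicy`, the parameter `slow_period`, and the three stand-in functions are hypothetical, and plain Python callables take the place of the pretrained vision-language model and the small non-pretrained networks. It shows only the scheduling idea, in which an expensive encoder runs every few steps and its output is cached, while a cheap encoder runs on every control step.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


class MultiRatePolicy:
    """Hypothetical sketch: fuse a cached slow feature with a per-step fast feature."""

    def __init__(
        self,
        slow_encoder: Callable[[Sequence[float]], Vector],
        fast_encoder: Callable[[Sequence[float]], Vector],
        head: Callable[[Vector], Vector],
        slow_period: int,
    ) -> None:
        if slow_period < 1:
            raise ValueError("slow_period must be >= 1")
        self.slow_encoder = slow_encoder
        self.fast_encoder = fast_encoder
        self.head = head
        self.slow_period = slow_period
        self._cached_global: Optional[Vector] = None
        self._step = 0
        self.slow_calls = 0  # counts how often the expensive encoder ran

    def act(self, global_obs: Sequence[float], local_obs: Sequence[float]) -> Vector:
        # Refresh the expensive global feature only every `slow_period` steps.
        if self._step % self.slow_period == 0:
            self._cached_global = self.slow_encoder(global_obs)
            self.slow_calls += 1
        # The cheap local feature is recomputed on every control step.
        local_feat = self.fast_encoder(local_obs)
        self._step += 1
        # Concatenate both features and map them to an action.
        return self.head(list(self._cached_global) + local_feat)


# Stand-ins (assumptions): a mean over the "global" observation plays the role
# of the large pretrained model; an identity map plays the small fast network.
def slow_encoder(obs: Sequence[float]) -> Vector:
    return [sum(obs) / len(obs)]


def fast_encoder(obs: Sequence[float]) -> Vector:
    return list(obs)


def head(features: Vector) -> Vector:
    return [sum(features)]


policy = MultiRatePolicy(slow_encoder, fast_encoder, head, slow_period=5)
# The global observation changes every step, but is only re-encoded at t=0 and t=5.
actions = [policy.act([float(t), float(t)], [0.1 * t]) for t in range(10)]
```

Between refreshes the action still tracks the local observation at every step, which is the reactivity the abstract attributes to high-frequency local feedback; the global feature stays fixed until the next slow update.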
