CV RO · Sep 23, 2024

ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models

Sombit Dey, Jan-Nico Zaech, Nikolay Nikolov, Luc Van Gool, Danda Pani Paudel

arXiv:2409.15250v3 · 16.826 citations · h-index: 30

Originality: Incremental advance

AI Analysis

This work addresses visual generalization issues in robotic foundation models, which is an incremental improvement for robotics applications.

The paper tackles visual domain limitations in robotic foundation models. It shows that existing models lack robustness to visual out-of-domain (OOD) scenarios, potentially because of catastrophic forgetting, and proposes a gradual backbone reversal method that improves over OpenVLA by 77% and 66% on grasping and lifting in visual OOD tasks.

Recent progress in large language models and access to large-scale robotic datasets have sparked a paradigm shift in robotics models, transforming them into generalists able to adapt to various tasks, scenes, and robot modalities. A large step for the community is the release of open Vision Language Action models, which showcase strong performance in a wide variety of tasks. In this work, we study the visual generalization capabilities of three existing robotic foundation models and propose a corresponding evaluation framework. Our study shows that the existing models do not exhibit robustness to visual out-of-domain (OOD) scenarios. This is potentially caused by limited variations in the training data and/or catastrophic forgetting, leading to domain limitations in the vision foundation models. We further explore OpenVLA, which uses two pre-trained vision foundation models and is, therefore, expected to generalize to out-of-domain experiments. However, we showcase catastrophic forgetting by DINO-v2 in OpenVLA through its failure to fulfill the task of depth regression. To overcome this visual catastrophic forgetting, we propose a gradual backbone reversal approach founded on model merging. This enables OpenVLA, which requires the adaptation of the visual backbones during initial training, to regain its visual generalization ability. Regaining this capability enables our ReVLA model to improve over OpenVLA by 77% and 66% for grasping and lifting in visual OOD tasks. Comprehensive evaluations, episode rollouts, and model weights are available on the ReVLA Page.
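The abstract does not spell out the merging rule or the schedule behind "gradual backbone reversal", so the following is only a minimal sketch under stated assumptions: weights are merged by plain linear interpolation, the interpolation coefficient ramps linearly over training, and backbones are represented as dictionaries of flat float lists. The names `merge_weights` and `reversal_schedule` are hypothetical and are not taken from the ReVLA code.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(adapted: Weights, pretrained: Weights, alpha: float) -> Weights:
    """Interpolate parameter-wise: (1 - alpha) * adapted + alpha * pretrained.

    Assumption: linear interpolation stands in for the merging rule;
    the abstract only says the approach is founded on model merging.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if adapted.keys() != pretrained.keys():
        raise ValueError("both backbones must have the same parameter names")
    merged: Weights = {}
    for name, a_vals in adapted.items():
        p_vals = pretrained[name]
        if len(a_vals) != len(p_vals):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - alpha) * a + alpha * p for a, p in zip(a_vals, p_vals)]
    return merged


def reversal_schedule(step: int, total_steps: int) -> float:
    """Hypothetical linear ramp: 0.0 at step 0, 1.0 at total_steps and beyond."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


# Toy backbones with a single two-element parameter each.
adapted = {"w": [1.0, 3.0]}      # backbone after robot fine-tuning
pretrained = {"w": [0.0, 1.0]}   # original vision foundation model

halfway = merge_weights(adapted, pretrained, reversal_schedule(5, 10))
fully_reverted = merge_weights(adapted, pretrained, reversal_schedule(10, 10))
```

In this toy setting the backbone starts at its adapted weights and ends exactly at the pre-trained weights. In an actual training loop the rest of the policy would presumably keep training while the backbone is pulled back, a detail the abstract leaves open.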
