AI | CV | Apr 17, 2025

Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning

arXiv:2504.12680v1 | 32 citations | h-index: 28 | MM
Originality: Incremental advance
AI Analysis

This paper addresses embodied spatial reasoning for AI systems with a computationally efficient approach. The advance is incremental, since it mainly combines existing methods.

The paper tackles the problem of enabling pretrained models to acquire high-level spatial reasoning from sequential visual observations. It introduces Embodied-R, a collaborative framework that pairs Vision-Language Models for perception with Language Models for reasoning and uses Reinforcement Learning for training. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art models like OpenAI-o1 and Gemini-2.5-pro on both in-distribution and out-of-distribution embodied spatial reasoning tasks.

Humans can perceive and reason about spatial relationships from sequential visual observations, such as egocentric video streams. However, how pretrained models acquire such abilities, especially high-level reasoning, remains unclear. This paper introduces Embodied-R, a collaborative framework combining large-scale Vision-Language Models (VLMs) for perception and small-scale Language Models (LMs) for reasoning. Using Reinforcement Learning (RL) with a novel reward system considering think-answer logical consistency, the model achieves slow-thinking capabilities with limited computational resources. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models (OpenAI-o1, Gemini-2.5-pro) on both in-distribution and out-of-distribution embodied spatial reasoning tasks. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration. We further explore research questions including response length, training on VLM, strategies for reward design, and differences in model generalization after SFT (Supervised Fine-Tuning) and RL training.
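The abstract mentions a reward that considers think-answer logical consistency but does not spell out its form. As a rough illustration only, the sketch below shows one hypothetical way such a reward could be composed: a format term, an accuracy term, and a consistency bonus granted when an answer re-derived from the thinking text alone agrees with the final answer. The `<think>`/`<answer>` template, the function names, the weights, and the `answer_from_thinking` callback are all assumptions made for this example and are not taken from the paper.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed response template (hypothetical): <think>...</think><answer>...</answer>
_TEMPLATE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def parse_response(response: str) -> Optional[Tuple[str, str]]:
    """Split a response into (thinking, answer); return None if malformed."""
    match = _TEMPLATE.match(response)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def consistency_reward(
    response: str,
    gold_answer: str,
    answer_from_thinking: Callable[[str], str],
    w_format: float = 0.2,    # illustrative weight, not from the paper
    w_accuracy: float = 1.0,  # illustrative weight, not from the paper
    w_consistency: float = 0.5,  # illustrative weight, not from the paper
) -> float:
    """Toy reward: format + accuracy + think-answer consistency.

    `answer_from_thinking` stands in for any judge (for example a frozen
    reference model) that reads only the thinking text and returns the
    answer that reasoning supports.
    """
    parsed = parse_response(response)
    if parsed is None:
        return 0.0  # malformed output earns nothing
    thinking, answer = parsed

    total = w_format
    if answer == gold_answer:
        total += w_accuracy
        # Bonus only if the reasoning itself leads to the same answer,
        # which discourages a correct answer paired with unrelated thinking.
        if answer_from_thinking(thinking) == answer:
            total += w_consistency
    return total
```

With a stub judge such as `lambda thinking: "A"`, a well-formed, correct, and consistent response receives all three terms, while a correct answer whose reasoning points elsewhere receives only the format and accuracy terms.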

Code Implementations: 1 repo
