RO · May 25

Extending Embodied Question Answering from Perception to Decision

arXiv:2605.25813
74.8
AI Analysis

For embodied AI researchers, this paper provides a unified large-scale benchmark for evaluating perception, reasoning, and decision-making, addressing the fragmentation of existing datasets.

The authors present EQA-Decision, a large-scale embodied QA dataset with over 4 million question-answer pairs covering four reasoning dimensions, along with RoboDecision, a baseline model. According to the authors, the dataset both benchmarks and improves VLM performance in spatial and interaction reasoning.

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.
