AI CL CVFeb 13, 2025

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang

arXiv:2502.09560v343.7179 citationsh-index: 16Has CodeICML

Originality Synthesis-oriented

AI Analysis

This work addresses the problem of evaluating MLLM-based embodied agents for researchers and developers, though it is incremental as it focuses on benchmarking rather than proposing new methods.

The authors tackled the lack of comprehensive evaluation frameworks for Multi-modal Large Language Models (MLLMs) in vision-driven embodied agents by introducing EmbodiedBench, a benchmark with 1,128 tasks across four environments, and found that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model scoring only 28.9% on average.

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9\% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at https://embodiedbench.github.io.

View on arXiv PDF

Similar