RO AI Oct 23, 2025

Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence

Callum Sharrock, Lukas Petersson, Hanna Petersson, Axel Backlund, Axel Wennström, Kristoffer Nordström, Elias Aronsson

arXiv:2510.21860v1, 1 citation, h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses a gap in evaluating LLMs on embodied AI tasks and highlights their limitations in multi-step spatial planning and social understanding. The contribution is incremental, since it builds on existing hierarchical robotic architectures.

The paper tackles the problem of evaluating large language model (LLM) controlled robots for practical intelligence in the physical world, finding that humans significantly outperform LLMs, with the best LLMs scoring 40% compared to a mean human score of 95%.

We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.
