Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence
This work addresses the gap in evaluating LLMs on embodied AI tasks and highlights their limitations in multi-step spatial planning and social understanding. The contribution is incremental in that it builds on existing hierarchical robotic architectures.
The paper evaluates robots controlled by large language models (LLMs) for practical intelligence in the physical world and finds that humans significantly outperform LLMs: the best LLMs score 40%, compared to a mean human score of 95%.
We present Butter-Bench, a benchmark evaluating large language model (LLM) controlled robots for practical intelligence, defined as the ability to navigate the messiness of the physical world. Current state-of-the-art robotic systems use a hierarchical architecture with LLMs in charge of high-level reasoning, and a Vision Language Action (VLA) model for low-level control. Butter-Bench evaluates the LLM part in isolation from the VLA. Although LLMs have repeatedly surpassed humans in evaluations requiring analytical intelligence, we find humans still outperform LLMs on Butter-Bench. The best LLMs score 40% on Butter-Bench, while the mean human score is 95%. LLMs struggled the most with multi-step spatial planning and social understanding. We also evaluate LLMs that are fine-tuned for embodied reasoning and conclude that this training does not improve their score on Butter-Bench.
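The hierarchical split described in the abstract, where a high-level planner is scored separately from low-level control, can be illustrated with a toy sketch. This is not the Butter-Bench harness. Every name below (Action, RobotState, scripted_controller, run_episode, completion_rate, and the two planners) is hypothetical, the 1-D world is a stand-in for a real environment, and the scripted controller only assumes that low-level execution is reliable so that failures can be attributed to the planner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical high-level command; not the benchmark's actual interface."""
    name: str              # "drive" or "done"
    argument: float = 0.0  # signed distance in metres for "drive"


@dataclass
class RobotState:
    position: float = 0.0  # toy 1-D position, a stand-in for a real pose
    steps: int = 0


# A planner maps (current state, goal) to the next high-level action.
Planner = Callable[[RobotState, float], Action]


def scripted_controller(state: RobotState, action: Action) -> RobotState:
    """Stand-in for low-level control: executes each command exactly (assumption)."""
    if action.name == "drive":
        return RobotState(state.position + action.argument, state.steps + 1)
    return RobotState(state.position, state.steps + 1)


def run_episode(planner: Planner, goal: float,
                max_steps: int = 10, tolerance: float = 0.1) -> bool:
    """Run one task; success means ending within `tolerance` of the goal."""
    state = RobotState()
    for _ in range(max_steps):
        action = planner(state, goal)
        if action.name == "done":
            break
        state = scripted_controller(state, action)
    return abs(state.position - goal) <= tolerance


def completion_rate(planner: Planner, goals: List[float]) -> float:
    """Fraction of tasks the planner completes."""
    results = [run_episode(planner, goal) for goal in goals]
    return sum(results) / len(results)


def greedy_planner(state: RobotState, goal: float) -> Action:
    """Toy planner: drive at most 1 m per step toward the goal."""
    remaining = goal - state.position
    if abs(remaining) <= 0.1:
        return Action("done")
    return Action("drive", max(-1.0, min(1.0, remaining)))


def lazy_planner(state: RobotState, goal: float) -> Action:
    """Toy planner that gives up immediately."""
    return Action("done")


if __name__ == "__main__":
    goals = [0.5, 2.0, -3.0]
    print("greedy:", completion_rate(greedy_planner, goals))
    print("lazy:", completion_rate(lazy_planner, goals))
```

Because the controller is fixed and deterministic, any difference in completion rate between two planners comes from the planning layer alone, which is the point of evaluating the high-level component in isolation. In a real setup the planner function would wrap a call to an LLM; that call is deliberately left out here.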